Notes on OOD Generalization
Raw notes on OOD generalization.
The greatest challenge is knowing precisely what shifts are possible. Developing general or automated methods for this seems like a tough problem.
The important thing to remember is that there are many statistically equivalent decompositions of the data-generating process, even when a decomposition does not follow the causal direction. If $p(x)$ does not change, this does not matter, and the model can perform well regardless of the underlying beliefs. Further, for models that satisfy Kolmogorov consistency (e.g. Gaussian processes, but not a probabilistic interpretation of the SVM objective; see Quiñonero-Candela et al.^{1}, Section 1.4), the distribution of the covariates has no bearing on the conditional model, by virtue of marginalization. But if the true causal model is different from the one assumed, and $p(x)$ changes, then this implies an interventional change, which will implicitly cause a change in $p(y\mid x)$. Finally, considering only the probabilistic model without its decision-theoretic implications can be dangerous, e.g. models that don't account for risk.
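As a compact reference, the covariate-shift reading of this factorization: the conditional is assumed shared across train and test, and only the covariate marginal moves (a sketch of the standard assumption, not notation from any of the cited papers):

```latex
p(x, y) = p(y \mid x)\, p(x),
\qquad
p_{\mathrm{tr}}(x) \neq p_{\mathrm{te}}(x),
\qquad
p_{\mathrm{tr}}(y \mid x) = p_{\mathrm{te}}(y \mid x).
```

When the assumed factorization is not the causal one, a change in $p(x)$ corresponds to an intervention that can break the last equality.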
Dataset shift problems may come from mistakenly ignoring some features of the data. Including those features would essentially convert the dataset shift problem into a plain covariate shift problem. For instance, it appears adversarially easy to convert an auditory signal of "sixteen" into "sixty", but it would be harder to do the same with visual inputs. I think this highlights representation learning as an important ingredient of OOD robustness in itself.
Ovadia et al.^{2} find that calibration on the validation set leads to well-calibrated predictions on the test set. Further, while temperature scaling can achieve low ECE for low values of shift, the ECE increases significantly as the shift increases. They conclude that calibration on the validation set does not guarantee calibration under distributional shift. Post-hoc temperature scaling doesn't necessarily fix this, and deep ensembles consistently provide higher predictive entropy on shifted datasets.
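A minimal sketch of the two quantities in play here, temperature scaling and ECE; the function names are my own (not from the paper's code), and the binning follows the common equal-width recipe:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: T > 1 softens the predictions."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: average |accuracy - confidence| over equal-width confidence
    bins, weighted by the fraction of samples in each bin."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

Temperature scaling only rescales logits, so it changes confidence (and hence ECE) without changing the predicted classes.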
In a surprising twist, though, Miller et al.^{3} show that for a wide variety of (non-Bayesian) models, datasets, and hyperparameters, there is a strong linear correlation between in-distribution and out-of-distribution generalization performance. Hyperparameter tuning, early stopping, or changing the amount of i.i.d. training data moves the models along the trend line, but does not alter the linear fit. The linear trends are constructed via probit scaling for more precise fits. There are exceptions, though: not all distribution shifts via corruptions in CIFAR-10-C show a good linear fit (low $\mathrm{R}^2$ values), and sometimes the trend lines change, i.e. the models do not move along an existing trend line.
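The probit scaling can be sketched as follows: map both accuracies through the inverse normal CDF and fit a line in that space. `probit_fit` and `predict_ood` are illustrative names; the paper's actual fitting pipeline may differ:

```python
import numpy as np
from scipy.stats import norm

def probit_fit(id_acc, ood_acc):
    """Fit a line relating ID to OOD accuracy after mapping both
    through the probit (inverse normal CDF)."""
    x = norm.ppf(np.asarray(id_acc))
    y = norm.ppf(np.asarray(ood_acc))
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def predict_ood(id_acc, slope, intercept):
    """Map a new ID accuracy onto the fitted trend line."""
    return norm.cdf(slope * norm.ppf(id_acc) + intercept)
```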
Filos et al.^{4} make the case that the UCI datasets are overfit very easily, and as a consequence robust and scalable Bayesian inference methods have not made good progress. There is a qualitative difference between approximate inference with small and large models, which is not highlighted by the simpler benchmarks. Clearly, there is a gap.
Azulay et al.^{5} ask why neither the convolutional architecture nor data augmentation is sufficient to achieve the designed invariances. The architectures ignore the classical sampling theorem, and data augmentation does not give invariance because CNNs learn to be invariant only to transforms of images that are very similar to the typical images in the training set. It turns out that translation invariance actually does not hold for CNNs due to the subsampling steps (it holds in "literal" terms only at certain factors).
Sagawa et al.^{6} argue that while regularization may not matter much for training models longer and generalizing better "on average", it is important for good "worst-case" performance. The experiments show that departing from the vanishing-training-loss regime, via (i) very strong $\ell_2$ regularization or (ii) very early stopping, allows DRO models to significantly outperform ERM models on worst-group test accuracy while maintaining high average accuracy. The authors propose a group-adjusted DRO estimator, which effectively adds a $C/\sqrt{n_g}$ term, where $C$ is a hyperparameter and $n_g$ is the group size, intuitively motivated by the fact that smaller groups are more prone to overfitting.
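The group adjustment can be sketched as below; `group_adjusted_worst_loss` is a hypothetical helper, and the paper's actual Group DRO training is an online procedure over minibatches rather than this one-shot computation:

```python
import numpy as np

def group_adjusted_worst_loss(group_losses, group_sizes, C=1.0):
    """Inflate each group's empirical loss by C / sqrt(n_g) before taking
    the worst group, so that small groups (more prone to overfitting)
    are treated as higher-risk."""
    losses = np.asarray(group_losses, dtype=float)
    sizes = np.asarray(group_sizes, dtype=float)
    adjusted = losses + C / np.sqrt(sizes)
    return adjusted, int(np.argmax(adjusted))
```

With equal empirical losses, the adjustment makes the smaller group the "worst" one, which is the intended behavior.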
Arjovsky et al.^{7} suggest that failing to generalize out-of-distribution is failing to capture the causal factors of variation in the data, clinging instead to easier-to-fit spurious correlations, which are prone to change between training and testing domains.
Kirichenko et al.^{8} argue that normalizing flows based on affine coupling layers are biased towards learning low-level properties of the data, such as local pixel correlations, rather than semantic properties. The maximum likelihood objective has a limited influence on OOD detection relative to the inductive biases of the flow. Further, they show that the coupling layers co-adapt to make predictions. The simple fix is to change the masking strategy in the coupling layers to a cycle mask, or to add an information bottleneck to prevent coupling-layer co-adaptation. More interestingly, none of these issues arise when using pretrained embeddings for natural images, or with tabular datasets.
Gulrajani et al.^{9} perform an exhaustive set of experiments to conclude that much of the progress of the last decade is beaten by modern ERM itself. Table 2 provides a nice summary of all the learning-from-data problems we have to date, and the rest of the work aims to bring model selection to the fore. The main recommendation is that a domain generalization algorithm should be responsible for its own model selection. The authors provide a good set of datasets via their benchmark DOMAINBED. They also identify four main ways to induce invariance for generalization:
- learning invariant features,
- sharing parameters,
- meta-learning,
- performing data augmentation.
Lan et al.^{10} argue that perfect density models cannot guarantee anomaly detection, by showing that discrepancies like the curse of dimensionality and distribution mismatch exist even with perfect density models; the failure therefore goes beyond issues of estimation, approximation, or optimization errors. Two cases are highlighted: (i) one can map any distribution to a uniform distribution using an invertible transformation, but outlier detection in a uniform distribution is impossible; yet an invertible transform preserves information, and therefore outlier detection should be at least as hard in any representation other than the uniform one too. (ii) Along similar lines, there exist reparametrizations such that the density under the new representation matches an arbitrary desired score, misleading density-based methods into wrongly classifying anomalies.
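Case (i) is easy to reproduce in one dimension: the Gaussian CDF is an invertible map sending standard normal samples to Uniform(0, 1), under which every point has density one, so density thresholding can no longer flag outliers. A toy sketch, not the paper's construction:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)  # "inliers" from N(0, 1)

u = norm.cdf(x)       # invertible map: N(0, 1) samples become Uniform(0, 1)
x_back = norm.ppf(u)  # no information is lost; the originals are recovered

# Under u, every point has density 1, so thresholding the density cannot
# distinguish a typical point from an extreme one, even though the
# representation is informationally equivalent to the original.
```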
With high-fidelity approximate inference using full-batch HMC, Izmailov et al.^{11} have eventually shown that under domain shift, SGLD/SGHMC are closer to deep ensembles (in terms of total variation distance and predictive agreement) than to the HMC posterior predictive. MFVI is even farther from the HMC posterior. Data augmentation causes trouble with exact Bayesian inference, and therefore it appears reasonable to think that invariances should come through the prior, not the likelihood term.
Izmailov et al.^{12} argue that the lack of robustness under covariate shift for Bayesian neural networks is fundamentally caused by linear dependencies in the inputs. Cold posteriors provide marginal but inconsistent improvements across a set of corruptions. BNNs show the most competitive performance in-distribution and on the corruptions representing affine transformations.
The key intuition here is that along directions of low variance in the eigenspace of the inputs, the MAP solution would effectively set the corresponding weights to zero, whereas a BNN would still sample them from the prior. This makes the MAP solution more robust: any perturbation orthogonal to the hyperplane induced by the (nearly) constant projections will cause trouble for the BNN.
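A toy linear-regression sketch of this intuition, with a Gaussian prior of precision `alpha` (all names mine, not from the paper): a constant feature is unidentified by the data, the MAP/ridge solution zeroes its weight, while a Bayesian treatment leaves its posterior equal to the prior, so shifting that feature at test time perturbs the sampled predictions but not the MAP prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)   # informative direction
x2 = np.zeros(n)              # zero-variance direction in the training inputs
X = np.stack([x1, x2], axis=1)
y = 2.0 * x1 + 0.1 * rng.standard_normal(n)

# MAP with a Gaussian prior of precision alpha (ridge regression):
# the weight on the constant feature is driven to zero.
alpha = 1.0
w_map = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

# The likelihood does not constrain w2 at all, so a Bayesian treatment
# leaves its posterior equal to the prior N(0, 1/alpha).
w2_samples = rng.standard_normal(2000) / np.sqrt(alpha)

# Covariate shift along the unidentified direction (x2: 0 -> 5):
shift = 5.0
map_effect = w_map[1] * shift     # stays at zero
bnn_effects = w2_samples * shift  # spread out by the prior
```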
This reminds us of the case of the Bayesian Lasso. The Laplace prior does not have enough mass around zero to be meaningfully sparse in practice under a full Bayesian treatment. The MAP estimate is then in fact qualitatively very different.
Bommasani et al.^{13} discuss the role of foundation models in creating models robust to distribution shifts, as existing work has shown that pretraining on unlabeled data is an effective and general-purpose way to improve performance on OOD test distributions.
Ming et al.^{14} aim to formalize the problem of OOD detection from the perspective of spurious (test samples that contain only the non-invariant features, e.g. the environment) and non-spurious (test samples that contain no relevant features, e.g. outside the classes altogether) data. The work provides evidence that spurious correlations hurt OOD detection. The theoretical results tell a mixed story, though:
The learned invariant classifier does not necessarily depend only on invariant features; it is possible to learn an invariant classifier that relies on the environmental features while achieving lower risk than the optimal invariant classifier.
MEMO^{15} introduces a plug-and-play method for test-time robustification of models via marginal entropy minimization with one test point.
Optimizing the model for more confident predictions can be justified by the assumption that the true underlying decision boundaries between classes lie in low-density regions of the data space.
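The objective on a single test point can be sketched as the entropy of the prediction marginalized over augmentations; illustrative NumPy, not the authors' implementation, which backpropagates this quantity into the model parameters:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def marginal_entropy(aug_logits):
    """Entropy of the marginal predictive distribution, averaged over
    augmented copies of a single test input; the model is adapted to
    make this small (confident and consistent across augmentations)."""
    probs = softmax(np.asarray(aug_logits, dtype=float))  # (n_augs, n_classes)
    marginal = probs.mean(axis=0)
    return float(-np.sum(marginal * np.log(marginal + 1e-12)))
```

Confident, mutually consistent predictions across augmentations give low marginal entropy; conflicting predictions push the marginal towards uniform and the entropy up.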
Thoughts
- Does BMA avoid/solve any of these problems: (i) spurious correlations, (ii) extrapolation, or (iii) temporal shift?
- Are approximate likelihood maximization and predictive performance fundamentally at odds?
- Conventional wisdom is that when using a proper scoring rule (like the likelihood), the optimum of the score corresponds to the perfect prediction. But in practice, this isn't working. See the scoring and calibration discussion of Ovadia et al.^{2}
- Improving fits to HMC doesn't necessarily improve predictive performance. Does this mean the goals should be different?
- Lan et al.^{10} remind us of the debate around flat-vs-sharp minima. There exist sharp minima which generalize poorly. But SGD appears to be biased towards finding flat minima, partly because they occupy so much more volume that it is statistically unlikely to arrive at a sharp minimum. Similarly, even though there exist invertible transformations which map any distribution to a uniform distribution, is finding such a transformation likely under the inductive biases posed by contemporary NN training routines?
- What is the role of pretraining towards OOD shift?^{16} Methodological contributions have fallen short, but simple pretraining just works?^{17}
- Isn't Sagawa et al.^{6} essentially hinting towards a case for being a little more Bayesian? In particular, they highlight that training to zero loss makes worst-case group accuracy terrible. This is like overcommitting to a single parameter configuration. In addition, it is easier to overfit to smaller groups. Again we have a small-sample problem, and quantifying epistemic uncertainty should be important for the predictive distribution.
- In the spirit of modern work showing that multimodal datasets can help generalization, can we use image segmentation maps to avoid spurious correlations that rely on non-invariant environmental features like the background?
- An Occam's razor view of the problem of domain generalization?
- Data augmentation appears to be clearly doing something meaningful when it works. What would automated data augmentation pipelines look like? How do we learn from our objectives what data augmentation is needed? It would also appear, though, that existing data augmentation techniques provide only a very weak invariance.^{5}
Practical Bits
- Reweight unbalanced classes by their relative frequencies in each minibatch.^{18}
- Pretrained image embeddings provide better OOD detection than normalizing flows trained on the raw data.^{8}
- Lan et al.^{10} have a nice visualization comparing density-based versus typicality-based outlier detection.
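The minibatch reweighting trick in the first bullet can be sketched as follows; `minibatch_class_weights` is a hypothetical helper, and the normalization by `n_classes` makes a balanced batch receive unit weights:

```python
import numpy as np

def minibatch_class_weights(labels, n_classes):
    """Per-example weights inversely proportional to each class's relative
    frequency within the minibatch; a balanced batch gets unit weights."""
    labels = np.asarray(labels)
    freqs = np.bincount(labels, minlength=n_classes) / len(labels)
    return 1.0 / (n_classes * freqs[labels])
```

Multiplying per-example losses by these weights makes every class present in the batch contribute equally to the total loss.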
Benchmarks
- WILDS^{19}
- DOMAINBED^{9}
- BREEDS^{20}
- Diabetic Retinopathy^{4}
- The Fishyscapes Benchmark^{21}
- Galaxy Zoo^{22}
Structure & Taxonomy
Possible dataset shifts (see the graphical models in Quiñonero-Candela et al.^{1}):
- Simple Covariate Shift (change in input covariates $p(x)$)
- Prior Probability Shift (change in target $p(y)$)
- Sample Selection Bias (the selection process of a sample depends on the target $y$)
- Imbalanced Data
- Domain Shift
Footnotes

1. Quiñonero-Candela, Joaquin et al. "Dataset Shift in Machine Learning." MIT Press (2009). https://mitpress.mit.edu/9780262545877/dataset-shift-in-machine-learning/
2. Ovadia, Yaniv et al. "Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift." ArXiv abs/1906.02530 (2019). https://arxiv.org/abs/1906.02530
3. Miller, John et al. "Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization." ArXiv abs/2107.04649 (2021). https://arxiv.org/abs/2107.04649
4. Filos, Angelos et al. "A Systematic Comparison of Bayesian Deep Learning Robustness in Diabetic Retinopathy Tasks." ArXiv abs/1912.10481 (2019). https://arxiv.org/abs/1912.10481
5. Azulay, Aharon and Yair Weiss. "Why do deep convolutional networks generalize so poorly to small image transformations?" J. Mach. Learn. Res. 20 (2018): 184:1-184:25. https://arxiv.org/abs/1805.12177
6. Sagawa, Shiori et al. "Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization." ArXiv abs/1911.08731 (2019). https://arxiv.org/abs/1911.08731
7. Arjovsky, Martín et al. "Invariant Risk Minimization." ArXiv abs/1907.02893 (2019). https://arxiv.org/abs/1907.02893
8. Kirichenko, P. et al. "Why Normalizing Flows Fail to Detect Out-of-Distribution Data." ArXiv abs/2006.08545 (2020). https://arxiv.org/abs/2006.08545
9. Gulrajani, Ishaan and David Lopez-Paz. "In Search of Lost Domain Generalization." ArXiv abs/2007.01434 (2020). https://arxiv.org/abs/2007.01434
10. Lan, Charline Le and Laurent Dinh. "Perfect Density Models Cannot Guarantee Anomaly Detection." Entropy 23 (2020). https://pubmed.ncbi.nlm.nih.gov/34945996/
11. Izmailov, Pavel et al. "What Are Bayesian Neural Network Posteriors Really Like?" International Conference on Machine Learning (2021). https://arxiv.org/abs/2104.14421
12. Izmailov, Pavel et al. "Dangers of Bayesian Model Averaging under Covariate Shift." Neural Information Processing Systems (2021). https://arxiv.org/abs/2106.11905
13. Bommasani, Rishi et al. "On the Opportunities and Risks of Foundation Models." ArXiv abs/2108.07258 (2021). https://arxiv.org/abs/2108.07258
14. Ming, Yifei et al. "On the Impact of Spurious Correlation for Out-of-Distribution Detection." ArXiv abs/2109.05642 (2021). https://arxiv.org/abs/2109.05642
15. Zhang, Marvin et al. "MEMO: Test Time Robustness via Adaptation and Augmentation." ArXiv abs/2110.09506 (2021). https://arxiv.org/abs/2110.09506
16. Hendrycks, Dan et al. "Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty." ArXiv abs/1906.12340 (2019). https://arxiv.org/abs/1906.12340
17. Hendrycks, Dan et al. "Pretrained Transformers Improve Out-of-Distribution Robustness." ArXiv abs/2004.06100 (2020). https://aclanthology.org/2020.acl-main.244.pdf
18. Leibig, Christian et al. "Leveraging uncertainty information from deep neural networks for disease detection." Scientific Reports 7 (2016). https://www.nature.com/articles/s41598-017-17876-z
19. Koh, Pang Wei et al. "WILDS: A Benchmark of in-the-Wild Distribution Shifts." International Conference on Machine Learning (2020). https://arxiv.org/abs/2012.07421
20. Santurkar, Shibani et al. "BREEDS: Benchmarks for Subpopulation Shift." ArXiv abs/2008.04859 (2020). https://arxiv.org/abs/2008.04859
21. Blum, Hermann et al. "The Fishyscapes Benchmark: Measuring Blind Spots in Semantic Segmentation." International Journal of Computer Vision 129 (2019): 3119-3135. https://link.springer.com/article/10.1007/s11263-021-01511-6
22. Walmsley, Mike et al. "Galaxy Zoo: Probabilistic Morphology through Bayesian CNNs and Active Learning." ArXiv abs/1905.07424 (2019). https://arxiv.org/abs/1905.07424