Notes on OOD Generalization

Raw notes on OOD generalization.
Aug 4, 2022. Last updated: May 27, 2023

The greatest challenge is knowing precisely what shifts are possible. It seems a tough problem to develop generalized or automated methods for.

The important aspect to remember is that there are many statistically equivalent decompositions of the data-generating process, even when the factorization does not follow the causal direction. If $p(x)$ does not change, the choice does not matter, and the model can perform well regardless of the underlying beliefs. Further, for models that satisfy Kolmogorov consistency (e.g. Gaussian processes, but not a probabilistic interpretation of the SVM objective; see Quiñonero-Candela et al. [1], Section 1.4), the distribution of the covariates has no bearing on the conditional model by virtue of marginalization. But if the true causal model is different from the assumed one and $p(x)$ changes, then this implies an interventional change, which will implicitly cause a change in $p(y \mid x)$. Finally, considering only the probabilistic model without its decision-theoretic implications can be dangerous, e.g. models that don't factor in risk.

Dataset shift problems may come from mistakenly ignoring some features of the data. Including those features would essentially reduce the dataset shift problem to a plain covariate shift problem. For instance, it appears adversarially easy to convert an auditory signal of "sixteen" to "sixty", but it would be harder to convert the visual signal the same way. I think this highlights representation learning as an important ingredient of OOD robustness in itself.

Ovadia et al. [2] find that calibration on the validation set leads to well-calibrated predictions on the in-distribution test set. However, while temperature scaling can achieve low expected calibration error (ECE) for low values of shift, the ECE increases significantly as the shift increases. They conclude that calibration on the validation set does not guarantee calibration under distributional shift. Post-hoc temperature scaling doesn't necessarily fix this, and deep ensembles consistently provide higher predictive entropy on shifted datasets.
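
To make the two quantities concrete, here is a minimal sketch of ECE and post-hoc temperature scaling in plain Python. The function names, the equal-width binning, and the grid search over temperatures are my own assumptions for illustration (the paper's exact evaluation code may differ), and the logits are made-up toy values.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of one logit vector after dividing the logits by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin-size-weighted average of |accuracy - confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(pred == y)))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += len(b) / len(labels) * abs(acc - avg_conf)
    return ece

def nll(logits, labels, temperature):
    """Average negative log-likelihood of the labels at a given temperature."""
    return -sum(math.log(softmax(z, temperature)[y])
                for z, y in zip(logits, labels)) / len(labels)

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the temperature that minimizes validation NLL (grid search, an assumed simplification)."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))

# Toy, overconfident validation set: same confident logits, but only 3 of 4 are correct.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
ece_before = expected_calibration_error([softmax(z) for z in val_logits], val_labels)
ece_after = expected_calibration_error([softmax(z, T) for z in val_logits], val_labels)
print(f"T={T}, ECE before={ece_before:.3f}, after={ece_after:.3f}")
```

The temperature is fit on validation data only, so nothing in this procedure sees the shifted distribution, which is consistent with the observation that the calibration it buys does not carry over as the shift grows.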

In a surprising twist though, Miller et al. [3] show that for a wide variety of (non-Bayesian) models, datasets, and hyperparameters, there is a strong linear correlation between in-distribution and out-of-distribution generalization performance. Hyperparameter tuning, early stopping, or changing the amount of i.i.d. training data moves the models along the trend line, but does not alter the linear fit. The linear trends are constructed via probit scaling for more precise fits. There are exceptions, though. Not all distribution shifts via corruptions on CIFAR-10-C show a good linear fit (low $R^2$ values). Sometimes the trend lines change, i.e. the models do not move along an existing trend line.
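
A rough sketch of how such a trend line can be fit: apply the probit transform (the inverse standard-normal CDF) to both accuracies and run ordinary least squares on the transformed values. The helper names and the accuracy numbers below are made up for illustration and are not taken from the paper.

```python
from statistics import NormalDist

def probit(acc):
    """Inverse standard-normal CDF applied to an accuracy in (0, 1)."""
    return NormalDist().inv_cdf(acc)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus the R^2 of the fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2

def probit_trend(id_acc, ood_acc):
    """Fit the linear trend between probit-scaled ID and OOD accuracies."""
    return fit_line([probit(a) for a in id_acc], [probit(a) for a in ood_acc])

# Hypothetical accuracies for five models (illustrative values only).
id_acc = [0.80, 0.85, 0.90, 0.93, 0.95]
ood_acc = [0.55, 0.62, 0.71, 0.77, 0.82]
slope, intercept, r2 = probit_trend(id_acc, ood_acc)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```

A low $R^2$ from this fit is what flags the exceptions, and a new model family landing far from the fitted line is what a changed trend line looks like.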

Filos et al. [4] make the case that it is very easy to overfit UCI datasets, and as a consequence robust and scalable Bayesian inference methods have not made good progress. There is a qualitative difference between approximate inference with small and large models, which is not highlighted by the simpler benchmarks. Clearly, there is a gap.

Azulay and Weiss [5] ask why neither the convolutional architecture nor data augmentation is sufficient to achieve the designed invariances. Architectures ignore the classic sampling theorem, and data augmentation does not give invariance because CNNs learn to be invariant to transformations only for images that are very similar to typical images from the training set. Translation invariance does not actually hold for CNNs because of the subsampling steps (it holds in "literal" terms only for shifts at certain factors, i.e. multiples of the subsampling factor).
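
The subsampling argument is easy to check in one dimension. The sketch below (my own toy setup, not code from the paper) convolves an impulse with a small kernel and keeps every second output. Shifting the input by the stride shifts the output by one sample, but shifting by a single sample produces an output that is not a shifted copy at all.

```python
def conv1d(signal, kernel):
    """'Valid' 1-D cross-correlation of a signal with a kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def conv_subsample(signal, kernel, stride):
    """Convolution followed by keeping every `stride`-th output (a strided conv layer)."""
    return conv1d(signal, kernel)[::stride]

def shift_right(signal, s):
    """Shift by s samples, zero-padding on the left and dropping the tail."""
    return [0] * s + signal[:len(signal) - s]

x = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # a single impulse
kernel = [1, 2, 1]
stride = 2

base = conv_subsample(x, kernel, stride)
by_one = conv_subsample(shift_right(x, 1), kernel, stride)
by_two = conv_subsample(shift_right(x, 2), kernel, stride)

print("no shift     :", base)
print("shift by 1   :", by_one)   # different values, not a translate of `base`
print("shift by 2   :", by_two)   # `base` shifted by one output sample
```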

Sagawa et al. [6] argue that regularization may not be too important for average performance, in that models can be trained longer and generalize better "on average", but for good "worst-case" performance, regularization is important. The experiments show that departing from the vanishing-training-loss regime through (i) very strong $\ell_2$ regularization and (ii) very early stopping allows DRO models to significantly outperform ERM models on worst-group test accuracy while maintaining high average accuracy. The authors propose a group-adjusted DRO estimator, which effectively adds a $C/\sqrt{n_g}$ term to each group's loss, where $C$ is a hyperparameter and $n_g$ is the group size, intuitively motivated by the fact that smaller groups are more prone to overfitting.
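
As a sketch of the objective only (the paper optimizes it with an online algorithm that is not reproduced here), the group-adjusted worst-case loss takes the maximum over groups of the mean group loss plus $C/\sqrt{n_g}$. The function name and the toy numbers are assumptions for illustration.

```python
import math
from collections import defaultdict

def group_adjusted_worst_loss(losses, groups, C=0.0):
    """Return (worst group, value) of max_g [ mean loss in g + C / sqrt(n_g) ]."""
    by_group = defaultdict(list)
    for loss, g in zip(losses, groups):
        by_group[g].append(loss)
    adjusted = {
        g: sum(v) / len(v) + C / math.sqrt(len(v))  # smaller groups get a larger adjustment
        for g, v in by_group.items()
    }
    worst = max(adjusted, key=adjusted.get)
    return worst, adjusted[worst]

# Toy batch: a large group with slightly higher loss and a tiny group with lower loss.
losses = [0.2] * 100 + [0.15] * 4
groups = ["major"] * 100 + ["minor"] * 4

print(group_adjusted_worst_loss(losses, groups, C=0.0))  # plain worst-group loss
print(group_adjusted_worst_loss(losses, groups, C=1.0))  # adjustment favors the small group
```

With `C=0` this is the plain worst-group loss. A positive `C` shifts attention towards small groups even when their training loss looks fine, which matches the overfitting intuition.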

Arjovsky et al. [7] suggest that failing to generalize out-of-distribution is failing to capture the causal factors of variation in data, clinging instead to easier-to-fit spurious correlations, which are prone to change from training to testing domains.

Kirichenko et al. [8] argue that normalizing flows based on affine coupling layers are biased towards learning the low-level properties of the data, such as local pixel correlations, rather than semantic properties of the data. The maximum likelihood objective has a limited influence on OOD detection, relative to the inductive biases of the flow. Further, they show that the coupling layers co-adapt to make predictions. The simple fix for this issue is to change the masking strategy in the coupling layers to a cycle mask, or to add an information bottleneck to prevent coupling layer co-adaptation. More interestingly, none of these issues arise when using pretrained embeddings for natural images, or with tabular datasets.

Gulrajani and Lopez-Paz [9] perform an exhaustive set of experiments to conclude that much of the progress over the last decade is beaten by modern ERM itself. Table 2 of the paper provides a nice summary of all the learning-from-data problems we have to date, and the rest of the work aims to bring model selection to the fore. The main recommendation is that a domain generalization algorithm should be responsible for specifying its own model selection method. The authors provide a good set of datasets via their benchmark DomainBed. They also identify four main ways to induce invariance for generalization:

  • learning invariant features,
  • sharing parameters,
  • meta-learning,
  • or performing data augmentation.

Le Lan and Dinh [10] argue that perfect density models cannot guarantee anomaly detection by showing that discrepancies like the curse of dimensionality and distribution mismatch exist even with perfect density models, and that the problem therefore goes beyond issues of estimation, approximation, or optimization errors. Two cases are highlighted: (i) One can map any distribution to a uniform distribution using an invertible transformation, but density-based outlier detection in a uniform distribution is impossible. Yet an invertible transform preserves information, so outlier detection should be at least as hard in any other representation too! (ii) Along similar lines, there exist reparametrizations such that the density under the new representation matches an arbitrary desired score, misleading the density-based method into misclassifying anomalies.
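
Case (i) can be illustrated in one dimension with the probability integral transform. The sketch below assumes a "perfect" model $N(0, 1)$ for the data and uses its CDF as the invertible map, so the function names and the choice of a Gaussian are mine, not the paper's. In the original coordinates a point at $x = 4$ has far lower density than a point at $x = 0$, but after the reparametrization both have density exactly 1.

```python
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)

def density_x(x):
    """Density of x under the (assumed perfect) model N(0, 1)."""
    return std_normal.pdf(x)

def to_uniform(x):
    """Invertible map u = Phi(x); if x ~ N(0, 1) then u ~ Uniform(0, 1)."""
    return std_normal.cdf(x)

def density_u(u):
    """Density of u via change of variables: p_x(x) / |du/dx| with du/dx = pdf(x)."""
    x = std_normal.inv_cdf(u)
    jacobian = std_normal.pdf(x)
    return density_x(x) / jacobian

typical, outlier = 0.0, 4.0
print("density in x:", density_x(typical), density_x(outlier))
print("density in u:", density_u(to_uniform(typical)), density_u(to_uniform(outlier)))
```

Thresholding the density separates the two points in $x$ and cannot separate them in $u$, even though the two representations carry the same information.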

With high-fidelity approximate inference using full-batch HMC, Izmailov et al. [11] show that under domain shift, SGLD/SGHMC are closer to deep ensembles (in terms of total variation distance and predictive agreement) than to the HMC posterior predictive. MFVI is even farther away from the HMC posterior. Data augmentation causes trouble with exact Bayesian inference, and therefore it appears reasonable to think that invariances should come through the prior, and not the likelihood term.

Izmailov et al. [12] argue that the lack of robustness under covariate shift for Bayesian neural networks is fundamentally caused by linear dependencies in the inputs. Cold posteriors provide marginal but inconsistent improvements across a set of corruptions. BNNs show the most competitive performance in-distribution and on the corruptions representing affine transformations.

The key intuition here is that along the directions of low variance in the eigenspace of the inputs, the MAP solution would effectively set the corresponding weights to zero, whereas a BNN would still sample them from the prior. This makes the MAP solutions more robust. Any perturbations orthogonal to the hyperplane induced by these (nearly-)constant projections will cause trouble with the BNN.
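
A toy linear-Gaussian version of this intuition, under assumptions of my own choosing: two input features, the second identically zero in training, a $N(0, 1)$ prior on both weights, and known noise. The posterior for the second weight is then just the prior, while the MAP estimate sets it to zero. In this linear toy the effect shows up as spread in the sampled predictions once a test input moves along the dead direction.

```python
import random
import statistics

def posterior(x1s, ys, prior_std=1.0, noise_std=0.1):
    """Gaussian posterior over (w1, w2) when feature 2 is always zero in training.

    Returns ((mean, std) for w1, (mean, std) for w2). The weights are independent
    here because the second feature never enters the likelihood.
    """
    precision_w1 = 1 / prior_std ** 2 + sum(x * x for x in x1s) / noise_std ** 2
    mean_w1 = sum(x * y for x, y in zip(x1s, ys)) / noise_std ** 2 / precision_w1
    return (mean_w1, precision_w1 ** -0.5), (0.0, prior_std)

def predict(w, x):
    """Linear prediction w . x for a two-dimensional input."""
    return w[0] * x[0] + w[1] * x[1]

def sample_predictions(post, x, n_samples, rng):
    """Predictions at x from weights sampled from the posterior."""
    (m1, s1), (m2, s2) = post
    return [predict((rng.gauss(m1, s1), rng.gauss(m2, s2)), x) for _ in range(n_samples)]

x1s = [0.1 * i for i in range(1, 11)]
ys = [2.0 * x for x in x1s]                      # noiseless toy targets, y = 2 * x1
post = posterior(x1s, ys)
w_map = (post[0][0], post[1][0])                 # posterior mean, i.e. MAP for a Gaussian

rng = random.Random(0)
in_dist, shifted = (1.0, 0.0), (1.0, 1.0)        # shifted input moves along the dead feature
for name, x in [("in-distribution", in_dist), ("shifted", shifted)]:
    spread = statistics.pstdev(sample_predictions(post, x, 2000, rng))
    print(f"{name}: MAP prediction={predict(w_map, x):.3f}, posterior sample std={spread:.3f}")
```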

This reminds us of the case of Bayesian Lasso. The Laplace prior does not have enough mass around zero to be meaningfully sparse in practice with a Bayesian treatment. The MAP estimate is then in fact qualitatively very different.

Bommasani et al. [13] talk about the role of foundation models in creating models robust to distribution shifts, as existing work has shown that pretraining on unlabeled data is an effective and general-purpose way to improve performance on OOD test distributions.

Ming et al. [14] aim to formalize the problem of OOD detection from the perspective of spurious OOD data (test samples that contain only the non-invariant features, e.g. environment) and non-spurious OOD data (test samples that contain no relevant features, e.g. outside the class altogether). This work provides evidence that spurious correlation hurts OOD detection. However, the theoretical results tell a mixed story:

The invariant classifier learned does not necessarily depend only on invariant features, and it is possible to learn an invariant classifier that relies on the environmental features while achieving lower risk than the optimal invariant classifier.

MEMO [15] aims to introduce a plug-and-play method for test-time robustification of models via marginal entropy minimization with one test point.

Optimizing the model for more confident predictions can be justified from the assumption that the true underlying decision boundaries between classes lie in low-density regions of the data space.
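
A minimal sketch of the quantity being minimized: average the model's predictive distributions over several augmented copies of the single test input, then take the entropy of that average. The function name and the toy probabilities are assumptions, and the actual adaptation step (a gradient update of the model parameters on this quantity) is omitted.

```python
import math

def marginal_entropy(aug_probs):
    """Entropy of the average predictive distribution over augmented copies of one input."""
    n = len(aug_probs)
    n_classes = len(aug_probs[0])
    marginal = [sum(p[c] for p in aug_probs) / n for c in range(n_classes)]
    return -sum(q * math.log(q) for q in marginal if q > 0)

# Toy predictions for two augmentations of the same test point.
consistent = [[0.9, 0.1], [0.9, 0.1]]   # confident and in agreement
conflicting = [[0.9, 0.1], [0.1, 0.9]]  # confident but contradictory

print(marginal_entropy(consistent))
print(marginal_entropy(conflicting))
```

The marginal entropy is low only when the predictions are both confident and consistent across augmentations, so minimizing it pushes towards both at once.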


  • Does BMA avoid/solve any of these problems: (i) spurious correlations, (ii) extrapolation, or (iii) temporal shift?
  • Are approximate likelihood maximization and predictive performance fundamentally at odds?
    • Conventional wisdom is that when using a proper scoring rule (like the likelihood), the optimum of the score corresponds to the perfect prediction. But in practice, this isn't working. See the scoring and calibration discussion of Ovadia et al. [2]
  • Improving fits to HMC doesn't necessarily improve predictive performance. This means the goals should be different?
  • Le Lan and Dinh [10] remind us of the debate around flat-vs-sharp minima. There exist sharp minima which generalize poorly. But SGD appears to be biased towards finding flat minima, partly because they occupy so much more volume that it is statistically unlikely to arrive at a sharp minimum. Similarly, even though there exist invertible transformations which map any distribution to a uniform distribution, is finding such a transformation likely under the inductive biases posed by contemporary NN training routines?
  • What is the role of pretraining towards OOD shift? [16] Methodological contributions have fallen short, but just simple pretraining works? [17]
  • Isn't Sagawa et al. [6] essentially hinting towards a case for being a little bit more Bayesian? Particularly, they highlight that training to zero loss makes worst-case group accuracy terrible. This is like over-committing to a single parameter configuration. In addition, it is easier to overfit to smaller groups. Again we have a small-sample problem, and quantifying epistemic uncertainty should be important for the predictive distribution.
  • In the spirit of modern work that shows multi-modal datasets can help generalization, can we use image segmentation maps to avoid spurious correlations that rely on non-invariant environmental features like in the background?
  • An Occam's Razor view of the problem of domain generalization?
  • Data augmentation appears to be clearly doing something meaningful when it works. What will automated data augmentation pipelines look like? How do we learn from our objectives what data augmentation is needed? That said, it would also appear that existing data augmentation techniques provide only a very weak invariance [5].

Practical Bits

  • Reweighting unbalanced classes by relative frequencies in each minibatch. [18]
  • Pretraining image embeddings provides better OOD detection than training normalizing flows on the raw data. [8]
  • Le Lan and Dinh [10] have a nice visualization comparing density-based versus typicality-based outlier detection.
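
For the minibatch reweighting in the first item, one plausible reading is inverse-frequency weights computed from the labels of the current minibatch. The sketch below assumes that reading, and the mean-1 normalization is also my own choice. The exact scheme in [18] may differ.

```python
from collections import Counter

def minibatch_class_weights(labels):
    """Per-example weights inversely proportional to class frequency in this minibatch.

    Weights are normalized to have mean 1, so every class present in the batch
    contributes the same total weight to the loss.
    """
    counts = Counter(labels)
    raw = [1.0 / counts[y] for y in labels]
    scale = len(labels) / sum(raw)
    return [w * scale for w in raw]

batch_labels = [0, 0, 0, 1]  # toy minibatch with a 3:1 imbalance
weights = minibatch_class_weights(batch_labels)
print(weights)

# A weighted loss would then be sum(w * loss_i) / len(batch) over the minibatch.
```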


  • WILDS [19]
  • BREEDS [20]
  • Diabetic Retinopathy [4]
  • The Fishyscapes Benchmark [21]
  • Galaxy Zoo [22]

Structure & Taxonomy

Possible dataset shifts (see the graphical models in Quiñonero-Candela et al. [1]):

  • Simple Covariate Shift (change in input covariates $p(x)$)
  • Prior Probability Shift (change in target $p(y)$)
  • Sample Selection Bias (selection process of a sample is dependent on target $y$)
  • Imbalanced Data
  • Domain Shift
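
For the first three items, the shift can be written as a change in one factor of the joint while the other factor stays fixed. The following is a sketch under the standard definitions. The subscripts and the selection indicator $s$ are notation I am introducing here, not taken from the book.

```latex
\begin{align*}
\text{Covariate shift:} \quad
  & p_{\text{train}}(x, y) = p(y \mid x)\, p_{\text{train}}(x),
  && p_{\text{test}}(x, y) = p(y \mid x)\, p_{\text{test}}(x) \\
\text{Prior probability shift:} \quad
  & p_{\text{train}}(x, y) = p(x \mid y)\, p_{\text{train}}(y),
  && p_{\text{test}}(x, y) = p(x \mid y)\, p_{\text{test}}(y) \\
\text{Sample selection bias:} \quad
  & p_{\text{train}}(x, y) \propto p(s = 1 \mid x, y)\, p(x, y),
  && p_{\text{test}}(x, y) = p(x, y)
\end{align*}
```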


  1. Quiñonero-Candela, Joaquin et al. “Dataset Shift in Machine Learning.” (2009).

  2. Ovadia, Yaniv et al. “Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift.” ArXiv abs/1906.02530 (2019).

  3. Miller, John et al. “Accuracy on the Line: on the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization.” ArXiv abs/2107.04649 (2021).

  4. Filos, Angelos et al. “A Systematic Comparison of Bayesian Deep Learning Robustness in Diabetic Retinopathy Tasks.” ArXiv abs/1912.10481 (2019).

  5. Azulay, Aharon and Yair Weiss. “Why do deep convolutional networks generalize so poorly to small image transformations?” J. Mach. Learn. Res. 20 (2018): 184:1-184:25.

  6. Sagawa, Shiori et al. “Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization.” ArXiv abs/1911.08731 (2019).

  7. Arjovsky, Martín et al. “Invariant Risk Minimization.” ArXiv abs/1907.02893 (2019).

  8. Kirichenko, P. et al. “Why Normalizing Flows Fail to Detect Out-of-Distribution Data.” ArXiv abs/2006.08545 (2020).

  9. Gulrajani, Ishaan and David Lopez-Paz. “In Search of Lost Domain Generalization.” ArXiv abs/2007.01434 (2020).

  10. Le Lan, Charline and Laurent Dinh. “Perfect Density Models Cannot Guarantee Anomaly Detection.” Entropy 23 (2020).

  11. Izmailov, Pavel et al. “What Are Bayesian Neural Network Posteriors Really Like?” International Conference on Machine Learning (2021).

  12. Izmailov, Pavel et al. “Dangers of Bayesian Model Averaging under Covariate Shift.” Neural Information Processing Systems (2021).

  13. Bommasani, Rishi et al. “On the Opportunities and Risks of Foundation Models.” ArXiv abs/2108.07258 (2021).

  14. Ming, Yifei et al. “On the Impact of Spurious Correlation for Out-of-distribution Detection.” ArXiv abs/2109.05642 (2021).

  15. Zhang, Marvin et al. “MEMO: Test Time Robustness via Adaptation and Augmentation.” ArXiv abs/2110.09506 (2021).

  16. Hendrycks, Dan et al. “Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty.” ArXiv abs/1906.12340 (2019).

  17. Hendrycks, Dan et al. “Pretrained Transformers Improve Out-of-Distribution Robustness.” ArXiv abs/2004.06100 (2020).

  18. Leibig, Christian et al. “Leveraging uncertainty information from deep neural networks for disease detection.” Scientific Reports 7 (2016).

  19. Koh, Pang Wei et al. “WILDS: A Benchmark of in-the-Wild Distribution Shifts.” International Conference on Machine Learning (2020).

  20. Santurkar, Shibani et al. “BREEDS: Benchmarks for Subpopulation Shift.” arXiv: Computer Vision and Pattern Recognition (2020).

  21. Blum, Hermann et al. “The Fishyscapes Benchmark: Measuring Blind Spots in Semantic Segmentation.” International Journal of Computer Vision 129 (2019): 3119 - 3135.

  22. Walmsley, Mike et al. “Galaxy Zoo: Probabilistic Morphology through Bayesian CNNs and Active Learning.” ArXiv abs/1905.07424 (2019).

© 2023 Sanyam Kapoor