Zellner1 shows that under a well-specified model, Bayes' theorem is the optimal information-processing rule. All our problems arise in the misspecified setting. We can still do inference, but we only get optimality in the KL-projection sense. And since the KL divergence is sensitive to the tails, Bayes' rule can be sensitive to outliers in the small-data regime. Jaynes2 argues Bayes' rule is the best tool for inductive inference, approaching it from the angle of logic.
The generalized posterior is given by (with $\eta = 1$ recovering Bayes' rule),

$$\pi_\eta(\theta \mid x_{1:n}) \propto \pi(\theta)\, p(x_{1:n} \mid \theta)^{\eta}.$$

Part of the research community focuses on techniques to find a good $\eta$.
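A minimal numerical sketch of the generalized posterior (a Bernoulli toy model on a parameter grid; my own illustration, not from any cited paper): $\eta = 1$ recovers the ordinary Bayes posterior, while $\eta < 1$ flattens the likelihood and inflates posterior uncertainty.

```python
import numpy as np

# Generalized (tempered) posterior on a parameter grid.
# eta = 1 recovers the Bayes posterior; eta < 1 downweights the likelihood.
def generalized_posterior(log_lik, log_prior, eta):
    log_post = eta * log_lik + log_prior
    log_post -= log_post.max()              # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# Toy data: 9 heads out of 10 Bernoulli tosses, uniform prior on a theta grid.
theta = np.linspace(0.01, 0.99, 99)
log_lik = 9 * np.log(theta) + 1 * np.log(1 - theta)
log_prior = np.zeros_like(theta)

bayes = generalized_posterior(log_lik, log_prior, eta=1.0)
tempered = generalized_posterior(log_lik, log_prior, eta=0.5)

def post_var(p):
    m = (p * theta).sum()
    return (p * (theta - m) ** 2).sum()
```

Comparing `post_var(tempered)` against `post_var(bayes)` shows the tempered posterior is wider, i.e. more cautious under possible misspecification.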
This idea is also known as the power likelihood, coming from the statistics community as a mild way to handle misspecification. When $\eta > 1$, we are essentially putting more weight on the data than the prior, giving posteriors that underestimate uncertainty. One way to choose this value is to ensure that the information gain from a single sample is the same as it would be under the true model. Holmes and Walker3 derive such an update via the expected divergence in Fisher information, and show that the resulting model is closer to the true model. The "true" model, for the purposes of divergence estimation, is built via an empirical estimate.
From a loss-function perspective, $\eta$ can be regarded as the learning rate of the update rule.
The worst-case convergence rates of both methods take the form of a ratio whose numerator is a complexity term growing more slowly than the denominator. Both are effectively the same problem, and the grand goal would be to find practical values of $\eta$ (since the theoretical values don't work in practice).
Grunwald and van Ommen4 show, via a simple Bayesian linear regression model in a misspecified scenario, that Bayesian model averaging (BMA) does not learn with fewer samples, but at some point recovers. They conjecture that Bayes never recovers if the hypothesis class is infinite in size; for finite classes, Bayes does eventually recover, but requires many more samples. This issue also has nothing to do with outliers versus inliers. Under model misspecification (and certain other stronger conditions), Bayes concentrates around the closest points in terms of KL divergence.
Grunwald and van Ommen4 argue that Bayesian model selection, when viewed through the prequential-coding MDL (Minimum Description Length) lens, fails because we expect to pick a model minimizing the following negative log-likelihood,

$$-\sum_{i=1}^{n} \log \int p(x_i \mid \theta)\, \pi(\theta \mid x_{1:i-1})\, d\theta.$$

This contains a mixture distribution due to the expectation over the posterior. Safe-Bayesian approaches instead take the mean $\bar{\theta} = \mathbb{E}_{\pi(\theta \mid x_{1:i-1})}[\theta]$ and predict with $p(x_i \mid \bar{\theta})$. This means we average in the parameter space instead of the probability space, so that we always make predictions using a distribution that lies within the model.
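To make the distinction concrete, here is a toy sketch (my own example, not from the paper): two equally weighted posterior atoms for the mean of a unit-variance Gaussian give a bimodal BMA predictive that lies outside the Gaussian model class, while the plug-in predictive at the posterior mean stays inside it.

```python
import numpy as np

def gauss(x, mu):
    # unit-variance Gaussian density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# Two equally weighted posterior atoms for the mean parameter.
thetas = np.array([-2.0, 2.0])
weights = np.array([0.5, 0.5])
x = np.linspace(-5, 5, 1001)

# BMA predictive: a mixture in probability space -> bimodal, outside the model.
mixture = sum(w * gauss(x, t) for w, t in zip(weights, thetas))

# Safe-Bayes-style plug-in: average in parameter space -> a single Gaussian.
theta_bar = float((weights * thetas).sum())
plug_in = gauss(x, theta_bar)
```

The mixture peaks near $\pm 2$ and dips at 0, so no single Gaussian matches it; the plug-in density at $\bar\theta = 0$ is itself a member of the model.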
A BDL panel in 20165 notes that PGMs were effectively meant for the holy grail of composability; it is only the inference methods that have lagged behind. We don't need uncertainty over everything. Max Welling argues that the true promise of Bayesian methods is model selection.
Adlam et al.6 argue that commonly used priors in BNNs significantly overestimate the aleatoric uncertainty in the labels of classification benchmarks. When specifying a prior over functions, we must account for both epistemic and aleatoric uncertainty. If the aleatoric uncertainty is low, we must favor functions that assign higher probability to a single class. The cold posterior then reduces this aleatoric uncertainty. It is straightforward to see this analytically in the GP regression case, which practically implies that if we estimate aleatoric uncertainty correctly, there is no cold posterior effect; but if we overestimate aleatoric uncertainty, we need the correction.
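A small sketch of the analytical point, using a conjugate Gaussian-mean model as a stand-in for GP regression (my own illustration): tempering the likelihood with power $1/T$ yields exactly the same posterior as rescaling the noise variance by $T$, so a cold posterior ($T < 1$) is indistinguishable from assuming lower aleatoric noise.

```python
import numpy as np

# Conjugate model: y_i ~ N(theta, sigma2), prior theta ~ N(0, tau2).
rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=20)
tau2, sigma2, T = 4.0, 1.0, 0.5          # prior var, noise var, temperature

def posterior(y, sigma2, tau2, power=1.0):
    # Posterior mean/variance with the likelihood raised to `power`.
    prec = power * len(y) / sigma2 + 1.0 / tau2
    mean = (power * y.sum() / sigma2) / prec
    return mean, 1.0 / prec

cold = posterior(y, sigma2, tau2, power=1.0 / T)       # tempered likelihood
rescaled = posterior(y, T * sigma2, tau2, power=1.0)   # reduced noise variance
```

Both routes give identical posterior moments, which is the sense in which overestimated aleatoric noise and tempering trade off against each other.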
we believe that there is no reason to expect that initialization schemes which achieve good performance in vanilla NNs will give rise to appropriate priors for BNNs.6
Nabarro et al.7 try to incorporate data augmentation into a generative model of the data, but still cannot get rid of the cold posterior effect. The general idea is to marginalize out the augmentations using a distribution over augmentations conditioned on the sample of interest. The authors also try two variants of enforcing invariance: averaging the logits or averaging the predictive probabilities. Averaging logits appears better at higher temperatures, where the cold posterior effect is much less pronounced.
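The two averaging variants can be sketched as follows (hypothetical numbers; the softmax helper and the two-augmentation setup are my own):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Logits for the same input under two augmentations (3-class toy example).
logits = np.array([[3.0, 0.0, -1.0],    # augmentation 1
                   [0.0, 2.5, -0.5]])   # augmentation 2

# Variant 1: average in logit space, then normalize.
avg_logits_pred = softmax(logits.mean(axis=0))

# Variant 2: normalize each augmentation, then average in probability space.
avg_probs_pred = softmax(logits).mean(axis=0)
```

Both are valid predictive distributions, but they generally disagree: averaging probabilities is a mixture, while averaging logits is a (normalized) geometric mean of the per-augmentation predictives.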
Aitchison8 shows that if we take the dataset curation process into account, we can recover tempered likelihoods. Consider a setting where the label is $y$ only when every one of $S$ human labelers annotates the input in precisely the same way (and None otherwise), i.e. we use a factorized distribution $\prod_{s=1}^{S} p(y \mid x)$ over human labelers; the consensus event immediately yields the tempered likelihood factor $p(y \mid x)^S$. Interestingly, normalizing $p(y \mid x)^S$ over classes gives us a reparametrized softmax. Although, for full Bayesian inference we must marginalize over this generative process, and therefore the optimal tempering shouldn't be the same. Full Bayesian inference is impossible here, because it involves a factor that we do not know.
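The consensus factor can be verified numerically (a toy 3-class example of my own): normalizing $p(y \mid x)^S$ over classes is exactly a softmax whose logits have been multiplied by $S$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 0.2, -0.5])
S = 3                               # number of labelers who must agree

p = softmax(logits)
consensus = p ** S / (p ** S).sum() # normalized tempered likelihood p(y|x)^S
scaled = softmax(S * logits)        # "reparametrized" softmax: logits * S
```

The identity holds because $\mathrm{softmax}(z)^S \propto e^{Sz}$, so normalization recovers $\mathrm{softmax}(Sz)$.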
Cold posteriors observed "in the wild" are therefore unlikely to arise from a single simple cause; as a result, we do not expect a simple "fix" for cold posteriors.
Zeno et al.12 argue that good priors should be input-dependent.
Fortuin et al.13 study the cold posterior effect through the lens of misspecified priors. They find that spatially correlated priors work better in CNNs, while heavy-tailed priors work better in FCNNs. They also find that when fitting the degrees of freedom $\nu$ of a Student-t prior, the lower layers learn small $\nu$ (heavy tails), whereas higher layers exhibit large $\nu$ (i.e. closer to a Gaussian). While these priors reduce the cold posterior effect in FCNNs, they make it worse in ResNets. Altogether, a lot of conflicting evidence.
Kapoor et al.14 give a descriptive theory of how tempering is connected to aleatoric uncertainty.
For a quick introduction to conformal prediction, see A Gentle Introduction to Conformal Prediction.
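For orientation, here is a minimal split-conformal sketch (the standard recipe, with a cubic polynomial as a hypothetical stand-in predictor; not taken from the linked post):

```python
import numpy as np

# Split conformal prediction: fit on one half, calibrate residuals on the
# other, then predict intervals with finite-sample coverage >= 1 - alpha
# under exchangeability.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 500)
y = np.sin(x) + rng.normal(0, 0.2, 500)

x_fit, y_fit = x[:250], y[:250]          # fitting split
x_cal, y_cal = x[250:], y[250:]          # calibration split

coef = np.polyfit(x_fit, y_fit, deg=3)   # stand-in point predictor
predict = lambda t: np.polyval(coef, t)

alpha = 0.1
scores = np.abs(y_cal - predict(x_cal))              # nonconformity scores
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))    # conformal quantile rank
q = np.sort(scores)[k - 1]

interval = lambda t: (predict(t) - q, predict(t) + q)
```

The interval width `q` is a single calibrated residual quantile, so the guarantee is marginal (on average over inputs), not conditional per input.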
Chris C. Holmes and Stephen G. Walker. “Assigning a value to a power likelihood in a general Bayesian model.” Biometrika 104(2) (2017). https://academic.oup.com/biomet/article/104/2/497/3074978 ↩
Peter D. Grunwald and Thijs van Ommen. “Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It.” arXiv: Statistics Theory (2014) https://arxiv.org/abs/1412.3730 ↩ ↩2
Arthur P. Dempster. “A Generalization of Bayesian Inference.” Classic Works of the Dempster-Shafer Theory of Belief Functions (1968). https://link.springer.com/chapter/10.1007/978-3-540-44792-4_4 ↩