Diffusion Models
Living document on diffusion models from scratch.
The basic idea is to pick a forward noise process $q(x_t \mid x_{t-1})$ that converges to some easy-to-sample distribution $q(x_T)$, which is taken to be $\mathcal{N}(0, I)$. By sampling $x_T \sim \mathcal{N}(0, I)$, we then reverse the process to produce a sample from the original data distribution.1
Variational Perspective
We consider the known noising process starting from a data sample $x_0 \sim q(x_0)$ as $q(x_t \mid x_{t-1})$, and learn a reverse process as $p_\theta(x_{t-1} \mid x_t)$ for $t = T, \dots, 1$. We can now build a variational lower bound (also motivated from a multivariate information bottleneck2 perspective) to the log marginal likelihood of the data as,3

$$\log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right].$$
Using the auto-regressive (Markov) decompositions $q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$ and $p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$, the lower bound can be further decomposed as:

$$\mathbb{E}_{q(x_1 \mid x_0)}\left[\log p_\theta(x_0 \mid x_1)\right] - \mathbb{E}_{q(x_{T-1} \mid x_0)}\left[D_{\mathrm{KL}}\!\left(q(x_T \mid x_{T-1}) \,\|\, p(x_T)\right)\right] - \sum_{t=1}^{T-1} \mathbb{E}_{q(x_{t-1}, x_{t+1} \mid x_0)}\left[D_{\mathrm{KL}}\!\left(q(x_t \mid x_{t-1}) \,\|\, p_\theta(x_t \mid x_{t+1})\right)\right].$$
We have the usual reconstruction term (from the one-step latent $x_1$, as in an amortized VAE), a prior matching term (independent of anything learnable, so it can be ignored), and a consistency term (between the forward process $q(x_t \mid x_{t-1})$ and the backward process $p_\theta(x_t \mid x_{t+1})$).
The consistency term above takes an expectation over two variables $\{x_{t-1}, x_{t+1}\}$ and can therefore have higher variance in practice. The key insight is to change the conditioning in the forward process, using the Markov property and Bayes' rule, as:

$$q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}.$$
The equivalent objective now is:

$$\mathbb{E}_{q(x_1 \mid x_0)}\left[\log p_\theta(x_0 \mid x_1)\right] - D_{\mathrm{KL}}\!\left(q(x_T \mid x_0) \,\|\, p(x_T)\right) - \sum_{t=2}^{T} \mathbb{E}_{q(x_t \mid x_0)}\left[D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)\right].$$
The reconstruction term is the same, and the prior matching term remains independent of anything trainable (and is also zero under our assumptions). The consistency term is now replaced with a denoising matching term, which only requires an expectation over a single variable $x_t$.
Now, for the terms in the denoising matching part of the objective, because we know that the distributions implied by the noising process are Gaussian, we can use Bayes' rule and the reparametrization trick. The forward transition is

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t)\, I\right)$$

by assumption of the noise schedule $\{\alpha_t\}_{t=1}^{T}$, which could either be fixed4 or learned,5 chosen as a variance-preserving schedule. Under such a noise schedule, we also get $q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t)\, I\right)$, where $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.
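As a concrete illustration, here is a minimal NumPy sketch of this variance-preserving forward process. The linear $\beta_t = 1 - \alpha_t$ schedule and all names below are illustrative assumptions, not something fixed by the text:

```python
import numpy as np

# Illustrative discrete variance-preserving schedule (exact values are assumptions).
T = 1000
betas = np.linspace(1e-4, 2e-2, T)      # beta_t = 1 - alpha_t
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # \bar{alpha}_t = prod_{s <= t} alpha_s

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

# Example: noise a toy 2-D "data point" most of the way to the prior.
x0 = np.array([1.0, -0.5])
xt, eps = q_sample(x0, t=800)
```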
Using such a schedule, we can get a closed form for the mean and variance of $q(x_{t-1} \mid x_t, x_0)$ (see Eq. (84)3):

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \mu_q(x_t, x_0),\ \sigma_q^2(t)\, I\right), \quad \mu_q(x_t, x_0) = \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1 - \alpha_t)\, x_0}{1 - \bar\alpha_t}, \quad \sigma_q^2(t) = \frac{(1 - \alpha_t)(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}.$$

For the denoising model $p_\theta(x_{t-1} \mid x_t)$, we can immediately construct the variance to be the same, while the mean is parametrized as $\mu_\theta(x_t, t)$. The KL divergence between two Gaussians with equal covariance is then simply a (scaled squared) difference between the means.
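For reference, a small sketch of this closed-form posterior, reusing the `alphas`/`alpha_bars` arrays from the snippet above (valid for $t \geq 1$):

```python
def q_posterior(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) under the schedule above."""
    alpha_t = alphas[t]
    abar_t, abar_prev = alpha_bars[t], alpha_bars[t - 1]
    mean = (np.sqrt(alpha_t) * (1.0 - abar_prev) * xt
            + np.sqrt(abar_prev) * (1.0 - alpha_t) * x0) / (1.0 - abar_t)
    var = (1.0 - alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return mean, var
```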
By mirroring the specific form of $\mu_q(x_t, x_0)$ in $\mu_\theta(x_t, t)$, i.e. parametrizing $\mu_\theta$ the same way but with a learned prediction $\hat{x}_\theta(x_t, t)$ in place of $x_0$, the operands in the optimization problem simplify to denoising the input5 at different noise levels:

$$\arg\min_\theta\ \frac{1}{2\sigma_q^2(t)}\, \frac{\bar\alpha_{t-1}(1 - \alpha_t)^2}{(1 - \bar\alpha_t)^2}\, \left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2.$$
Using the definition of the signal-to-noise ratio (SNR) as the squared mean over the variance, i.e. $\mathrm{SNR}(t) = \bar\alpha_t / (1 - \bar\alpha_t)$, we can simplify the above objective to

$$\arg\min_\theta\ \frac{1}{2}\left(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)\right)\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2.$$
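In code, the SNR and the resulting per-timestep weight are one-liners on top of the schedule arrays from the first snippet (a sketch under those same assumptions):

```python
def snr(t):
    """Signal-to-noise ratio SNR(t) = abar_t / (1 - abar_t)."""
    return alpha_bars[t] / (1.0 - alpha_bars[t])

def elbo_weight(t):
    """Weight 0.5 * (SNR(t-1) - SNR(t)) multiplying ||x_hat - x_0||^2 at step t."""
    return 0.5 * (snr(t - 1) - snr(t))
```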
In practice, noting that we can rewrite $x_0$ in $\mu_q(x_t, x_0)$ in terms of a noise random variable $\epsilon \sim \mathcal{N}(0, I)$ via the relation $x_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon}{\sqrt{\bar\alpha_t}}$, and then mirroring the functional form for $\mu_\theta$ as earlier, we can instead match the source noise, which works better in practice (see Eq. (115)3):

$$\arg\min_\theta\ \frac{1}{2\sigma_q^2(t)}\, \frac{(1 - \alpha_t)^2}{(1 - \bar\alpha_t)\, \alpha_t}\, \left\|\epsilon - \hat\epsilon_\theta(x_t, t)\right\|_2^2.$$
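A minimal sketch of the corresponding training loss, reusing `T` and `alpha_bars` from the first snippet. The placeholder `eps_model` stands in for the learned network $\hat\epsilon_\theta$, and sampling $t$ uniformly (dropping the time-dependent constant) follows common practice rather than the exact weighting above:

```python
def noise_prediction_loss(eps_model, x0, rng=np.random.default_rng(0)):
    """Monte-Carlo estimate of E_t E_eps || eps - eps_model(x_t, t) ||^2."""
    t = int(rng.integers(1, T))                  # random timestep
    eps = rng.standard_normal(x0.shape)          # source noise to be matched
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

# Example with a trivial (untrained) "model" that always predicts zero noise.
loss = noise_prediction_loss(lambda xt, t: np.zeros_like(xt), x0)
```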
SDE Perspective
An alternative objective takes a score-matching form due to Tweedie's formula6, which states that the true mean of an exponential-family distribution can be estimated by the maximum likelihood estimate plus a correction term involving the score of the estimate. Specifically, for our case of $q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t)\, I\right)$, the best estimate of its mean is

$$\sqrt{\bar\alpha_t}\, x_0 = x_t + (1 - \bar\alpha_t)\, \nabla_{x_t} \log p(x_t).$$
Using this to rewrite $x_0$ in $\mu_q(x_t, x_0)$, and then mirroring the functional form for $\mu_\theta$ as earlier, we get a new score-matching objective:

$$\arg\min_\theta\ \frac{1}{2\sigma_q^2(t)}\, \frac{(1 - \alpha_t)^2}{\alpha_t}\, \left\|s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t)\right\|_2^2.$$
The score-matching objective and the noise-prediction objective differ only by a constant factor that varies over time: from the reparametrization $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$, the score satisfies $\nabla_{x_t} \log p(x_t) = -\frac{\epsilon}{\sqrt{1 - \bar\alpha_t}}$.
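In code, converting between the two parameterizations is a single rescaling (a sketch, again reusing `alpha_bars` from the first snippet):

```python
def score_from_eps(eps_hat, t):
    """Score implied by a noise prediction: -eps_hat / sqrt(1 - abar_t)."""
    return -eps_hat / np.sqrt(1.0 - alpha_bars[t])
```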
The forward process is described by an SDE as:

$$dx = f(x, t)\, dt + g(t)\, dw,$$

where $w$ is a standard Wiener process, $f(\cdot, t)$ is the drift coefficient, and $g(t)$ is the diffusion coefficient.
The reverse process is:

$$dx = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar{w},$$

where $\bar{w}$ is a standard Wiener process when time flows backwards from $T$ to $0$.
The denoising (score-matching) objective is:

$$\arg\min_\theta\ \mathbb{E}_{t}\left[\lambda(t)\, \mathbb{E}_{x_0}\, \mathbb{E}_{x_t \mid x_0}\left\|s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0)\right\|_2^2\right],$$

where $\lambda(t)$ is a positive weighting function.
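To make the reverse SDE concrete, here is an Euler–Maruyama sampler sketch for the variance-preserving instance $f(x, t) = -\tfrac{1}{2}\beta(t)\, x$, $g(t) = \sqrt{\beta(t)}$; this specific instance and the continuous $\beta(t)$ below are assumptions borrowed from the standard VP-SDE setup, not stated above:

```python
def reverse_sde_sample(score_fn, shape, n_steps=1000, rng=np.random.default_rng(0)):
    """Euler-Maruyama integration of the reverse-time VP SDE from t=1 down to t=0."""
    beta = lambda t: 0.1 + (20.0 - 0.1) * t          # illustrative continuous beta(t)
    dt = 1.0 / n_steps
    x = rng.standard_normal(shape)                   # start from the prior x_T ~ N(0, I)
    for i in reversed(range(n_steps)):
        t = (i + 1) / n_steps
        # drift of the reverse SDE: f(x, t) - g(t)^2 * score(x, t)
        drift = -0.5 * beta(t) * x - beta(t) * score_fn(x, t)
        # step backwards in time: x_{t-dt} = x_t - drift * dt + g(t) * sqrt(dt) * z
        x = x - drift * dt + np.sqrt(beta(t) * dt) * rng.standard_normal(shape)
    return x

# Example with a dummy score function (a trained s_theta(x, t) would go here).
sample = reverse_sde_sample(lambda x, t: -x, shape=(2,))
```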
Talks
Denoising as a Building Block for Imaging, Inverse Problems, and Machine Learning by Peyman Milanfar
TO-READ
Stable Diffusion.7
Denoising Diffusion Models by Gabriel Peyré (2023)
Footnotes
1. Benton, Joe, Yuyang Shi, Valentin De Bortoli, George Deligiannidis and A. Doucet. “From Denoising Diffusions to Denoising Markov Models.” ArXiv abs/2211.03595 (2022). https://arxiv.org/abs/2211.03595
2. Friedman, Nir, Ori Mosenzon, Noam Slonim and Naftali Tishby. “Multivariate Information Bottleneck.” Neural Computation 18 (2006): 1739-1789. https://arxiv.org/abs/1301.2270
3. Luo, Calvin. “Understanding Diffusion Models: A Unified Perspective.” ArXiv abs/2208.11970 (2022). https://arxiv.org/abs/2208.11970
4. Ho, Jonathan, Ajay Jain and P. Abbeel. “Denoising Diffusion Probabilistic Models.” ArXiv abs/2006.11239 (2020). https://arxiv.org/abs/2006.11239
5. Kingma, Diederik P., Tim Salimans, Ben Poole and Jonathan Ho. “Variational Diffusion Models.” ArXiv abs/2107.00630 (2021). https://arxiv.org/abs/2107.00630
6. Efron, Bradley. “Tweedie’s Formula and Selection Bias.” Journal of the American Statistical Association 106 (2011): 1602-1614. https://www.tandfonline.com/doi/abs/10.1198/jasa.2011.tm11181
7. Rombach, Robin et al. “High-Resolution Image Synthesis with Latent Diffusion Models.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 10674-10685. https://arxiv.org/abs/2112.10752