Diffusion Models
Living document on diffusion models from scratch.
The basic idea is to pick a forward noise process $(q_t)_{t\geq 0}$ that converges to some easy-to-sample distribution $q_\mathrm{ref}$, which is taken to be $p_0$. Starting from a sample of $p_0$, we then run the process in reverse to produce a sample from the original data distribution.^{1}
Variational Perspective
We consider the known noising process starting from a data sample $x_0$ as $q(x_{t} \mid x_{t-1})$, and learn a reverse process as $p(x_{t-1} \mid x_t)$ for $t \in [1,T]$. We can now build a variational lower bound (also motivated from a multivariate information bottleneck^{2} perspective) on the marginal log-likelihood of the data $\log p(x_0)$ as,^{3}
$\log{p(x_0)} \geq \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[ \log{\frac{p(x_{0:T})}{q(x_{1:T}\mid x_0)}} \right] = \ell(x_{0:T})$
Using an autoregressive decomposition of the $p$ and $q$ distributions, the lower bound can be further decomposed as:
$\ell(x_{0:T}) = \mathbb{E}_{q(x_1 \mid x_0)}\left[ \log{p}(x_0 \mid x_1) \right] - \mathbb{E}_{q(x_{T-1}\mid x_0)}\left[\mathcal{KL}(q(x_T \mid x_{T-1}) ~\Vert~ p(x_T))\right] - \sum_{t=1}^{T-1} \mathbb{E}_{q(x_{t-1},x_{t+1}\mid x_0)} \left[ \mathcal{KL}(q(x_t \mid x_{t-1}) ~\Vert~ p(x_t \mid x_{t+1})) \right]$
We have the usual reconstruction term (a one-step latent, as in a standard amortized VAE), a prior matching term (independent of anything learnable, so it can be ignored), and a consistency term (between the forward process $q$ and the backward process $p$).
The consistency term above takes an expectation over two random variables, $x_{t-1}$ and $x_{t+1}$, so its Monte Carlo estimate can have high variance in practice. The key insight here is to change the conditioning in the forward process as:
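$q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1}\mid x_t,x_0)q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)}$
The first equality uses the Markov property of the forward process and the second is Bayes' rule. The equivalent objective now is: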
$\ell(x_{0:T}) = \mathbb{E}_{q(x_1 \mid x_0)}\left[ \log{p}(x_0 \mid x_1) \right] - \mathcal{KL}(q(x_T\mid x_0) ~\Vert~ p(x_T)) - \sum_{t=2}^T \mathbb{E}_{q(x_t\mid x_0)}\left[ \mathcal{KL}(q(x_{t-1}\mid x_t, x_0) ~\Vert~ p(x_{t-1} \mid x_t)) \right]$
The reconstruction term is the same, and the prior matching term is independent of anything trainable (and also zero under our assumptions). The denoising matching term replaces the consistency term and only requires an expectation over a single variable.
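The step from the Bayes rewrite to the objective above can be spelled out as follows. This is a sketch that assumes the reverse model factorizes as $p(x_{0:T}) = p(x_T)\,p(x_0 \mid x_1)\prod_{t=2}^T p(x_{t-1}\mid x_t)$, which the text implies but does not state explicitly. The ratio of marginals telescopes, leaving one posterior per step.

```latex
\begin{aligned}
q(x_{1:T} \mid x_0)
  &= q(x_1 \mid x_0) \prod_{t=2}^{T} q(x_t \mid x_{t-1}, x_0) \\
  &= q(x_1 \mid x_0) \prod_{t=2}^{T}
     \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} \\
  &= q(x_T \mid x_0) \prod_{t=2}^{T} q(x_{t-1} \mid x_t, x_0), \\
\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}
  &= \log p(x_0 \mid x_1)
   + \log \frac{p(x_T)}{q(x_T \mid x_0)}
   + \sum_{t=2}^{T} \log \frac{p(x_{t-1} \mid x_t)}{q(x_{t-1} \mid x_t, x_0)}.
\end{aligned}
```

Taking the expectation under $q(x_{1:T} \mid x_0)$ turns the second and third groups into negative $\mathcal{KL}$ terms, which are the prior matching and denoising matching terms respectively.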
Now for the $\mathcal{KL}$ terms in the denoising matching part of the objective, because we know that the distributions implied by the noising process $q$ are Gaussian, using Bayes’ rule and the reparametrization trick,
$q(x_{t-1}\mid x_t, x_0) = \frac{q(x_t \mid x_{t-1},x_0)q(x_{t-1}\mid x_0)}{q(x_t \mid x_0)}$
$q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{\alpha_t} x_{t-1}, (1-\alpha_t)\mathbf{I})$ by assumption of the noise schedule $\alpha_{t}$, which could either be fixed^{4} or learned,^{5} chosen as a variance-preserving schedule. Under such a noise schedule, we also get $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t) \mathbf{I})$, where $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$.
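To make the closed-form marginal concrete, here is a minimal sketch in plain Python for the scalar case. The linear schedule and its endpoints (`beta_start`, `beta_end`, `T = 1000`) are assumptions chosen for illustration, not something these notes prescribe, and all function names are hypothetical. It composes the one-step kernels analytically, compares the result with $\sqrt{\bar{\alpha}_T}$ and $1-\bar{\alpha}_T$, and evaluates the prior matching $\mathcal{KL}$ against $\mathcal{N}(0, 1)$.

```python
import math

def linear_alphas(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule: alpha_t = 1 - beta_t for t = 1..T."""
    step = (beta_end - beta_start) / (T - 1)
    return [1.0 - (beta_start + step * i) for i in range(T)]

def alpha_bars(alphas):
    """Cumulative products: alpha_bar_t = prod_{i <= t} alpha_i."""
    out, prod = [], 1.0
    for a in alphas:
        prod *= a
        out.append(prod)
    return out

def compose_steps(alphas):
    """Push x_0 through the one-step kernels analytically (scalar case).

    If x_{t-1} | x_0 has mean m * x_0 and variance v, then
    x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps has
    mean sqrt(alpha_t) * m * x_0 and variance alpha_t * v + (1 - alpha_t).
    """
    m, v = 1.0, 0.0
    for a in alphas:
        m, v = math.sqrt(a) * m, a * v + (1.0 - a)
    return m, v

def prior_kl(x0, abar_T):
    """KL( N(sqrt(abar_T) * x0, 1 - abar_T) || N(0, 1) ) for scalar x0."""
    mu, var = math.sqrt(abar_T) * x0, 1.0 - abar_T
    return 0.5 * (var + mu * mu - 1.0 - math.log(var))

alphas = linear_alphas(1000)
abars = alpha_bars(alphas)
m, v = compose_steps(alphas)

print("mean coefficient:", m, "vs sqrt(alpha_bar_T):", math.sqrt(abars[-1]))
print("variance:", v, "vs 1 - alpha_bar_T:", 1.0 - abars[-1])
print("prior matching KL at x0 = 1:", prior_kl(1.0, abars[-1]))
```

Under this assumed schedule $\bar{\alpha}_T$ is close to zero, so the prior matching term is negligible, in line with the remark that it vanishes under our assumptions.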
Using such a schedule, we can get a closed form for the mean $\mu_q(x_t, x_0)$ and variance $\sigma_q^2(t) \mathbf{I}$ of $q(x_{t-1} \mid x_t, x_0)$ (see Eq. (84)^{3}). For the denoising model $p(x_{t-1} \mid x_t)$, we can immediately construct the variance to be the same, while the mean is parametrized as $\mu_p(x_t,t)$. The $\mathcal{KL}$ between two Gaussians with equal covariance then reduces to a scaled squared distance between the means, $\frac{1}{2\sigma_q^2(t)} \lVert \mu_p(x_t,t) - \mu_q(x_t,x_0) \rVert_2^2$.
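The closed form referred to above can be sketched for the scalar case as follows. The expressions for the posterior mean and variance are the standard ones for this Gaussian forward process, written out here as an assumption since these notes only cite them. The function names and the numeric values are hypothetical. As a sanity check, the same posterior is recomputed by precision-weighted Gaussian conditioning.

```python
import math

def posterior_params(x_t, x_0, alpha_t, abar_t, abar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for scalar inputs."""
    mean = (math.sqrt(alpha_t) * (1.0 - abar_prev) * x_t
            + math.sqrt(abar_prev) * (1.0 - alpha_t) * x_0) / (1.0 - abar_t)
    var = (1.0 - alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return mean, var

def posterior_by_conditioning(x_t, x_0, alpha_t, abar_prev):
    """Same posterior via Bayes' rule on two Gaussians.

    Likelihood in x_{t-1}: N(x_t; sqrt(alpha_t) * x_{t-1}, 1 - alpha_t)
    Prior on x_{t-1}:      N(sqrt(abar_prev) * x_0, 1 - abar_prev)
    """
    prec_lik = alpha_t / (1.0 - alpha_t)
    prec_prior = 1.0 / (1.0 - abar_prev)
    var = 1.0 / (prec_lik + prec_prior)
    mean = var * (math.sqrt(alpha_t) * x_t / (1.0 - alpha_t)
                  + math.sqrt(abar_prev) * x_0 / (1.0 - abar_prev))
    return mean, var

def kl_same_variance(mu_q, mu_p, var):
    """KL between two scalar Gaussians that share the variance `var`."""
    return (mu_q - mu_p) ** 2 / (2.0 * var)

# Hypothetical values for a single timestep; abar_t = alpha_t * abar_prev.
alpha_t, abar_prev = 0.9, 0.5
abar_t = alpha_t * abar_prev
x_t, x_0 = 0.3, -1.2

mu_q, var_q = posterior_params(x_t, x_0, alpha_t, abar_t, abar_prev)
mu_b, var_b = posterior_by_conditioning(x_t, x_0, alpha_t, abar_prev)
print("closed form:", mu_q, var_q)
print("conditioning:", mu_b, var_b)
print("KL for a model mean of 0:", kl_same_variance(mu_q, 0.0, var_q))
```

Because the model variance is fixed to $\sigma_q^2(t)$, the only learnable quantity left in each $\mathcal{KL}$ term is the mean, which is what `kl_same_variance` reflects.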
By mirroring the specific form of $\mu_q$ in $\mu_p$, we can simplify the optimization problem to simply denoising the input^{5} at different noise levels:
$\frac{1}{2\sigma_q^2(t)}\frac{\bar{\alpha}_{t-1}(1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2} \left[ \lVert \hat{x}_{\theta}(x_t, t) - x_0 \rVert_2^2 \right]$
Using the definition of the signal-to-noise ratio (SNR) as the ratio of squared mean to variance, $\mathrm{SNR}(t) = \bar{\alpha}_t / (1-\bar{\alpha}_t)$, we can simplify the above objective to
$\frac{1}{2}(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)) \left[ \lVert \hat{x}_{\theta}(x_t, t) - x_0 \rVert_2^2 \right]$
In practice, we can rewrite $x_0$ in $\mu_q(x_t,x_0)$ in terms of a noise random variable $\epsilon_0 \sim \mathcal{N}(0, \mathbf{I})$ through the relation $x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\epsilon_0}{\sqrt{\bar{\alpha}_t}}$. Mirroring the functional form for $\mu_p(x_t, t)$ as earlier, we can instead match the source noise, which works better in practice (see Eq. 115^{3}):
$\frac{1}{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2}{(1-\bar{\alpha}_t)\alpha_t} \left[ \lVert \hat{\epsilon}_{\theta}(x_t, t) - \epsilon_0 \rVert_2^2 \right]$
SDE Perspective
Another alternative objective takes a score-matching form due to Tweedie’s Formula^{6}, which states that the true mean of an exponential family distribution can be estimated by the maximum likelihood estimate plus a correction term involving the score of the estimate. Specifically, for our case of $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t) \mathbf{I})$, the best estimate of its mean is,
$\sqrt{\bar{\alpha}_t}x_0 = x_t + (1 - \bar{\alpha}_t) \nabla_{x_t}\log{p(x_t)}$
Using this to rewrite $x_0$ in $\mu_q(x_t,x_0)$ and then mirroring the functional form for $\mu_p(x_t,t)$ as earlier, we get a new score-matching objective:
$\frac{1}{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2}{\alpha_t} \left[ \lVert \hat{s}_{\theta}(x_t, t) - \nabla_{x_t}\log{p(x_t)} \rVert_2^2 \right]$
The score-matching objective and the noise-prediction objective differ only by a time-dependent scaling factor, since $\nabla_{x_t}\log{p(x_t)} = -\epsilon_0 / \sqrt{1-\bar{\alpha}_t}$.
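The relation between the three parametrizations (predicting $x_0$, the noise, or the score) can be checked numerically. The sketch below is plain Python for the scalar case with hypothetical names and made-up values. It starts from the $x_0$-prediction weight and derives the other two weights by change of variables, using $x_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_0)/\sqrt{\bar{\alpha}_t}$ together with the score relation $-\epsilon_0/\sqrt{1-\bar{\alpha}_t}$ that follows from combining this with Tweedie's formula.

```python
import math

def snr(abar):
    """Signal-to-noise ratio of q(x_t | x_0): abar_t / (1 - abar_t)."""
    return abar / (1.0 - abar)

def x0_weight(alpha_t, abar_t, abar_prev):
    """Weight on the squared x_0-prediction error."""
    var_q = (1.0 - alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return abar_prev * (1.0 - alpha_t) ** 2 / (2.0 * var_q * (1.0 - abar_t) ** 2)

def x0_from_eps(x_t, eps, abar_t):
    """Invert x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for x_0."""
    return (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)

def score_from_eps(eps, abar_t):
    """Score value implied by a noise value: -eps / sqrt(1 - abar_t)."""
    return -eps / math.sqrt(1.0 - abar_t)

# Hypothetical values for one timestep; abar_t = alpha_t * abar_prev.
alpha_t, abar_prev = 0.9, 0.5
abar_t = alpha_t * abar_prev

# Weights for the noise and score forms, obtained by change of variables.
w_x0 = x0_weight(alpha_t, abar_t, abar_prev)
w_eps = w_x0 * (1.0 - abar_t) / abar_t
w_score = w_eps * (1.0 - abar_t)

# One noisy sample and an arbitrary (made-up) noise prediction.
x_0, eps, eps_hat = -1.2, 0.7, 0.4
x_t = math.sqrt(abar_t) * x_0 + math.sqrt(1.0 - abar_t) * eps

loss_x0 = w_x0 * (x0_from_eps(x_t, eps_hat, abar_t) - x_0) ** 2
loss_eps = w_eps * (eps_hat - eps) ** 2
loss_score = w_score * (score_from_eps(eps_hat, abar_t)
                        - score_from_eps(eps, abar_t)) ** 2

print("x0 / noise / score losses:", loss_x0, loss_eps, loss_score)
print("SNR form of the x0 weight:", 0.5 * (snr(abar_prev) - snr(abar_t)), w_x0)
```

All three losses agree for the same underlying prediction, so the choice of parametrization changes only how the error is scaled at each timestep, not the optimum.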
The forward process is described by an SDE as:
$dY_t = b(Y_t, t)dt + dB_t$
The reverse process is:
$dX_t = \left(-b(X_t, T-t) + \nabla_x \log{q_{T-t}(X_t)}\right)dt + d\hat{B}_t$
Here $\hat{B}_t$ is a Brownian motion driving the reverse process. The denoising objective is:
$I(\theta) = \frac{1}{2} \int_0^T \mathbb{E}_{q_{0,t}(x_0,x_t)}\left[ \lVert \nabla_{x}\log q(x_t\mid x_0) - s_\theta(x_t, t) \rVert^2 \right] dt$
To Read
Stable Diffusion.^{7}
Denoising Diffusion Models by Gabriel Peyré (2023)
Footnotes

Benton, Joe, Yuyang Shi, Valentin De Bortoli, George Deligiannidis and A. Doucet. “From Denoising Diffusions to Denoising Markov Models.” ArXiv abs/2211.03595 (2022) https://arxiv.org/abs/2211.03595

Friedman, Nir, Ori Mosenzon, Noam Slonim and Naftali Tishby. “Multivariate Information Bottleneck.” Neural Computation 18 (2001): 1739-1789. https://arxiv.org/abs/1301.2270

Luo, Calvin. “Understanding Diffusion Models: A Unified Perspective.” ArXiv abs/2208.11970 (2022) https://arxiv.org/abs/2208.11970

Ho, Jonathan, Ajay Jain and P. Abbeel. “Denoising Diffusion Probabilistic Models.” ArXiv abs/2006.11239 (2020) https://arxiv.org/abs/2006.11239

Kingma, Diederik P., Tim Salimans, Ben Poole and Jonathan Ho. “Variational Diffusion Models.” ArXiv abs/2107.00630 (2021) https://arxiv.org/abs/2107.00630

Efron, Bradley. “Tweedie’s Formula and Selection Bias.” Journal of the American Statistical Association 106 (2011): 1602-1614. https://www.tandfonline.com/doi/abs/10.1198/jasa.2011.tm11181

Rombach, Robin et al. “High-Resolution Image Synthesis with Latent Diffusion Models.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 10674-10685. https://arxiv.org/abs/2112.10752