The Gaussian Cheatsheet

🧮 math
Table of Contents

This is a collection of key derivations involving Gaussian distributions which commonly arise almost everywhere in Machine Learning.

Normalizing Constant

A Gaussian distribution with mean μ\mu and variance σ2\sigma^2 is given by p(x)∝exp⁡{−(x−μ)22σ2}p(x) \propto \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\}

To derive the normalizing constant for this density, consider the following integral

I=∫−∞∞exp⁡{−(x−μ)22σ2}dx\text{I} = \int_{-\infty}^{\infty} \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\} dx I2=∫−∞∞exp⁡{−(x−μ)22σ2}exp⁡{−(y−μ)22σ2}dxdy\text{I}^2 = \int_{-\infty}^{\infty} \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\} \exp\left\{ - \frac{(y - \mu)^2}{2\sigma^2} \right\} dx dy

First using change of variables u=x−μu = x - \mu and v=y−μv = y - \mu, we have

I2=∫−∞∞∫−∞∞exp⁡{−u2+v22σ2}dudv\text{I}^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left\{ - \frac{u^2 + v^2}{2\sigma^2} \right\} du dv

Transforming to polar coordinates u=rcos⁡θu = r \cos{\theta} and v=rsin⁡θv = r\sin{\theta} using the standard change of variables, we require the Jacobian determinant ∣∂(u,v)∂(r,θ)∣\Big| \frac{\partial (u,v)}{\partial (r, \theta)} \Big|

∣[cos⁡θ−rsin⁡θsin⁡θrcos⁡θ]∣=r\Big| \begin{bmatrix} \cos{\theta} & -r\sin{\theta} \\ \sin{\theta} & r\cos{\theta} \end{bmatrix} \Big| = r I2=∫02π∫0∞exp⁡{−r22σ2}rdrdθ=2π∫0∞exp⁡{−r22σ2}rdr\text{I}^2 = \int_{0}^{2\pi} \int_{0}^{\infty} \exp\left\{ - \frac{r^2}{2\sigma^2} \right\} r dr d\theta = 2\pi \int_{0}^{\infty} \exp\left\{ - \frac{r^2}{2\sigma^2} \right\} r dr

Solving this final integral requires another change of variable. Let z=r22σ2z = \frac{r^2}{2\sigma^2}, hence σ2dz=rdr\sigma^2 dz = r dr.

I2=2πσ2∫0∞e−zdz=2πσ2[−e−z]0∞=2πσ2\text{I}^2 = 2 \pi \sigma^2 \int_{0}^{\infty} e^{-z} dz = 2 \pi \sigma^2 \left[ -e^{-z} \right]_{0}^{\infty} = 2\pi\sigma^2

This implies I=(2πσ2)1/2\text{I} = (2\pi\sigma^2)^{1/2} and hence the complete distribution is now written as

p(x)=1(2πσ2)1/2exp⁡{−(x−μ)22σ2}p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\}

The Gaussian Integral by Keith Conrad shows at least ten other ways to prove this integral. See also Why π\pi is in the normal distribution (beyond integral tricks) .

As a Maximum Entropy Distribution

Interestingly, the Gaussian distribution also turns out to be the maximum entropy distribution on the infinite support for a finite second moment 1. The differential entropy is defined as the expected information H[x]=−∫−∞∞p(x)log⁡p(x)dx\mathbb{H}[x] = -\int_{-\infty}^{\infty} p(x) \log{p(x)} dx of a random variable x∼p(x)x \sim p(x).

To find the maximum entropy distribution, we formally write the constrained optimization problem stated before

max−∫−∞∞p(x)log⁡p(x)dxs.t.∫−∞∞p(x)dx=1∫−∞∞xp(x)dx=μ∫−∞∞(x−μ)2p(x)dx=σ2\begin{aligned} \text{max}& -\int_{-\infty}^{\infty} p(x) \log{p(x)} dx \\ \text{s.t.}& \int_{-\infty}^{\infty} p(x) dx = 1 \\ &\int_{-\infty}^{\infty} x p(x) dx = \mu \\ &\int_{-\infty}^{\infty} (x - \mu)^2 p(x) dx = \sigma^2 \end{aligned}

The constraints correspond to the normalization of probability distributions, finite mean (first moment) and finite variance (finite second moment). This can be converted into an unconstrained optimization problem using Lagrange multipliers 2. The complete objective becomes

−∫−∞∞p(x)log⁡p(x)dx+-\int_{-\infty}^{\infty} p(x) \log{p(x)} dx + λ1(∫−∞∞p(x)dx−1)+λ2(∫−∞∞xp(x)dx−μ)+λ3(∫−∞∞(x−μ)2p(x)dx−σ2)\lambda_1 \left( \int_{-\infty}^{\infty} p(x) dx - 1 \right) + \lambda_2 \left( \int_{-\infty}^{\infty} x p(x) dx - \mu \right) + \lambda_3 \left( \int_{-\infty}^{\infty} (x - \mu)^2 p(x) dx - \sigma^2 \right)

Setting the functional derivative (see Appendix D in Bishop 1) d[f(p(x))]dp(x)=0\frac{d [f(p(x))] }{d p(x)} = 0, we get

−log⁡p(x)+1+λ1+λ2x+λ3(x−μ)2=0-\log{p(x)} + 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0 p(x)=exp⁡{1+λ1+λ2x+λ3(x−μ)2}p(x) = \exp{ \left\{ 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 \right\} }

To recover the precise values of the Lagrange multipliers, we substitute them back into the constraints. The derivation is involved but straightforward. We first manipulate the exponent by completing the squares by noting that for any general quadratic αx2−βx+γ=α(x−β2α)2−12β2−4αγ2α\alpha x^2 - \beta x + \gamma = \alpha \left(x - \frac{\beta}{2\alpha} \right)^2 - \frac{1}{2} \frac{\beta^2 - 4\alpha \gamma}{2\alpha}, which will allow us to re-use results from the previous section. We also further always make use of the subsitution x′=x−β2αx^\prime = x - \frac{\beta}{2\alpha}.

1+λ1+λ2x+λ3(x−μ)2=λ3⏟=αx2−(2μλ3−λ2)⏟=βx+(1+λ1+λ3μ2)⏟=γ1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = \underbrace{\lambda_3}_{=\alpha} x^2 - \underbrace{(2\mu\lambda_3 - \lambda_2)}_{=\beta} x + \underbrace{(1 + \lambda_1 + \lambda_3 \mu^2)}_{=\gamma} p(x)=exp⁡{α(x−β2α)2}exp⁡{−12β2−4αγ2α⏟=δ}p(x) = \exp{ \left\{ \alpha\left(x - \frac{\beta}{2\alpha} \right)^2 \right\} }\exp{ \left\{ -\underbrace{\frac{1}{2}\frac{\beta^2 - 4\alpha \gamma}{2\alpha}}_{= \delta} \right\} }

Putting back these new definitions the normalization constraint is,

∫−∞∞exp⁡{α(x′)2}dx′=exp⁡{δ}\int_{-\infty}^{\infty} \exp{ \left\{ \alpha \left( x^\prime \right)^2 \right\} } dx^\prime = \exp{ \left\{ \delta \right\} }

Using another change of variable y=−αx′y = \sqrt{-\alpha}x^\prime, we reduce this integral to a familiar form and can evaluate using the polar coordinate transformation trick as earlier.

1−α∫−∞∞exp⁡{−y2}dy=exp⁡{δ}  ⟹  exp⁡{δ}=π−α(1)\begin{aligned} \frac{1}{\sqrt{-\alpha}} \int_{-\infty}^{\infty} \exp{ \left\{ -y^2 \right\} }dy &= \exp{\left\{ \delta \right\}} \\ \implies\exp{ \left\{ \delta \right\} } &= \sqrt{\frac{\pi}{-\alpha}} \tag{1} \end{aligned}

Similarly, we put this into the finite first moment constraint and re-apply the substitution y=−αx′y = \sqrt{-\alpha}x^\prime. The first term is an integral of an odd function over the full domain and is nullified.

∫−∞∞(x′+β2α)exp⁡{α(x′)2}dx′=∫−∞∞x′exp⁡{α(x′)2}dx′+β2α∫−∞∞exp⁡{α(x′)2}=μexp⁡{δ}  ⟹  exp⁡{δ}=β2μαπ−α(2)\begin{aligned} \int_{-\infty}^{\infty} (x^\prime + \frac{\beta}{2\alpha}) \exp{\left\{ \alpha (x^\prime)^2 \right\}} dx^\prime &= \cancel{\int_{-\infty}^{\infty} x^\prime \exp{\left\{ \alpha (x^\prime)^2 \right\}} dx^\prime} + \frac{\beta}{2\alpha} \int_{-\infty}^{\infty} \exp{\left\{ \alpha (x^\prime)^2 \right\}} \\ &= \mu\exp{\left\{\delta\right\}} \\ \implies\exp{\left\{ \delta \right\}} &= \frac{\beta}{2\mu\alpha} \sqrt{\frac{\pi}{-\alpha}} \tag{2} \end{aligned}

Combining (1) with (2), we get β=2μα\beta = 2\mu\alpha. Substituting α\alpha and β\beta as defined earlier, we get

λ2=0\lambda_2 = 0

With similar approaches and substitutions, we substitute values in the integral for finite second moment constraint.

∫−∞∞(x′+β2α−μ)2exp⁡{α(x′)2}dx′=σ2exp⁡{δ}\begin{aligned} \int_{-\infty}^{\infty} \left(x^\prime + \cancel{\frac{\beta}{2\alpha}} - \cancel{\mu} \right)^2 \exp{ \left\{ \alpha \left( x^\prime \right)^2 \right\} } dx^\prime &= \sigma^2 \exp{ \left\{ \delta \right\} } \end{aligned}

Focusing on the remaining term, we first apply the change of variable y=−αx′y = \sqrt{-\alpha}x^\prime and note that this is an even function. This allows us to use the next change of variables.

∫−∞∞(x′)2exp⁡{α(x′)2}dx′=1(−α)3/2∫−∞∞y2exp⁡{−y2}dy\int_{-\infty}^{\infty} (x^\prime)^2 \exp{ \left\{ \alpha \left( x^\prime \right)^2 \right\} } dx^\prime = \frac{1}{(-\alpha)^{3/2}} \int_{-\infty}^{\infty} y^2 \exp{ \left\{ -y^2 \right\} } dy

To allow a change of variable further, we first note that this is an even function and symmetric around 00. Using this knowledge, we can change the limits of integration to positive values and use y2=zy^2 = z

2(−α)3/2∫0∞y2exp⁡{−y2}dy=1(−α)3/2∫0∞z1/2exp⁡{−z}dz=Γ(3/2)(−α)3/2=12π−α3\begin{aligned} \frac{2}{(-\alpha)^{3/2}} \int_{0}^{\infty} y^2 \exp{ \left\{ -y^2 \right\} } dy &= \frac{1}{(-\alpha)^{3/2}} \int_{0}^{\infty} z^{1/2} \exp{ \left\{ -z \right\} } dz \\ &= \frac{\Gamma(3/2)}{(-\alpha)^{3/2}} = \frac{1}{2}\sqrt{\frac{\pi}{-\alpha^3}} \end{aligned}

where we utilize the fact that Γ(x+1)=xΓ(x)\Gamma(x + 1) = x\Gamma(x), where Γ(x)=∫0∞ux−1e−udu\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}du is the Gamma function, and Γ(1/2)=π\Gamma(1/2) = \sqrt{\pi}. Plugging everything back and using β=2μα\beta = 2\mu\alpha, we get

12π−α3=σ2exp⁡δ12π−α3=σ2π−α−12α=σ2\begin{aligned} \frac{1}{2}\sqrt{\frac{\pi}{-\alpha^3}} &= \sigma^2 \exp{\delta} \\ \frac{1}{2}\sqrt{\frac{\pi}{-\alpha^3}} &= \sigma^2 \sqrt{\frac{\pi}{-\alpha}} \\ -\frac{1}{2\alpha} &= \sigma^2 \end{aligned}

Using this, we get

λ3=−12σ2\lambda_3 = - \frac{1}{2\sigma^2}

Substituting back in (1), we have

exp⁡{4μ2α2−4αγ4α}=π−αexp⁡{μ2λ3−1−λ1−λ3μ2}=2πσ2λ1=−1−12log⁡2πσ2\begin{aligned} \exp{ \left\{ \frac{4\mu^2\alpha^2 - 4\alpha\gamma}{4\alpha} \right\} } &= \sqrt{\frac{\pi}{-\alpha}} \\ \exp{ \left\{ \mu^2\lambda_3 - 1 - \lambda_1 - \lambda_3\mu^2 \right\} } &= \sqrt{2\pi\sigma^2} \\ \lambda_1 = -1 - \frac{1}{2}\log{2\pi\sigma^2} \end{aligned}

Substituting λ1,λ2,λ3\lambda_1,\lambda_2,\lambda_3 back into p(x)p(x) gives us the form for p(x)=N(μ,σ2)p(x) = \mathcal{N}(\mu, \sigma^2). For maximum entropy distributions under other constraints, see examples on this Wikipedia entry

Gaussian Conditionals

Conditional Gaussian distributions are common in machine learning - common examples being Gaussian Processes and Gaussian Linear dynamical systems. Here we discuss the algebraic manipulations on a joint Gaussian distribution. Consider a joint distribution on two random variables p(x, y) where both x,y may be vectors.

p(y,x)=N([μyμx],[ΣyyΣyxΣxyΣxx])\begin{aligned} p(y, x) = \mathcal{N} \left( \begin{bmatrix} \mu_y \\ \mu_x \end{bmatrix} , \begin{bmatrix} \mathbf{\Sigma}_{yy} & \mathbf{\Sigma}_{yx} \\ \mathbf{\Sigma}_{xy} & \mathbf{\Sigma}_{xx} \end{bmatrix} \right) \end{aligned}

We are interested in p(y∣x)p(y | x) (xx and yy are exchangeable throughout). Noting by the chain rule of probabilities that p(y,x)=p(x)p(y∣x)p(y,x) = p(x)p(y|x) we focus on quadratic part of the exponent.

[y−μyx−μx]T[ΣyyΣyxΣxyΣxx]−1[y−μyx−μx]\begin{aligned} \begin{bmatrix} y - \mu_y \\ x - \mu_x \end{bmatrix}^{T} \begin{bmatrix} \mathbf{\Sigma}_{yy} & \mathbf{\Sigma}_{yx} \\ \mathbf{\Sigma}_{xy} & \mathbf{\Sigma}_{xx} \end{bmatrix}^{-1}\begin{bmatrix} y - \mu_y \\ x - \mu_x \end{bmatrix} \end{aligned}

The Schur complement

We take a digression and discuss the Schur complement 3. Consider a matrix decomposed into blocks

M=[ABCD]\begin{aligned} \mathbf{M} = \begin{bmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{bmatrix} \end{aligned}

We want to (block) diagonalize this so that inversion is easy.

[I−BD−10I][ABCD][I0−D−1CI]=[A−BD−1C00D]\begin{aligned} \begin{bmatrix} \mathbf{I} & -\mathbf{BD}^{-1} \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \begin{bmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{bmatrix} \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{D}^{-1}\mathbf{C} & \mathbf{I} \end{bmatrix} = \begin{bmatrix} \mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C} & \mathbf{0} \\ \mathbf{0} & \mathbf{D} \end{bmatrix} \end{aligned}

The inverse is now easy to compute by using the property that the inverse of a block diagonal matrix is the inverse of each of the diagonal blocks. We also define M/D=A−BD−1C\mathbf{M} / \mathbf{D} = \mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C}.

[ABCD]−1=[I0−D−1CI][(M/D)−100D−1][I−BD−10I]\begin{aligned} \begin{bmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{D}^{-1}\mathbf{C} & \mathbf{I} \end{bmatrix} \begin{bmatrix} (\mathbf{M} / \mathbf{D})^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{D}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{I} & -\mathbf{BD}^{-1} \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \end{aligned}

M/D\mathbf{M}/\mathbf{D} is called the Schur complement of M\mathbf{M} with respect to D\mathbf{D}.

Getting back to our original equation, we can now utilize this result.

[y−μyx−μx]T[I0−Σxx−1ΣxyI][(Σ/Σxx)−100Σyy−1][I−ΣyxΣxx−10I][y−μyx−μx]\begin{aligned} \begin{bmatrix} y - \mu_y \\ x - \mu_x \end{bmatrix}^{T} \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{\Sigma}_{xx}^{-1}\mathbf{\Sigma}_{xy} & \mathbf{I} \end{bmatrix} \begin{bmatrix} (\mathbf{\Sigma} / \mathbf{\Sigma}_{xx})^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{\Sigma}_{yy}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{I} & -\mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1} \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \begin{bmatrix} y - \mu_y \\ x - \mu_x \end{bmatrix} \end{aligned}

Let’s start by resolving the first and last two terms. Remember that we are working with block vectors and matrices. Transpose operators must be preserved and block matrix algebra works like the usual matrix multiplications.

[y−μy−ΣxyΣxx−1(x−μx)x−μx]T[(Σ/Σxx)−100Σxx−1][y−μy−ΣyxΣxx−1(x−μx)x−μx]\begin{aligned} \begin{bmatrix} y - \mu_y - \mathbf{\Sigma}_{xy}\mathbf{\Sigma}_{xx}^{-1} (x - \mu_x) \\ x - \mu_x \end{bmatrix}^{T} \begin{bmatrix} (\mathbf{\Sigma} / \mathbf{\Sigma}_{xx})^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{\Sigma}_{xx}^{-1} \end{bmatrix} \begin{bmatrix} y - \mu_y - \mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1} (x - \mu_x) \\ x - \mu_x \end{bmatrix} \end{aligned}

We are already seeing a pattern. This can be finalize into two separate terms because the central matrix is already block diagonalized.

(y−μy−ΣyxΣxx−1(x−μx))T(Σ/Σxx)−1(y−μy−ΣyxΣxx−1(x−μx))+(x−μx)TΣxx−1(x−μx)\begin{aligned} \left( y - \mu_y - \mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1} (x - \mu_x) \right)^{T}(\mathbf{\Sigma} / \mathbf{\Sigma}_{xx})^{-1}\left( y - \mu_y - \mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1} (x - \mu_x) \right) \\ + \left(x - \mu_x \right)^T\mathbf{\Sigma}_{xx}^{-1}\left(x - \mu_x \right) \end{aligned}

We recover the familiar pattern of quadratic terms in the Gaussian exponent. The exponent can now be written as the product of two exponents such that

p(y,x)=N(μy+ΣyxΣxx−1(x−μx),Σ/Σxx)N(x,Σxx)\begin{aligned} p(y,x) = \mathcal{N}\left( \mu_y + \mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1} (x - \mu_x), \mathbf{\Sigma} / \mathbf{\Sigma}_{xx} \right) \mathcal{N}\left(x, \mathbf{\Sigma}_{xx}\right) \end{aligned}

where Σ/Σxx=Σyy−ΣyxΣxx−1Σxy\mathbf{\Sigma} / \mathbf{\Sigma}_{xx} = \mathbf{\Sigma}_{yy} - \mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1}\mathbf{\Sigma}_{xy}

Linear Dynamical Systems

The classic results on a linear dynamical system defined by

p(x)=N(x;μx,Σx)p(y∣x)=N(y;Ax+b,Σy)\begin{aligned} p(x) &= \mathcal{N}(x; \mu_x, \mathbf{\Sigma}_x) \\ p(y|x) &= \mathcal{N}(y; \mathbf{A}x + \mathbf{b}, \mathbf{\Sigma}_y) \end{aligned}

can be derived pretty easily now by reverse engineering the above equations. For instance, the full joint of this linear dynamical system p(x,y)p(x,y) is given by

p(x,y)=N([μxAμx+b],[ΣxΣxATAΣxΣy+AΣxAT])p(x,y) = \mathcal{N}\left(\begin{bmatrix} \mu_x \\ \mathbf{A}\mu_x + \mathbf{b} \end{bmatrix}, \begin{bmatrix} \mathbf{\Sigma}_x & \mathbf{\Sigma}_x\mathbf{A}^T \\ \mathbf{A}\mathbf{\Sigma}_x & \mathbf{\Sigma}_y + \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T \end{bmatrix} \right)

Computing the joint, marginals and other conditionals now should be a matter of plugging in the right values.


  1. Bishop, C.M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). ↩ ↩2

  2. Boyd, S.P., & Vandenberghe, L. (2006). Convex Optimization. IEEE Transactions on Automatic Control, 51, 1859-1859. ↩

  3. Murphy, K. (2012). Machine learning - a probabilistic perspective. Adaptive computation and machine learning series. ↩