This is a collection of key derivations involving Gaussian distributions, which arise everywhere in machine learning.
Normalizing Constant
A Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is given by

$$p(x) \propto \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\}$$
To derive the normalizing constant for this density, consider the following integral
$$\text{I} = \int_{-\infty}^{\infty} \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\} dx$$

and its square

$$\text{I}^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\} \exp\left\{ - \frac{(y - \mu)^2}{2\sigma^2} \right\} dx\, dy$$
First, using the change of variables $u = x - \mu$ and $v = y - \mu$, we have
$$\text{I}^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left\{ - \frac{u^2 + v^2}{2\sigma^2} \right\} du\, dv$$
Transforming to polar coordinates $u = r \cos{\theta}$ and $v = r\sin{\theta}$ via the standard change of variables, we require the Jacobian determinant $\Big| \frac{\partial (u,v)}{\partial (r, \theta)} \Big|$:

$$\left| \begin{bmatrix} \cos{\theta} & -r\sin{\theta} \\ \sin{\theta} & r\cos{\theta} \end{bmatrix} \right| = r$$
$$\text{I}^2 = \int_{0}^{2\pi} \int_{0}^{\infty} \exp\left\{ - \frac{r^2}{2\sigma^2} \right\} r\, dr\, d\theta = 2\pi \int_{0}^{\infty} \exp\left\{ - \frac{r^2}{2\sigma^2} \right\} r\, dr$$
Solving this final integral requires another change of variable. Let $z = \frac{r^2}{2\sigma^2}$, hence $\sigma^2 dz = r\, dr$.
$$\text{I}^2 = 2 \pi \sigma^2 \int_{0}^{\infty} e^{-z} dz = 2 \pi \sigma^2 \left[ -e^{-z} \right]_{0}^{\infty} = 2\pi\sigma^2$$
This implies $\text{I} = (2\pi\sigma^2)^{1/2}$, and hence the complete distribution can be written as
$$p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\}$$
The Gaussian Integral by Keith Conrad shows at least ten other ways to evaluate this integral. See also Why $\pi$ is in the normal distribution (beyond integral tricks).
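As a quick numerical sanity check (a minimal sketch using SciPy; the values of $\mu$ and $\sigma$ are arbitrary), we can confirm that the unnormalized density integrates to $(2\pi\sigma^2)^{1/2}$:

```python
import numpy as np
from scipy import integrate

mu, sigma = 1.5, 0.7  # arbitrary example values

# Unnormalized Gaussian density
f = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# Integrate over the real line and compare with the closed form
I, _ = integrate.quad(f, -np.inf, np.inf)
print(I, np.sqrt(2 * np.pi * sigma ** 2))  # both ≈ 1.7546
```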
As a Maximum Entropy Distribution
Interestingly, the Gaussian distribution also turns out to be the maximum entropy distribution on infinite support for a fixed, finite second moment [1].
The differential entropy of a random variable $x \sim p(x)$ is defined as the expected information

$$\mathbb{H}[x] = -\int_{-\infty}^{\infty} p(x) \log{p(x)}\, dx$$
To find the maximum entropy distribution, we formally pose the constrained optimization problem
$$\begin{aligned}
\text{max}\quad& -\int_{-\infty}^{\infty} p(x) \log{p(x)}\, dx \\
\text{s.t.}\quad& \int_{-\infty}^{\infty} p(x)\, dx = 1 \\
&\int_{-\infty}^{\infty} x\, p(x)\, dx = \mu \\
&\int_{-\infty}^{\infty} (x - \mu)^2 p(x)\, dx = \sigma^2
\end{aligned}$$
The constraints correspond to the normalization of a probability distribution, a fixed mean (first moment), and a fixed variance (second central moment). This can be converted into an unconstrained optimization problem using Lagrange multipliers [2].
The complete objective becomes
$$-\int_{-\infty}^{\infty} p(x) \log{p(x)}\, dx + \lambda_1 \left( \int_{-\infty}^{\infty} p(x)\, dx - 1 \right) + \lambda_2 \left( \int_{-\infty}^{\infty} x\, p(x)\, dx - \mu \right) + \lambda_3 \left( \int_{-\infty}^{\infty} (x - \mu)^2 p(x)\, dx - \sigma^2 \right)$$
Setting the functional derivative (see Appendix D in Bishop [1]) $\frac{d [f(p(x))]}{d p(x)} = 0$, we get
$$-\log{p(x)} - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0$$

$$p(x) = \exp\left\{ -1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 \right\}$$
To recover the precise values of the Lagrange multipliers, we substitute this form of $p(x)$ back into the constraints. The derivation is involved but straightforward. We first manipulate the exponent by completing the square, noting that any general quadratic can be written as $\alpha x^2 - \beta x + \gamma = \alpha \left(x - \frac{\beta}{2\alpha} \right)^2 - \frac{1}{2} \frac{\beta^2 - 4\alpha \gamma}{2\alpha}$, which will allow us to re-use results from the previous section. Throughout, we also make use of the substitution $x^\prime = x - \frac{\beta}{2\alpha}$.
$$-1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = \underbrace{\lambda_3}_{=\alpha} x^2 - \underbrace{(2\mu\lambda_3 - \lambda_2)}_{=\beta} x + \underbrace{(\lambda_1 - 1 + \lambda_3 \mu^2)}_{=\gamma}$$
$$p(x) = \exp\left\{ \alpha\left(x - \frac{\beta}{2\alpha} \right)^2 \right\} \exp\Bigg\{ -\underbrace{\frac{1}{2}\frac{\beta^2 - 4\alpha \gamma}{2\alpha}}_{= \delta} \Bigg\}$$
With these new definitions in place, the normalization constraint becomes
$$\int_{-\infty}^{\infty} \exp\left\{ \alpha \left( x^\prime \right)^2 \right\} dx^\prime = \exp\left\{ \delta \right\}$$
Using another change of variable $y = \sqrt{-\alpha}\, x^\prime$ (note that $\alpha < 0$ for the density to be normalizable), we reduce this integral to a familiar form and can evaluate it using the polar coordinate trick from earlier.
$$\begin{aligned}
\frac{1}{\sqrt{-\alpha}} \int_{-\infty}^{\infty} \exp{ \left\{ -y^2 \right\} }dy &= \exp{\left\{ \delta \right\}} \\
\implies\exp{ \left\{ \delta \right\} } &= \sqrt{\frac{\pi}{-\alpha}} \tag{1}
\end{aligned}$$
Similarly, we put this into the first moment constraint and re-apply the substitution $y = \sqrt{-\alpha}\, x^\prime$. The first term is an integral of an odd function over the full domain and vanishes.
$$\begin{aligned}
\int_{-\infty}^{\infty} \left(x^\prime + \frac{\beta}{2\alpha}\right) \exp{\left\{ \alpha (x^\prime)^2 \right\}} dx^\prime &= \cancel{\int_{-\infty}^{\infty} x^\prime \exp{\left\{ \alpha (x^\prime)^2 \right\}} dx^\prime} + \frac{\beta}{2\alpha} \int_{-\infty}^{\infty} \exp{\left\{ \alpha (x^\prime)^2 \right\}} dx^\prime \\
&= \mu\exp{\left\{\delta\right\}} \\
\implies\exp{\left\{ \delta \right\}} &= \frac{\beta}{2\mu\alpha} \sqrt{\frac{\pi}{-\alpha}} \tag{2}
\end{aligned}$$
Combining (1) with (2), we get $\beta = 2\mu\alpha$. Substituting $\alpha$ and $\beta$ as defined earlier, we get

$$\lambda_2 = 0$$
With a similar approach and substitutions, we substitute values into the integral for the second moment constraint. Note that since $\beta = 2\mu\alpha$, the terms $\frac{\beta}{2\alpha}$ and $\mu$ cancel.
$$\begin{aligned}
\int_{-\infty}^{\infty} \left(x^\prime + \cancel{\frac{\beta}{2\alpha}} - \cancel{\mu} \right)^2 \exp{ \left\{ \alpha \left( x^\prime \right)^2 \right\} } dx^\prime &= \sigma^2 \exp{ \left\{ \delta \right\} }
\end{aligned}$$
Focusing on the remaining term, we again apply the change of variable $y = \sqrt{-\alpha}\, x^\prime$.
$$\int_{-\infty}^{\infty} (x^\prime)^2 \exp{ \left\{ \alpha \left( x^\prime \right)^2 \right\} } dx^\prime = \frac{1}{(-\alpha)^{3/2}} \int_{-\infty}^{\infty} y^2 \exp{ \left\{ -y^2 \right\} } dy$$
Since the integrand is an even function, symmetric about $0$, we can restrict the limits of integration to $[0, \infty)$ (doubling the integral) and substitute $y^2 = z$.
$$\begin{aligned}
\frac{2}{(-\alpha)^{3/2}} \int_{0}^{\infty} y^2 \exp{ \left\{ -y^2 \right\} } dy &= \frac{1}{(-\alpha)^{3/2}} \int_{0}^{\infty} z^{1/2} \exp{ \left\{ -z \right\} } dz \\
&= \frac{\Gamma(3/2)}{(-\alpha)^{3/2}} = \frac{1}{2}\sqrt{\frac{\pi}{(-\alpha)^{3}}}
\end{aligned}$$
where we utilize the fact that $\Gamma(x + 1) = x\Gamma(x)$, where $\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}du$ is the Gamma function, and $\Gamma(1/2) = \sqrt{\pi}$. Plugging everything back and using $\beta = 2\mu\alpha$, we get
$$\begin{aligned}
\frac{1}{2}\sqrt{\frac{\pi}{(-\alpha)^{3}}} &= \sigma^2 \exp{\left\{\delta\right\}} \\
\frac{1}{2}\sqrt{\frac{\pi}{(-\alpha)^{3}}} &= \sigma^2 \sqrt{\frac{\pi}{-\alpha}} \\
-\frac{1}{2\alpha} &= \sigma^2
\end{aligned}$$
Using this, we get
$$\lambda_3 = - \frac{1}{2\sigma^2}$$
Substituting back in (1), we have
$$\begin{aligned}
\exp{ \left\{ \frac{4\mu^2\alpha^2 - 4\alpha\gamma}{4\alpha} \right\} } &= \sqrt{\frac{\pi}{-\alpha}} \\
\exp{ \left\{ \mu^2\lambda_3 + 1 - \lambda_1 - \lambda_3\mu^2 \right\} } &= \sqrt{2\pi\sigma^2} \\
\lambda_1 &= 1 - \frac{1}{2}\log{2\pi\sigma^2}
\end{aligned}$$
Substituting $\lambda_1, \lambda_2, \lambda_3$ back into $p(x)$ recovers the Gaussian density $p(x) = \mathcal{N}(\mu, \sigma^2)$. For maximum entropy distributions under other constraints, see the examples on this Wikipedia entry.
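As a sanity check (a minimal sketch using SymPy; the symbolic simplification is only one way to verify this), we can plug the recovered multipliers back into the stationarity condition and confirm it reproduces the Gaussian density:

```python
import sympy as sp

x, mu = sp.symbols('x mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# Multipliers recovered from the three constraints
lam1 = 1 - sp.Rational(1, 2) * sp.log(2 * sp.pi * sigma**2)
lam2 = 0
lam3 = -1 / (2 * sigma**2)

# Stationary point of the Lagrangian: p(x) = exp{-1 + lam1 + lam2*x + lam3*(x - mu)^2}
p = sp.exp(-1 + lam1 + lam2 * x + lam3 * (x - mu)**2)
gaussian = sp.exp(-(x - mu)**2 / (2 * sigma**2)) / sp.sqrt(2 * sp.pi * sigma**2)

print(sp.simplify(p / gaussian))  # should print 1
```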
Gaussian Conditionals
Conditional Gaussian distributions are common in machine learning; familiar examples include Gaussian processes and linear Gaussian dynamical systems. Here we discuss the algebraic manipulations on a joint Gaussian distribution. Consider a joint distribution $p(y, x)$ over two random variables, where both $x$ and $y$ may be vectors.
$$p(y, x) = \mathcal{N} \left(
\begin{bmatrix} \mu_y \\ \mu_x \end{bmatrix},
\begin{bmatrix} \mathbf{\Sigma}_{yy} & \mathbf{\Sigma}_{yx} \\ \mathbf{\Sigma}_{xy} & \mathbf{\Sigma}_{xx} \end{bmatrix} \right)$$
We are interested in $p(y \mid x)$ ($x$ and $y$ are exchangeable throughout). Noting by the chain rule of probability that $p(y,x) = p(x)\, p(y \mid x)$, we focus on the quadratic part of the exponent.
$$\begin{bmatrix} y - \mu_y \\ x - \mu_x \end{bmatrix}^{T} \begin{bmatrix} \mathbf{\Sigma}_{yy} & \mathbf{\Sigma}_{yx} \\ \mathbf{\Sigma}_{xy} & \mathbf{\Sigma}_{xx} \end{bmatrix}^{-1}\begin{bmatrix} y - \mu_y \\ x - \mu_x \end{bmatrix}$$
The Schur complement
We take a brief digression to discuss the Schur complement [3]. Consider a matrix decomposed into blocks
$$\mathbf{M} = \begin{bmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{bmatrix}$$
We want to (block) diagonalize this so that inversion is easy.
$$\begin{bmatrix}
\mathbf{I} & -\mathbf{BD}^{-1} \\ \mathbf{0} & \mathbf{I}
\end{bmatrix}
\begin{bmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{bmatrix}
\begin{bmatrix}
\mathbf{I} & \mathbf{0} \\ -\mathbf{D}^{-1}\mathbf{C} & \mathbf{I}
\end{bmatrix} =
\begin{bmatrix}
\mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C} & \mathbf{0} \\
\mathbf{0} & \mathbf{D}
\end{bmatrix}$$
The inverse is now easy to compute: from the block diagonalization, $\mathbf{M}^{-1}$ is obtained by inverting the block diagonal matrix block by block and sandwiching it between the two triangular factors. We also define $\mathbf{M} / \mathbf{D} = \mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C}$.
$$\begin{bmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{bmatrix}^{-1} =
\begin{bmatrix}
\mathbf{I} & \mathbf{0} \\ -\mathbf{D}^{-1}\mathbf{C} & \mathbf{I}
\end{bmatrix}
\begin{bmatrix}
(\mathbf{M} / \mathbf{D})^{-1} & \mathbf{0} \\
\mathbf{0} & \mathbf{D}^{-1}
\end{bmatrix}
\begin{bmatrix}
\mathbf{I} & -\mathbf{BD}^{-1} \\ \mathbf{0} & \mathbf{I}
\end{bmatrix}$$
$\mathbf{M}/\mathbf{D}$ is called the Schur complement of $\mathbf{M}$ with respect to $\mathbf{D}$.
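As a quick check (a minimal sketch with NumPy on a random positive definite matrix; the block sizes are arbitrary), we can verify the block-inverse identity numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 2, 3  # arbitrary block sizes

# Random symmetric positive definite matrix, partitioned into blocks
M = rng.normal(size=(p + q, p + q))
M = M @ M.T + (p + q) * np.eye(p + q)
A, B = M[:p, :p], M[:p, p:]
C, D = M[p:, :p], M[p:, p:]

# Schur complement of M with respect to D
M_D = A - B @ np.linalg.inv(D) @ C

# Assemble the block expression for the inverse
I_p, I_q, Z = np.eye(p), np.eye(q), np.zeros((p, q))
left = np.block([[I_p, Z], [-np.linalg.inv(D) @ C, I_q]])
middle = np.block([[np.linalg.inv(M_D), Z], [Z.T, np.linalg.inv(D)]])
right = np.block([[I_p, -B @ np.linalg.inv(D)], [Z.T, I_q]])

print(np.allclose(left @ middle @ right, np.linalg.inv(M)))  # True
```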
Getting back to our original equation, we can now utilize this result.
$$\begin{bmatrix} y - \mu_y \\ x - \mu_x \end{bmatrix}^{T}
\begin{bmatrix}
\mathbf{I} & \mathbf{0} \\ -\mathbf{\Sigma}_{xx}^{-1}\mathbf{\Sigma}_{xy} & \mathbf{I}
\end{bmatrix}
\begin{bmatrix}
(\mathbf{\Sigma} / \mathbf{\Sigma}_{xx})^{-1} & \mathbf{0} \\
\mathbf{0} & \mathbf{\Sigma}_{xx}^{-1}
\end{bmatrix}
\begin{bmatrix}
\mathbf{I} & -\mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1} \\ \mathbf{0} & \mathbf{I}
\end{bmatrix}
\begin{bmatrix} y - \mu_y \\ x - \mu_x \end{bmatrix}$$
Let's start by resolving the first two and last two terms. Remember that we are working with block vectors and matrices: transpose operators must be handled carefully, and block matrix algebra works like ordinary matrix multiplication.
$$\begin{bmatrix} y - \mu_y - \mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1} (x - \mu_x) \\ x - \mu_x \end{bmatrix}^{T}
\begin{bmatrix}
(\mathbf{\Sigma} / \mathbf{\Sigma}_{xx})^{-1} & \mathbf{0} \\
\mathbf{0} & \mathbf{\Sigma}_{xx}^{-1}
\end{bmatrix}
\begin{bmatrix} y - \mu_y - \mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1} (x - \mu_x) \\ x - \mu_x \end{bmatrix}$$
We can already see a pattern. Because the central matrix is block diagonal, this splits into two separate quadratic terms.
$$\begin{aligned}
&\left( y - \mu_y - \mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1} (x - \mu_x) \right)^{T}(\mathbf{\Sigma} / \mathbf{\Sigma}_{xx})^{-1}\left( y - \mu_y - \mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1} (x - \mu_x) \right) \\
&+ \left(x - \mu_x \right)^T\mathbf{\Sigma}_{xx}^{-1}\left(x - \mu_x \right)
\end{aligned}$$
We recover the familiar pattern of quadratic terms in the Gaussian exponent. The exponent can now be written as the product of two exponentials, such that
$$p(y,x) = \mathcal{N}\left(y;\ \mu_y + \mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1} (x - \mu_x),\ \mathbf{\Sigma} / \mathbf{\Sigma}_{xx} \right) \, \mathcal{N}\left(x;\ \mu_x, \mathbf{\Sigma}_{xx}\right)$$
where $\mathbf{\Sigma} / \mathbf{\Sigma}_{xx} = \mathbf{\Sigma}_{yy} - \mathbf{\Sigma}_{yx}\mathbf{\Sigma}_{xx}^{-1}\mathbf{\Sigma}_{xy}$. The first factor is the conditional $p(y \mid x)$ and the second is the marginal $p(x)$; the normalizing constants also factor correctly, since the block diagonalization gives $|\mathbf{\Sigma}| = |\mathbf{\Sigma} / \mathbf{\Sigma}_{xx}|\, |\mathbf{\Sigma}_{xx}|$.
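To make this concrete (a sketch with NumPy and SciPy; the particular joint mean and covariance below are arbitrary example values), we can compare the conditional obtained from these formulas against the joint divided by the marginal at a test point:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Example joint over (y, x), with y 2-dimensional and x 1-dimensional
mu_y, mu_x = np.array([1.0, -1.0]), np.array([0.5])
S_yy = np.array([[2.0, 0.3], [0.3, 1.0]])
S_yx = np.array([[0.5], [0.2]])
S_xy = S_yx.T
S_xx = np.array([[1.5]])

mu = np.concatenate([mu_y, mu_x])
Sigma = np.block([[S_yy, S_yx], [S_xy, S_xx]])

# Conditional p(y | x) from the Schur complement
x = np.array([1.2])
cond_mean = mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x)
cond_cov = S_yy - S_yx @ np.linalg.solve(S_xx, S_xy)

# Compare against p(y, x) / p(x) at a test point y
y = np.array([0.7, -0.4])
lhs = multivariate_normal(cond_mean, cond_cov).pdf(y)
rhs = (multivariate_normal(mu, Sigma).pdf(np.concatenate([y, x]))
       / multivariate_normal(mu_x, S_xx).pdf(x))
print(np.isclose(lhs, rhs))  # True
```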
Linear Dynamical Systems
The classic results on a linear dynamical system defined by
$$\begin{aligned}
p(x) &= \mathcal{N}(x; \mu_x, \mathbf{\Sigma}_x) \\
p(y \mid x) &= \mathcal{N}(y; \mathbf{A}x + \mathbf{b}, \mathbf{\Sigma}_y)
\end{aligned}$$
can now be derived fairly easily by reverse-engineering the above equations. For instance, the full joint $p(x,y)$ of this linear dynamical system is given by
$$p(x,y) = \mathcal{N}\left(\begin{bmatrix}
\mu_x \\ \mathbf{A}\mu_x + \mathbf{b}
\end{bmatrix}, \begin{bmatrix}
\mathbf{\Sigma}_x & \mathbf{\Sigma}_x\mathbf{A}^T \\
\mathbf{A}\mathbf{\Sigma}_x & \mathbf{\Sigma}_y + \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T
\end{bmatrix} \right)$$
Computing the marginals and other conditionals, for example $p(y)$ and $p(x \mid y)$, is now a matter of plugging the right blocks into the results of the previous section.
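As a final sketch (NumPy only; $\mathbf{A}$, $\mathbf{b}$, and the covariances below are arbitrary example values), we can verify the joint covariance blocks by Monte Carlo sampling from $p(x)$ and $p(y \mid x)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary example parameters for p(x) = N(mu_x, S_x) and p(y|x) = N(Ax + b, S_y)
mu_x = np.array([0.5, -1.0])
S_x = np.array([[1.0, 0.2], [0.2, 0.5]])
A = np.array([[1.5, -0.3], [0.4, 0.8], [0.0, 1.1]])
b = np.array([0.1, -0.2, 0.3])
S_y = 0.2 * np.eye(3)

# Sample from the generative model
n = 200_000
x = rng.multivariate_normal(mu_x, S_x, size=n)
y = x @ A.T + b + rng.multivariate_normal(np.zeros(3), S_y, size=n)

# Empirical joint covariance of (x, y) vs the closed-form blocks
emp = np.cov(np.hstack([x, y]).T)
theory = np.block([[S_x, S_x @ A.T],
                   [A @ S_x, S_y + A @ S_x @ A.T]])
print(np.abs(emp - theory).max())  # small (Monte Carlo error), roughly 1e-2 or below
```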