Orthogonal projectors and linear regression

A formal geometric summary of what the maximum likelihood solution for linear regression signifies.

Linear regression is one of the most well-studied machine learning algorithms. A geometric perspective on what the maximum likelihood solution of a linear regression problem signifies also makes the rounds, although the explanations are often imprecise and hand-wavy. This post is hopefully a helpful summary for everyone searching for a precise argument behind the geometric interpretation of the least squares solution.

Setup

As usual, linear regression involves a dataset $\mathcal{D}$ of $N$ pairs of inputs in $\mathbb{R}^D$ and outputs in $\mathbb{R}$, $\{ \mathbf{x}_i, y_i \}_{i=1}^N$. The objective is to learn the parameters $\mathbf{w}$ of a linear function of the form

$$y = \mathbf{w}^T\mathbf{\phi}(\mathbf{x}) + \epsilon,$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is data-independent additive Gaussian noise with variance $\sigma^2$. $\mathbf{\phi}$ represents the vector $\begin{bmatrix} \phi_1(\mathbf{x}) & \phi_2(\mathbf{x}) & \cdots & \phi_M(\mathbf{x}) \end{bmatrix}$ of features extracted from each input variable using a set of $M$ basis functions $\phi_i$. We want to generalize beyond the training data and quantify our performance using the mean squared error as the notion of risk. We will not be Bayesian about this problem for now and instead focus on getting a point value $\mathbf{w}_{ML}$ called the maximum likelihood estimate, as is commonly used. The key result typically shown in an introductory machine learning class for the maximum likelihood estimate under mean squared error is reproduced below:

$$\mathbf{w}_{ML} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{y},$$

where $\mathbf{y}$ is the column vector in $\mathbb{R}^N$ consisting of all the output values from our training data set and $\Phi$ is the $N \times M$ design matrix with $\mathbf{\phi}(\mathbf{x}_i)$, $i \in \{1, \dots, N\}$, in each row, as below.

$$\Phi = \begin{pmatrix} \phi_1(\mathbf{x}_1) & \cdots & \phi_M(\mathbf{x}_1) \\ \vdots & \ddots & \vdots \\ \phi_1(\mathbf{x}_N) & \cdots & \phi_M(\mathbf{x}_N) \end{pmatrix}$$
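To make the notation concrete, here is a minimal NumPy sketch that builds such a design matrix and computes $\mathbf{w}_{ML}$ from the formula above. The one-dimensional inputs, the polynomial basis functions, and the noise level are my own illustrative assumptions, not part of the setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data (assumed for illustration): N inputs, noisy outputs.
N, M = 50, 4
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(N)

# Design matrix with a polynomial basis phi_j(x) = x**j, j = 0..M-1.
Phi = np.vander(x, M, increasing=True)          # shape (N, M)

# Maximum likelihood estimate: w_ML = (Phi^T Phi)^{-1} Phi^T y.
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # solve() is preferable to an explicit inverse
```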

We are interested in the geometric interpretation of this $\mathbf{w}_{ML}$, and for that we use a formal tool from linear algebra called the orthogonal projector. Familiarity with basic linear algebra is assumed.

Projectors

In the absence of visuals, imagine everything in two or three dimensions for now. Everything generalizes to higher finite-dimensional spaces.

A projector [1] is a square matrix $P$ that satisfies $P^2 = P$. In plain words, this means that for any vector $v$, $Pv$ is the projection of $v$ into the column space of $P$. Further, re-applying the projection, $P(Pv)$, keeps the new vector intact. Such a matrix is also known as an idempotent matrix.
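As a quick numerical sanity check of the idempotence property, here is a tiny NumPy sketch; the particular (oblique) projector is an assumed example, nothing special about it.

```python
import numpy as np

# An oblique (non-orthogonal) projector onto the x-axis, chosen for illustration.
P = np.array([[1.0, 1.0],
              [0.0, 0.0]])

v = np.array([2.0, 3.0])
assert np.allclose(P @ P, P)            # idempotence: P^2 = P
assert np.allclose(P @ (P @ v), P @ v)  # re-applying the projection changes nothing
```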

Complementary Projectors

If $P$ is a projector, then $I - P$ is also a projector, where $I$ is the identity matrix of compatible dimensions; indeed, $(I - P)^2 = I - 2P + P^2 = I - P$. This is known as the complementary projector, for reasons we describe next.

$I - P$ projects exactly onto the null space of $P$. To see why, we note that for any vector $v$ in the null space of $P$, $Pv = 0 \implies (I - P)v = v$. This means that the space spanned by $I - P$ is at least as large as the null space of $P$. Further, for any $v$, $(I - P)v$ belongs to the null space of $P$ since $P(I - P)v = 0$. This means that the space spanned by $I - P$ can at most be as large as the null space of $P$. Combined, we get exactly the null space of $P$.

Complementary projectors allow us to decompose the space as a direct sum [2] of two subspaces, $S_1 = \text{range}(P) = \text{null}(I-P)$ and $S_2 = \text{null}(P) = \text{range}(I-P)$; every vector $v$ splits uniquely as $v = Pv + (I - P)v$. This is only possible because $S_1 \cap S_2 = \{0\}$. To see why this is true, we note that any vector $v$ in both the null space of $P$ and the null space of $I - P$ can be manipulated as $v = v - Pv = (I - P)v = 0$. This gives an important result: any projector $P$ splits the space into two distinct subspaces. This is why the projector $I - P$ complements the projector $P$.
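This decomposition can also be checked numerically; the sketch below reuses the same illustrative oblique projector as before.

```python
import numpy as np

P = np.array([[1.0, 1.0],
              [0.0, 0.0]])   # same illustrative projector as above
I = np.eye(2)

v = np.array([2.0, 3.0])
v1, v2 = P @ v, (I - P) @ v                   # components in range(P) and null(P)
assert np.allclose(v1 + v2, v)                # unique decomposition v = Pv + (I - P)v
assert np.allclose(P @ v2, 0.0)               # (I - P)v lies in the null space of P
assert np.allclose((I - P) @ (I - P), I - P)  # I - P is itself a projector
```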

Orthogonal Projectors

A projector that splits the space into orthogonal subspaces is known as an orthogonal projector. A projector matrix that is Hermitian, $P^\star = P$, is an orthogonal projector; here $\star$ denotes the complex conjugate transpose.

It is straightforward to see using an inner product that $Px$ and $(I - P)y$ are orthogonal for any $x, y \in \mathbb{C}^N$ when $P$ is Hermitian: $(Px)^\star (I - P)y = x^\star(P^\star - P^\star P)y = x^\star(P - P^2)y = 0$. Therefore, $P$ projects onto a space where every vector is orthogonal to the space that $I - P$ projects onto.

Visually, it is helpful to imagine two orthogonal lines $\ell_1 \perp \ell_2$ (lines are one-dimensional subspaces), such that if $P$ projects onto $\ell_1$, then $I - P$ projects onto $\ell_2$.
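For a concrete instance of this picture, the sketch below builds the orthogonal projector onto a line spanned by an (assumed) unit vector $u$, namely $P = uu^T$, and checks the orthogonality claim numerically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Orthogonal projector onto the line spanned by a unit vector u (illustrative choice).
u = np.array([1.0, 2.0]) / np.sqrt(5.0)
P = np.outer(u, u)                      # P = u u^T is symmetric and idempotent

x, y = rng.standard_normal(2), rng.standard_normal(2)
assert np.allclose(P.T, P)              # Hermitian (real symmetric here)
assert np.isclose((P @ x) @ ((np.eye(2) - P) @ y), 0.0)  # Px ⊥ (I - P)y
```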

An elegant little result that I state here without many details: Gram-Schmidt orthonormalization can be represented as a series of recursively built projection matrices, as sketched below.
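Here is a minimal sketch of that idea, assuming classical Gram-Schmidt and ignoring numerical-stability concerns; at each step, the projector $QQ^\star$ onto the already-built orthonormal columns is subtracted off.

```python
import numpy as np

def gram_schmidt(A):
    """Orthonormalize the columns of A (assumed full column rank) by
    repeatedly projecting out the span of the columns found so far."""
    n, m = A.shape
    Q = np.zeros((n, 0))
    for j in range(m):
        P = Q @ Q.T                      # orthogonal projector onto span of current Q
        q = A[:, j] - P @ A[:, j]        # equivalently (I - P) @ A[:, j]
        Q = np.column_stack([Q, q / np.linalg.norm(q)])
    return Q

Q = gram_schmidt(np.random.default_rng(2).standard_normal((5, 3)))
assert np.allclose(Q.T @ Q, np.eye(3))   # columns are orthonormal
```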

Projectors onto an arbitrary basis

So far, we've discussed the properties of a given projection matrix $P$. Imagine the reverse: given an arbitrary basis $\{b_1, b_2, \dots, b_M\}$ for an $M$-dimensional subspace of $\mathbb{C}^N$ ($N > M$), can we construct an orthogonal projector onto this subspace? The answer is, of course, yes. We denote by $B$ the matrix made by stacking the basis vectors $\{b_1, b_2, \dots, b_M\}$ as columns.

For any arbitrary vector $v$, let $y$ be its projection through an orthogonal projector $P$ onto the column space of $B$, so that $y = Pv$. By the properties of orthogonal projectors we've discussed earlier, $y - v$ should be orthogonal to the space spanned by the columns of $B$. It is sufficient to enforce orthogonality to every basis vector $\{b_1, b_2, \dots, b_M\}$, written succinctly as $B^\star(y - v) = 0$. Since $y$ lies in the column space of $B$, it can be represented as $y = Bx$ for some $x$; solving for $x$ gives $x = (B^\star B)^{-1}B^\star v$. Putting this back into $y = Bx$, we get

$$y = \underbrace{B(B^\star B)^{-1}B^\star}_{P}\,v.$$

Comparing this to $y = Pv$, we have the definition of our orthogonal projector.
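The construction is easy to verify numerically; the basis below is an arbitrary random one, chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# An arbitrary (not orthonormal) basis for a 3-dimensional subspace of R^8.
B = rng.standard_normal((8, 3))
P = B @ np.linalg.inv(B.T @ B) @ B.T        # P = B (B^* B)^{-1} B^*

v = rng.standard_normal(8)
assert np.allclose(P @ P, P)                # idempotent
assert np.allclose(P.T, P)                  # Hermitian (symmetric for real B)
assert np.allclose(B.T @ (P @ v - v), 0.0)  # residual is orthogonal to the basis
```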

The Maximum Likelihood Solution

With everything we've laid out, one should find a striking similarity between the maximum likelihood solution $\mathbf{w}_{ML}$ and the projector $P$. Understanding the explanation should now be straightforward. Let's state the geometric interpretation for completeness.

The maximum likelihood solution $\mathbf{w}_{ML}$ of linear regression (ordinary least squares) is such that $\Phi \mathbf{w}_{ML}$ is the orthogonal projection of $\mathbf{y}$ onto the subspace spanned by the columns of the design matrix $\Phi$.

Intuitively, when the number of basis functions $M$ is less than the number of data points $N$, ordinary least squares minimizes the error by orthogonally projecting $\mathbf{y}$ onto the subspace spanned by the chosen basis functions. This result is rather natural: the distance to a subspace is minimized by an orthogonal projection, and linear regression is doing the best it can!
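The claim can be checked directly. The sketch below reuses the illustrative polynomial design matrix from the Setup section and compares $\Phi\mathbf{w}_{ML}$ against $P\mathbf{y}$ with $P = \Phi(\Phi^T\Phi)^{-1}\Phi^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(N)
Phi = np.vander(x, M, increasing=True)

w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T       # orthogonal projector onto col(Phi)

assert np.allclose(Phi @ w_ml, P @ y)              # fitted values = projection of y
assert np.allclose(Phi.T @ (y - Phi @ w_ml), 0.0)  # residual ⊥ every basis column
```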

This also reminds us to choose our basis carefully, or the inversion will suffer because $\Phi^T\Phi$ can be close to singular. For the sake of completeness, I note that such issues can be addressed by a (reduced) singular value decomposition, or simply by regularization, which guarantees full rank.
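As a rough sketch of those two remedies (the high-degree basis and the regularization strength below are arbitrary illustrative choices): `np.linalg.pinv` computes an SVD-based pseudoinverse, and ridge regularization adds $\lambda I$ so that the matrix being inverted is always full rank.

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = np.vander(rng.uniform(-1, 1, 50), 12, increasing=True)  # high-degree basis: poorly conditioned
y = rng.standard_normal(50)

# SVD-based pseudoinverse avoids forming (Phi^T Phi)^{-1} explicitly.
w_pinv = np.linalg.pinv(Phi) @ y

# Ridge regularization: Phi^T Phi + lam * I is full rank for any lam > 0.
lam = 1e-3                                                    # illustrative value
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
```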

Footnotes

  1. Trefethen, L., & Bau, D. (1997). Numerical Linear Algebra.

  2. Axler, S. (1995). Linear Algebra Done Right.