Linear regression is one of the most well studied machine learning algorithms. A geometric perspective also makes rounds as to what the maximum likelihood solution of a linear regression problem signifies, although the explanations are often imprecise and hand-wavy. This post will hopefully be a helpful summary to everyone who was searching for a precise argument around the geometric interpretation of a least squares solution.
As usual, linear regression involves a dataset of pairs of inputs, in and outputs in , . The objective is to learn the parameters of a linear function of the form [^b]
where is data-independent Gaussian additive noise with variance. represents the vector of features extracted from each input variable using a set of basis functions . We want to generalize beyond the training data and quantify our performance using the mean squared error as the notion of risk. We will not be Bayesian about this problem for now and instead focus on getting a point value called the maximum likelihood estimate, as is commonly used. The key result typically shown in an introductory machine learning class for the maximum likelihood estimate under mean squared error is reproduced below.
where is the column vector in consisting of all the output values from our training data set and is a design matrix consisting of in each row as brlow.
We are interested in the geometric interpretation of this and we use the formal tool from linear algebra called orthogonal projectors. Familiarity with basic linear algebra is assumed.
With the lack of visuals for now, imagine everything in two or three dimensions. Everything can be generalized to finite higher-dimensional spaces.
A projector 1 is a square matrix that satisfies . In plain words, this means that for any vector , is the projection of it into a column-space of . Further, re-applying the projection to this new vector keeps the new vector intact. This is also known as an idempotent matrix.
If is a projector, then is also a projector, where is identity matrix of compatible dimensions. This is known as the complementary projector, for reasons we describe next.
projects exactly onto the null space of . To see why, we note that for any vector in the null space of , . This means that the space spanned by is at least as large as the null space of . Further, for any , belongs to the null space of as . This means that the space spanned by can at most be as large as the null space of . Combined, we get exactly the null space of .
Complementary projectors allow us to write any vector as a direct sum 2 of two subspaces and . This is only possible because . To see why this is true, we note that any vector in the null space of and in the null space of can be manipulated as . This implies an important result that any projector splits the space into two distinct subspaces. This is why the projector complements the projector .
A projector that splits the space into orthogonal subspaces is known as an orthogonal projector. A projector matrix that is hermitian implies an orthogonal projector. represents the complex conjugate transpose.
It is straighforward to see using an inner product that for any and any for any are orthogonal when is hermitian. Therefore, projects on to a space where every vector is orthogonal to the space that projects to.
Visually, it is helpful to imagine two orthogonal lines , (lines are one-dimensional subspaces), such that if projects onto , projects onto .
An elegant little result that I state here without many details - Gram-Schmidt Orthonormalization can be represented as series of recursively built projection matrices.
So far, we've discussed the properties of a given projection matrix . Imagine the reverse - given any arbitrary basis for an M-dimensional subspace of (), can we construct an orthogonal projector onto this space? The answer is of course, yes. We denote a matrix made by stacking the basis vectors as columns.
For any arbitrary vector , let be the projection through an orthogonal projector on to the column space of , such that . By the properties of orthogonal projections we've discussed earlier, should be orthogonal to the space spanned by the basis represented by columns of . It is sufficient to note orthogonality only to every basis vector , succinctly written as . Since, in the column space of be represented as for arbitrary , solving for , gives us . Putting this back into , we get,
Comparing this to , we have the definition of our orthogonal projector.
With everything we've laid out now, one should find striking similarity between what we have for a maximum likelihood solution and the projector . Understanding the explanation should now be straightforward. Let's state the geometric interpretation for completeness.
The maximum likelihood solution of linear regression (ordinary least squares) is such that forms an orthogonal projection of onto the subspace spanned by the columns of the design matrix .
Intuitively, when the number of basis functions is less than the number of data , ordinary least squares minimizes the error by an orthogonal projection to the subspace spanned by the chosen basis functions. This result is rather natural because the distance to subspace is minimized by an orthogonal projection and linear regression is doing the best it can!
This also reminds us to choose our basis carefully or the inverse operation will suffer as can be close to singular. For the sake of completeness, I note that such issues can be addressed by (reduced) singular value decomposition or just regularization which guarantees full rank.