PCA through the lens of Linear Algebra


Visual demonstration of the PCA dimension reduction procedure created with manim.

Motivation for PCA

If you have heard of principal component analysis (PCA) before, you might know that it is related to reducing the dimension of data. You may also remember that this technique has a connection to eigenvectors and eigenvalues. In this article, I explore why PCA is so intimately connected to concepts from linear algebra, without using any calculus.

The curse of dimensionality can give rise to many issues when it comes to working with high-dimensional data. This motivates the desire to reduce the dimensionality of data.

A standard setting you have likely seen before is a dataset with \(p\) real-valued covariates and \(n\) samples stacked in an \(n \times p\) matrix \(X \in \mathbb R^{n \times p}\).1

A reasonable place to begin would be to find a way to describe each \(p\)-dimensional observation with a single number. One basic approach would be to choose one covariate (random variable / column of the design matrix) to keep, discarding the rest. Which variable should we choose to keep? Consider the following example:

\[X_1 = \begin{bmatrix} 1 & 1\\ 2 & 1 \\ 3 & 1.3 \end{bmatrix}\]

The first variable is \(\boldsymbol{x_1} =(1,2,3)^{\mathrm{\scriptscriptstyle T} }\) and the second is \(\boldsymbol{x_2} =(1,1,1.3)^{\mathrm{\scriptscriptstyle T} }\).

If we delete \(\boldsymbol{x_1}\), then we can no longer distinguish between the first two observations since the first two elements of \(\boldsymbol{x_2}\) are the same. If we delete \(\boldsymbol{x_2}\), then we can still distinguish between the different observations in our design matrix. In an attempt to maintain a high level of “distinguishability” or information, we may choose to save the first column of our matrix while discarding the second.

When we encounter more complicated examples, one way to figure out which random variable allows us to keep the most information possible is to

  1. Compute the sample variance of each column present and then
  2. Choose to keep the column with the largest sample variance.

Let’s do this for the toy example, \(X_1\).

Let \(\overline x=\frac{1}{n}\sum_{i=1}^n x_i\) denote the sample mean and \(S^2_x = \frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2\) denote the sample variance of a random variable \(x\) with observed values \(x_1,\cdots,x_n\). Then,

\[S^2_{x_1} = \frac{1}{3-1} \left( (1-2)^2+(2-2)^2+(3-2)^2 \right)=1\] \[S^2_{x_2} = \frac{1}{3-1} \left( (1-1.1)^2+(1-1.1)^2 +(1.3-1.1)^2 \right)= 0.03\]

Since the sample variance of the first column is largest, we will choose to delete the second column.
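For concreteness, here is a short NumPy sketch of this selection heuristic applied to the toy matrix \(X_1\) (the variable names are my own):

```python
import numpy as np

# Toy design matrix X_1 from above: 3 samples, 2 covariates.
X1 = np.array([[1.0, 1.0],
               [2.0, 1.0],
               [3.0, 1.3]])

# Sample variance of each column (ddof=1 gives the 1/(n-1) normalization).
col_vars = X1.var(axis=0, ddof=1)
print(col_vars)             # [1.   0.03]

# Keep only the column with the largest sample variance.
best = np.argmax(col_vars)  # 0, i.e. the first column
X_reduced = X1[:, [best]]
print(X_reduced.ravel())    # [1. 2. 3.]
```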

In PCA, we go one step beyond selecting a single random variable and try to find a linear combination of our random variables with maximum observed variance. Let’s figure out how to formulate this optimization problem mathematically.

Mathematical Formulation of PCA

Since we are working with sample variances, our math becomes a bit easier if we center our columns by subtracting their sample means from them. Let’s do this and assume from now on that each column of \(X\) has a sample mean of \(0\).

An arbitrary linear combination of our \(p\) random variables, \(\boldsymbol{x_1}, \dots, \boldsymbol{x_p}\), looks like

\[\boldsymbol{x_u} := u_1 \boldsymbol{x_1} + \dots +u_p \boldsymbol{x_p} = Xu\]

where \(\boldsymbol{u}=(u_1,\cdots, u_p)^{\mathrm{\scriptscriptstyle T} } \in \mathbb R^p\). Note that

\[\overline{\boldsymbol{x_u}} = 0\]

because we centered the columns of \(X\), as one can easily verify. We denote the usual Euclidean inner product (dot product) by \(\langle \cdot, \cdot \rangle : \mathbb R^p \times \mathbb R^p \rightarrow \mathbb R \), defined as

\[\langle \boldsymbol{x}, \boldsymbol{y} \rangle = x_1y_1+\dots+x_py_p = \boldsymbol{x}^{\mathrm{\scriptscriptstyle T} }\boldsymbol{y}\]

for \(\boldsymbol{x},\boldsymbol{y} \in \mathbb R^p\).

Thus, since \(\overline{\boldsymbol{x_u}}=0\), the sample variance of \(\boldsymbol{x_u}\) is \(\frac{1}{n-1}\boldsymbol{x_u}^{\mathrm{\scriptscriptstyle T} }\boldsymbol{x_u}\). From here on we drop the constant factor \(\frac{1}{n-1}\), since it does not affect which linear combination maximizes the variance, and simply write

\[S^2_{\boldsymbol{x_u}} = \boldsymbol{x_u}^{\mathrm{\scriptscriptstyle T} }\boldsymbol{x_u} =(X\boldsymbol{u})^{\mathrm{\scriptscriptstyle T} }(X\boldsymbol{u})=\boldsymbol{u}^{\mathrm{\scriptscriptstyle T} }X^{\mathrm{\scriptscriptstyle T} }X\boldsymbol{u}\]

Notice that if we fix some \(\boldsymbol u \in \mathbb R^p\) and scale it by a nonzero scalar \(a \in \mathbb R\), the resulting linear combination is essentially the same one. Specifically, if we let \(\boldsymbol{u}'=a\boldsymbol{u}\), then the relative influence of each \(\boldsymbol{x_i}\) on the whole is unchanged, so \(\boldsymbol{x_{u'}}\) and \(\boldsymbol{x_{u}}\) carry the same information.

Moreover, the sample variance of \(\boldsymbol{x_{u'}}\) is the sample variance of \(\boldsymbol{x_u}\) multiplied by a factor of \(a^2\):

\[S^2_{\boldsymbol{x_{u'}}}=\boldsymbol{u'}^{\mathrm{\scriptscriptstyle T} }X^{\mathrm{\scriptscriptstyle T} }X\boldsymbol{u'}=(a\boldsymbol{u})^{\mathrm{\scriptscriptstyle T} }X^{\mathrm{\scriptscriptstyle T} }Xa\boldsymbol{u}=a^2\boldsymbol{u}^{\mathrm{\scriptscriptstyle T} }X^{\mathrm{\scriptscriptstyle T} }X\boldsymbol{u}=a^2S^2_{\boldsymbol{x_{u}}}\]

Because the variance scales as the square of the scalar \(a\), we restrict the Euclidean norm of \(\boldsymbol{u}\) to be 1. That is, we want to find \(\boldsymbol{u}\) such that \(\boldsymbol{u}^{\mathrm{\scriptscriptstyle T} }\boldsymbol{u}=1\) and the variance, \(\boldsymbol{u}^{\mathrm{\scriptscriptstyle T} }X^{\mathrm{\scriptscriptstyle T} }X\boldsymbol{u}\), is maximized.
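This setup is easy to verify numerically. Below is a minimal NumPy sketch on simulated data of my own, checking that \(\boldsymbol{u}^{\mathrm{\scriptscriptstyle T} }X^{\mathrm{\scriptscriptstyle T} }X\boldsymbol{u}\) matches the sample variance of \(X\boldsymbol u\) (up to the dropped \(\frac{1}{n-1}\) factor), and that scaling \(\boldsymbol u\) by \(a\) scales the variance by \(a^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)          # center each column

u = rng.normal(size=p)
u = u / np.linalg.norm(u)       # restrict u to unit Euclidean norm

# Sample variance of the linear combination x_u = X u ...
lhs = np.var(X @ u, ddof=1)
# ... equals u^T X^T X u up to the 1/(n-1) factor.
rhs = u @ X.T @ X @ u / (n - 1)
print(np.isclose(lhs, rhs))     # True

# Scaling u by a scales the variance by a^2, hence the unit-norm constraint.
a = 2.5
print(np.isclose(np.var(X @ (a * u), ddof=1), a**2 * lhs))  # True
```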

Linear Algebraic Solution

This optimization problem can be solved using Lagrange multipliers from calculus. We will instead approach it from a linear algebraic perspective. Specifically, we will take advantage of the nice properties of \(X^{\mathrm{\scriptscriptstyle T} }X\), which (up to the dropped \(\frac{1}{n-1}\) factor) is the sample covariance matrix of our centered data.

We might notice that the sample variance of the linear combination given by \(u\) can be written as an inner product

\[\boldsymbol{u}^{\mathrm{\scriptscriptstyle T} }X^{\mathrm{\scriptscriptstyle T} }X\boldsymbol{u} = \langle \boldsymbol{u}, X^{\mathrm{\scriptscriptstyle T} }X\boldsymbol{u} \rangle\]

We observe that \(X^{\mathrm{\scriptscriptstyle T} }X \in \mathbb R^{p \times p}\), so it defines a linear transformation from \(\mathbb R^p \) to itself; in linear algebra, such a map is called a linear operator on \(\mathbb R^p\). Moreover, recall that \(X^{\mathrm{\scriptscriptstyle T} }X\) is always a symmetric matrix (no matter what \(X\) contains), and a symmetric matrix represents a self-adjoint operator.

Then by the real spectral theorem, there is an orthonormal basis of \(\mathbb R^p\) consisting of eigenvectors of \(X^{\mathrm{\scriptscriptstyle T} }X\). Let \(\boldsymbol e_1,\cdots,\boldsymbol e_p\) be the eigenvectors which make up the orthonormal basis2 with corresponding eigenvalues \(\lambda_1 \geq \cdots \geq \lambda_p\).

As you might know, inner products simplify enormously when you express vectors in terms of an orthonormal basis. Let's write \(\boldsymbol{u}\) as a linear combination of our orthonormal basis vectors!

\[\boldsymbol u = a_1 \boldsymbol e_1+\cdots+a_p\boldsymbol e_p\]

where \(a_i \in \mathbb R\) for each \(i=1,\cdots,p\). Now let’s use this representation to rewrite the sample variance of \(\boldsymbol{x_u}\) in terms of the eigenvalues:

\[\begin{align*} \langle \boldsymbol{u}, X^{\mathrm{\scriptscriptstyle T} }X\boldsymbol{u} \rangle &= \langle a_1 \boldsymbol e_1+\cdots+a_p\boldsymbol e_p, X^{\mathrm{\scriptscriptstyle T} }X(a_1 \boldsymbol e_1+\cdots+a_p\boldsymbol e_p) \rangle\\ &= \langle a_1 \boldsymbol e_1+\cdots+a_p\boldsymbol e_p, a_1 \lambda_1 \boldsymbol e_1+\cdots+a_p \lambda_p\boldsymbol e_p \rangle \\ &= a_1^2\lambda_1+\cdots+a_p^2 \lambda_p \end{align*}\]

What insight does this give us for the posed optimization problem? Well, \(\boldsymbol u\) is determined by the values of the \(a_i\). Moreover, \(\boldsymbol{u}^{\mathrm{\scriptscriptstyle T} }\boldsymbol{u}=1\) implies that \(a_1^2+\cdots +a_p^2=1 \) and we can thus think of

\[a_1^2\lambda_1+\cdots+a_p^2 \lambda_p\]

as a weighted average of the eigenvalues \(\lambda_1,\cdots, \lambda_p \). Since we arranged the eigenvalues in decreasing order, it follows that

\[a_1^2\lambda_1+\cdots+a_p^2 \lambda_p \leq a_1^2\lambda_1+\cdots+a_p^2 \lambda_1=\lambda_1(a_1^2+\cdots+a_p^2)=\lambda_1\]

In the above inequality, we replaced \(\lambda_i \) with \(\lambda_1\) for every \(i = 2, \cdots, p\) since \(\lambda_1\) is the largest eigenvalue.

We note that all the eigenvalues of \(X^{\mathrm{\scriptscriptstyle T} }X\) are non-negative and include a short proof3 in the footnotes.
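As a quick numerical sanity check of this fact (simulated data, names of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)

# eigvalsh handles symmetric matrices such as X^T X; all eigenvalues
# should be >= 0 up to floating-point error.
eigenvalues = np.linalg.eigvalsh(X.T @ X)
print(np.all(eigenvalues >= -1e-10))  # True
```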

We have shown that the maximum possible sample variance of \(\boldsymbol{x_u} \) is \(\lambda_1\). Moreover, this maximal sample variance is attained when \(\boldsymbol u=\boldsymbol e_1\) because in this case, \(a_1=1\) and all other \(a_i\) are equal to zero. Recall that \(\boldsymbol u\) tells us how to linearly assemble our \(p\) random variables!

We have successfully figured out which linear combination preserves the most variance: namely, \(\boldsymbol{x_{e_1}}\).

The linear combination whose coefficients are given by the eigenvector with the largest eigenvalue is called the first principal component of our set of \(p\) random variables.
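Here is a hedged NumPy sketch of this result on simulated data (all names are my own): the top eigenvector of \(X^{\mathrm{\scriptscriptstyle T} }X\) attains the variance \(\lambda_1\), and no random unit vector does better.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated covariates
X = X - X.mean(axis=0)                                  # center columns

# Eigendecomposition of the symmetric matrix X^T X (ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
lam1 = eigvals[-1]           # largest eigenvalue
e1 = eigvecs[:, -1]          # corresponding unit eigenvector

# Variance (sum-of-squares form) attained by the first principal component.
print(np.isclose(e1 @ X.T @ X @ e1, lam1))   # True

# No random unit vector should do better than lambda_1.
U = rng.normal(size=(1000, p))
U /= np.linalg.norm(U, axis=1, keepdims=True)
variances = np.einsum('ij,jk,ik->i', U, X.T @ X, U)  # u_i^T X^T X u_i
print(variances.max() <= lam1 + 1e-9)        # True
```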

More Principal Components?

Once we have the first principal component, we might want others (you often don’t want to go all the way from \(p\) dimensions down to a single dimension).

What other linear combinations should we use? Just as in our original idea, where we would keep or delete individual variables, once we decide to keep a variable we do not want to choose it again. Therefore, when selecting a second principal component, we do not want to include any information already contained in the first component.

Statistically, this means choosing a new \(\boldsymbol u \in \mathbb R^p\) such that the sample covariance of \(\boldsymbol {x_u}\) with \(\boldsymbol{x_{e_1}}\) is zero, while \(\boldsymbol{x_u}\) itself has maximal sample variance.

In the language of inner products, this means choosing \(\boldsymbol u =a_1\boldsymbol e_1+\cdots+a_p\boldsymbol e_p\) such that \(a_1^2+\cdots+a_p^2=1 \) and \(\langle \boldsymbol u, \boldsymbol e_1 \rangle =0\). Note that

\[\langle \boldsymbol u, \boldsymbol e_1 \rangle =0 \text{ if and only if } a_1=0\]
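To see why, expand \(\boldsymbol u\) in the orthonormal basis and use \(\langle \boldsymbol e_i, \boldsymbol e_1 \rangle = 0\) for \(i \neq 1\) together with \(\langle \boldsymbol e_1, \boldsymbol e_1 \rangle = 1\):

\[\langle \boldsymbol u, \boldsymbol e_1 \rangle = a_1 \langle \boldsymbol e_1, \boldsymbol e_1 \rangle + a_2 \langle \boldsymbol e_2, \boldsymbol e_1 \rangle + \cdots + a_p \langle \boldsymbol e_p, \boldsymbol e_1 \rangle = a_1\]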

Similar to when we were trying to find the first principal component, this problem reduces to choosing \(a_2,\cdots, a_p \) such that

\[a_2^2\lambda_2+\cdots+a_p^2 \lambda_p\]

is maximized subject to \(a_2^2+\cdots+a_p^2=1\).

We can apply the same reasoning as before to find that the solution is given by \(a_2=1\) and all other \(a_i=0\). As before, this means that the coefficients for the second principal component come from the elements of the eigenvector of \(X^{\mathrm{\scriptscriptstyle T} }X\) with the second largest eigenvalue.

This pattern repeats and gives us a total of \(p\) principal components.
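To tie everything together, here is a short end-to-end NumPy sketch of the procedure described in this article (simulated data, variable names of my own choosing): center the columns, eigendecompose \(X^{\mathrm{\scriptscriptstyle T} }X\), sort the eigenvectors by decreasing eigenvalue, and project onto the first \(k\) of them. The resulting component scores have (numerically) zero sample covariance with one another, as the argument above predicts.

```python
import numpy as np

def pca_via_eigh(X, k):
    """Project the rows of X onto the first k principal components.

    A sketch of the procedure described above: center the columns,
    eigendecompose X^T X, and keep the eigenvectors with the largest
    eigenvalues as the principal directions.
    """
    Xc = X - X.mean(axis=0)                    # center each column
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(eigvals)[::-1]          # decreasing eigenvalue order
    components = eigvecs[:, order[:k]]         # e_1, ..., e_k as columns
    return Xc @ components                     # n x k matrix of scores

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
scores = pca_via_eigh(X, k=2)

# Sample covariance between the two component scores is (numerically) zero.
print(np.cov(scores, rowvar=False).round(6))
```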

  1. We use \(\mathbb R^{n \times p}\) to denote the set of \(n \times p\) real-valued matrices. 

  2. An orthonormal basis is one in which every vector has norm equal to 1 and the basis vectors are pairwise orthogonal. 

  3. By the spectral theorem there exist matrices \(D\) (a diagonal matrix whose diagonal entries are the eigenvalues of \(X^{\mathrm{\scriptscriptstyle T} }X\)) and an orthogonal matrix \(P\) such that \(X^{\mathrm{\scriptscriptstyle T} }X = PDP^{\mathrm{\scriptscriptstyle T} }\). Then, for any vector \(\boldsymbol{v} \in \mathbb R^p\) we have \(0 \leq \langle X\boldsymbol{v}, X\boldsymbol{v} \rangle = \boldsymbol{v}^{\mathrm{\scriptscriptstyle T} }X^{\mathrm{\scriptscriptstyle T} }X\boldsymbol{v} = (P^{\mathrm{\scriptscriptstyle T} }\boldsymbol{v})^{\mathrm{\scriptscriptstyle T} }D (P^{\mathrm{\scriptscriptstyle T} }\boldsymbol{v})=\boldsymbol{y}^{\mathrm{\scriptscriptstyle T} }D\boldsymbol y=\lambda_1y_1^2+\cdots+\lambda_p y_p^2\), where \(\boldsymbol y = P^{\mathrm{\scriptscriptstyle T} }\boldsymbol v\). Since \(P\) is invertible, for each \(i\in \{1,\cdots, p\}\) we can choose \(\boldsymbol v\) so that \(y_i=1\) and \(y_k=0\) for all \(k \neq i\). Thus \(\lambda_i \geq 0\). 
