Here, we will explain principal component analysis (PCA) by stepping through the algorithm manually and reproducing the output of the prcomp() function in R, which is normally used to do PCA.
First, make up some data and plot it. In terms of gene expression analysis, we can think of the rows of the matrix below as different samples (or microarray experiments etc.), with 2 genes each. It's much easier to explain PCA for two dimensions and then generalise from there.
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
m <- cbind(x, y)
Here’s how you’d make a PCA plot for this data using R’s built-in functions. (We’re about to learn how to do this for ourselves.)
pca_proc <- prcomp(m)
plot(pca_proc$x, pch=seq(9)) #the pca plot
plot(pca_proc) # the variance captured by each PC (a scree plot)
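Before reproducing these plots by hand, it may help to look at what prcomp() returns. The sketch below rebuilds the same toy data and prints the components of the result that the manual steps will aim to reproduce.

```r
# Same toy data as above
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
m <- cbind(x, y)
pca_proc <- prcomp(m)

pca_proc$rotation  # 2 x 2 matrix, one column of loadings per PC
pca_proc$sdev^2    # variance captured by each PC
head(pca_proc$x)   # coordinates of the samples on PC1 and PC2
summary(pca_proc)  # includes the proportion of variance per PC
```

The rotation, sdev and x components correspond, respectively, to the eigenvectors, the square roots of the eigenvalues, and the transformed data.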
Now to do PCA manually. The first step is to center the data around zero, then plot the results.
x1 <- x - mean(x)
y1 <- y - mean(y)
plot(x1, y1, xlim=c(-40, 40), ylim=c(-40, 20))
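Centering can also be applied to the whole matrix at once. The following is a minimal sketch using base R's scale() with scaling switched off; the name m_centered is only illustrative. It checks that the column means are zero afterwards and that the result agrees with subtracting the mean by hand.

```r
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
m <- cbind(x, y)

# Subtract each column's mean; scale = FALSE leaves the variances untouched
m_centered <- scale(m, center = TRUE, scale = FALSE)

colMeans(m_centered)  # both effectively zero
all.equal(as.vector(m_centered[, "x"]), x - mean(x))
all.equal(as.vector(m_centered[, "y"]), y - mean(y))
```

Note that prcomp() centers its input by default (center = TRUE), which is why this manual step is needed to match its output.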
Next, calculate the covariance matrix for the vectors x1 and y1 and plot its column vectors. These column vectors describe the direction of variability in the data, i.e. how cov(x, x) compares with cov(x, y) and how cov(y, y) compares with cov(y, x). Note that the covariance of x with itself is the same as the variance of x (i.e. var(x) == cov(x, x)). More info on the covariance matrix.
m1 <- cbind(x1, y1)
cM <- cov(m1)
lines(x=c(0, cM[1,1]), y=c(0, cM[2,1]), col="blue")
lines(x=c(0, cM[1,2]), y=c(0, cM[2,2]), col="blue")
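To see what cov() is doing, the covariance matrix can be computed directly from the centered data. This sketch assumes the usual sample covariance with an n - 1 denominator, which is what cov() uses; cM_manual is an illustrative name.

```r
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
x1 <- x - mean(x)
y1 <- y - mean(y)
m1 <- cbind(x1, y1)

n <- nrow(m1)
# Sum of cross-products of the centered columns, divided by n - 1
cM_manual <- t(m1) %*% m1 / (n - 1)

cM_manual
all.equal(cM_manual, cov(m1))             # same matrix as cov()
all.equal(unname(cM_manual[1, 1]), var(x)) # diagonal entries are the variances
```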
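For this data, cov(x, y) is smaller than both var(x) and var(y), so one of the lines plotted above falls above the diagonal and the other falls below it. (This is not guaranteed in general: cov(x, y) is only bounded by sqrt(var(x) * var(y)), so it can exceed the smaller of the two variances.) Plot the diagonal (i.e. a line through the origin with a slope of 1).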
abline(0, 1, col="grey")
While the vectors we have plotted above describe the direction of the variability in the data, they do not do so in a way that is particularly useful. By finding the eigenvectors of the covariance matrix, we can describe the variability in terms of two orthogonal vectors which capture the direction of variation, instead of the two vectors that we are currently plotting. For more info on eigenvectors you’ll need to study some basic linear algebra. This can be done on Khan Academy; of particular interest will be the videos on matrix multiplication and, obviously, eigenvectors. This is by far the most difficult part of PCA to understand.
eigenVecs <- eigen(cM)$vectors
lines(x=c(0, eigenVecs[1,1]), y=c(0, eigenVecs[2,1]), col="red")
lines(x=c(0, eigenVecs[1,2]), y=c(0, eigenVecs[2,2]), col="red")
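The properties claimed for the eigenvectors can be checked numerically. The sketch below verifies the defining relation (the covariance matrix times an eigenvector equals the eigenvalue times that eigenvector), that the two vectors are orthogonal unit vectors, and that they match the rotation matrix from prcomp(). The comparison uses absolute values because the sign of an eigenvector is arbitrary and may differ between the two functions.

```r
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
cM <- cov(cbind(x, y))  # covariance is unaffected by centering

e <- eigen(cM)
v1 <- e$vectors[, 1]
v2 <- e$vectors[, 2]

# Defining property: cM %*% v equals lambda * v
all.equal(as.vector(cM %*% v1), e$values[1] * v1)
all.equal(as.vector(cM %*% v2), e$values[2] * v2)

sum(v1 * v2)     # effectively zero, so the vectors are orthogonal
sqrt(sum(v1^2))  # length 1

# Same directions as prcomp(), up to sign
all.equal(abs(unname(prcomp(cbind(x, y))$rotation)), abs(e$vectors))
```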
As the eigenvectors are unit vectors (i.e. of length 1), it may be easier to visualize how much each of them influences the data if we multiply them by their corresponding eigenvalues. Each eigenvalue is the variance of the data along its eigenvector; dividing it by the sum of the eigenvalues gives the proportion of variability explained by that eigenvector.
eVal1 <- eigen(cM)$values[1] # eigenvalue for the first eigenvector
eVal2 <- eigen(cM)$values[2] # eigenvalue for the second eigenvector
lines(x=c(0, eVal1*eigenVecs[1,1]), y=c(0, eVal1*eigenVecs[2,1]), col="red")
lines(x=c(0, eVal2*eigenVecs[1,2]), y=c(0, eVal2*eigenVecs[2,2]), col="red")
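To turn the eigenvalues into proportions of variance, divide each by their sum. The sketch below does this and also checks two relationships that hold for any covariance matrix: the eigenvalues equal the squared standard deviations reported by prcomp(), and they add up to the total variance var(x) + var(y). The names eVals and propVar are illustrative.

```r
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
pca <- prcomp(cbind(x, y))

eVals <- eigen(cov(cbind(x, y)))$values
eVals                        # variance along PC1 and PC2

propVar <- eVals / sum(eVals)
propVar                      # proportion of total variance per PC

all.equal(eVals, pca$sdev^2)           # matches prcomp()
all.equal(sum(eVals), var(x) + var(y)) # total variance is preserved
```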
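The matrix eigenVecs contains the eigenvectors as its columns, and we want to display the data in terms of these eigenvectors. In this case we will select both eigenvectors, but on high-dimensional datasets it is normal to choose a subset of eigenvectors. We transpose both the eigenvector matrix and the centered data, so that the eigenvectors are rows and the samples are columns.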
rowFeatVec <- t(eigenVecs)
rowDataAdj <- t(m1)
Finally, use matrix multiplication to get the dot product between each eigenvector and each point in the original centered data matrix (m1). This operation describes to what degree each point is influenced by each eigenvector. Plot this and that’s the final plot.
Note: %*% is the matrix multiplication operator in R.
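transFData <- rowFeatVec %*% rowDataAdj # one row per eigenvector, one column per sample
finalData <- t(transFData) # transpose back: one row per sample, as in pca_proc$x
plot(finalData, pch=seq(9)) # the final plot; matches the prcomp() plot up to axis sign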
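As a check that the manual steps reproduce prcomp(), the projected data can be compared with pca_proc$x. This sketch recomputes everything from the raw data so that it runs on its own; the name scores is illustrative. Absolute values are compared because each principal component is only defined up to a sign flip, so an axis may appear mirrored.

```r
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
m <- cbind(x, y)

m1 <- scale(m, center = TRUE, scale = FALSE)  # centered data
eigenVecs <- eigen(cov(m1))$vectors

# Project each sample onto each eigenvector (rows = samples, columns = PCs)
scores <- m1 %*% eigenVecs

# Equivalent to the transposed form used above
all.equal(unname(scores), unname(t(t(eigenVecs) %*% t(m1))))

# Same coordinates as prcomp(), up to the sign of each column
all.equal(abs(unname(scores)), abs(unname(prcomp(m)$x)))
```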
Finally, to plot the equivalent of the scree plot we made above, simply plot the eigenvalues.
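One way to draw this is with barplot(), sketched below. The variable name eigenVals and the axis labels are illustrative choices, not part of prcomp()'s output.

```r
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)

# Eigenvalues of the covariance matrix = variance along each PC
eigenVals <- eigen(cov(cbind(x, y)))$values

barplot(eigenVals, names.arg = c("PC1", "PC2"), ylab = "Variance")
```

The bar heights should match those drawn by plot(pca_proc), since both show the variance along each principal component.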