Poisson is Fish: Principal Component Analysis in R

 

Today I am sharing a recent post from a nice blog called PoissonIsFish, written by a Portuguese (Brazilian?) speaker doing a PhD in Berlin, Germany.

It is a long, accessible demonstration and explanation of how to perform Principal Component Analysis with the R programming language.

Principal Component Analysis in R

 

Principal component analysis (PCA) is routinely employed on a wide range of problems. From the detection of outliers to predictive modeling, PCA can project the observations described by p variables onto a few orthogonal components defined where the data ‘stretch’ the most, rendering a simplified overview. PCA is particularly powerful in dealing with multicollinearity and with variables that outnumber the samples (p \gg n ).
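As a minimal illustration of the p \gg n case, here is a toy example of my own (not from the original post): prcomp happily digests a matrix with far more variables than samples and returns at most min(n, p) components.

# Toy example: 10 samples described by 100 variables (p >> n)
set.seed(1)
X <- matrix(rnorm(10 * 100), nrow = 10, ncol = 100)
fit <- prcomp(X, center = TRUE, scale. = FALSE)
dim(fit$x) # 10 x 10 score matrix: no more than min(n, p) PCs are returned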

 


It is an unsupervised method, meaning it will always look for the greatest sources of variation regardless of the data structure. Its counterpart, partial least squares (PLS), is a supervised method and will perform the same sort of covariance decomposition, albeit building a user-defined number of components (frequently designated latent variables) that minimize the sum of squared errors (SSE) of predicting a specified outcome with ordinary least squares (OLS). PLS is worth an entire post of its own, so I will refrain from casting a second spotlight here.

In case PCA is entirely new to you, there is an excellent Primer from Nature Biotechnology that I highly recommend. Notwithstanding its focus on the life sciences, it should still be clear to non-biologists.

Mathematical foundation

 

There are numerous PCA formulations in the literature dating back as much as a century, but all in all PCA is pure linear algebra. One of the most popular methods is the singular value decomposition (SVD). The SVD algorithm breaks down a matrix X of size n \times p  into three pieces,

X = U \Sigma V^T

where U  is the matrix with the eigenvectors of X X^T , \Sigma  is the diagonal matrix with the singular values, and V^T  is the matrix with the eigenvectors of X^T X . These matrices are of size n \times n , n \times p  and p \times p , respectively. The key difference of the SVD compared to a matrix diagonalization (X = V \Sigma V^{-1} ) is that U  and V^T  are distinct orthonormal (orthogonal and unit-vector) matrices.
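As a quick numerical check of this orthonormality, here is a small snippet of my own (note that base R's svd returns the ‘thin’ decomposition by default, so U and V come back with min(n, p) columns rather than the full n \times n and p \times p matrices):

# Columns of U and V are orthonormal: their cross-products are identity matrices
set.seed(42)
A <- matrix(rnorm(20), 4, 5)
s <- svd(A)
round(crossprod(s$u), 5) # t(U) %*% U = identity
round(crossprod(s$v), 5) # t(V) %*% V = identity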

(…)

PCA reduces the p dimensions of your data set X down to k principal components (PCs). The scores on the first k PCs result from multiplying the first k columns of U by the k \times k  upper-left submatrix of \Sigma . The loading factors of the k^{th} PC are directly given in the k^{th}  row of V^T  (equivalently, the k^{th}  column of V ). Consequently, multiplying the full matrix of scores (U \Sigma ) by V^T  recovers X . You might as well keep the following in mind:

 

 

  • PCs are ordered by the decreasing amount of variance explained
  • PCs are orthogonal, i.e. uncorrelated to each other (these first two points are illustrated in the short check after this list)
  • The columns of X should be mean-centered, so that the covariance matrix is proportional to X^{T} X
  • SVD-based PCA does not tolerate missing values (but there are solutions we will cover shortly)
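
The first two points are easy to verify in practice. Here is a small check of my own (not from the original post), using the iris measurements that ship with R:

irisPCA <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(irisPCA) # the proportion of variance explained decreases from PC1 to PC4
round(cor(irisPCA$x), 5) # the PC scores are pairwise uncorrelated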

For a more elaborate explanation with introductory linear algebra, here is an excellent free SVD tutorial I found online. At any rate, I guarantee you can master PCA without fully understanding the process.

Let’s get started with R

Although there is a plethora of PCA methods available for R, I will only introduce two,

 

  • prcomp, a default function shipped with base R (in the stats package)

 

  • pcaMethods, a Bioconductor package that I frequently use for my own PCAs (a brief usage sketch of its interface follows this list)
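
For reference, here is a minimal usage sketch of pcaMethods of my own making (a sketch assuming its pca, scores and loadings interface; unlike SVD-based prcomp, the NIPALS method it offers tolerates the odd missing value):

# Sketch: PCA on a small random matrix with one missing entry (NIPALS)
library(pcaMethods) # Bioconductor package, e.g. BiocManager::install("pcaMethods")
toyMat <- matrix(rnorm(40), 8, 5)
toyMat[2, 3] <- NA
resNipals <- pca(toyMat, method = "nipals", nPcs = 2, center = TRUE)
scores(resNipals) # sample scores
loadings(resNipals) # variable loadings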

 

I will start by demonstrating that prcomp is based on the SVD algorithm, using the base svd function.

# Generate scaled 4*5 matrix with random std normal samples
set.seed(101)
mat <- scale(matrix(rnorm(20), 4, 5))
dimnames(mat) <- list(paste("Sample", 1:4), paste("Var", 1:5))
# Perform PCA
myPCA <- prcomp(mat, scale. = F, center = F)
myPCA$rotation # loadings
myPCA$x # scores

By default, prcomp will retrieve min(n, p) PCs; therefore, in our setting we expect to obtain four PCs. The svd function behaves the same way:
# Perform SVD
mySVD <- svd(mat)
mySVD # the singular values (the diagonal of Sigma) are returned as the vector mySVD$d
sigma <- matrix(0,4,4) # we have 4 PCs, no need for a 5th column
diag(sigma) <- mySVD$d # sigma is now our true sigma matrix

Now that we have the PCA and SVD objects, let us compare the respective scores and loadings. We will first compare the scores from the PCA with the product of U and \Sigma from the SVD. In R, matrix multiplication is done with the operator %*%. Next, we will directly compare the loadings from the PCA with V from the SVD, and finally show that multiplying scores and loadings recovers X . I rounded the results to five decimal digits, since the two are not numerically identical. The function t returns the transpose of a matrix.

 

# Compare PCA scores with the SVD's U*Sigma
theoreticalScores <- mySVD$u %*% sigma
all(round(myPCA$x,5) == round(theoreticalScores,5)) # TRUE
# Compare PCA loadings with the SVD's V
all(round(myPCA$rotation,5) == round(mySVD$v,5)) # TRUE
# Show that mat == U*Sigma*t(V)
recoverMatSVD <- theoreticalScores %*% t(mySVD$v)
all(round(mat,5) == round(recoverMatSVD,5)) # TRUE
#Show that mat == scores*t(loadings)
recoverMatPCA <- myPCA$x %*% t(myPCA$rotation)
all(round(mat,5) == round(recoverMatPCA,5)) # TRUE

The post continues with some nice applications of PCA to a wine data set and to a data set of housing characteristics from a US survey (this part also describes principal component regression, PCR, i.e. feeding the principal component scores into an ordinary least squares regression). Interesting reading.
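
To give a rough idea of the PCR recipe mentioned above, here is a minimal sketch of my own using the mtcars data shipped with R (not the post's wine or housing code): run PCA on the predictors, then regress the outcome on the leading PC scores.

# Minimal PCR sketch: PCA on the predictors, then OLS on the first two PC scores
pcs <- prcomp(mtcars[, -1], center = TRUE, scale. = TRUE)
pcrFit <- lm(mtcars$mpg ~ pcs$x[, 1:2])
summary(pcrFit)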
As for the concluding remarks, they go as follows:

Concluding,

 

  • The SVD algorithm is founded on fundamental properties of linear algebra, including matrix diagonalization. SVD-based PCA takes part of that solution and retains a reduced number of orthogonal covariates that explain as much variance as possible.

 

  • Use PCA when handling high-dimensional data. It is insensitive to correlation among variables and efficient in detecting sample outliers.

 

  • If you plan to use PCA results in subsequent analyses, all due care should be taken in the process. Although typically outperformed by numerous methods, PCR still benefits from interpretability and can be effective in many settings.

 

All feedback on these tutorials is very welcome; please go to the Contact tab and leave your comments. I also appreciate suggestions. Enjoy!
