Thursday, October 27, 2011

A New Dimension to Principal Components Analysis

In general, the standard practice for correcting for population stratification in genetic studies is to use principal components analysis (PCA) to categorize samples along different ethnic axes.  Price et al. published on this in 2006, and since then PCA plots are a common component of many published GWAS studies.  One key advantage to using PCA for ethnicity is that each sample is given coordinates in a multidimensional space corresponding to the varying components of their ethnic ancestry.  Using either full GWAS data or a set of ancestral informative markers (AIMs), PCA can be easily conducted using available software packages like EIGENSOFT or GCTA. HapMap samples are sometimes included in the PCA analysis to provide a frame of reference for the ethnic groups.  

Once computed, each sample will have values that correspond to a position in the new coordinate system that effectively clusters samples together by ethnic similarity.  The results of this analysis are usually plotted/visualized to identify ethnic outliers or to simply examine the structure of the data.  A common problem however is that it may take more than the first two principal components to identify groups.  

To illustrate, I will plot some PCs generated based on 125 AIMs markers for a recent study of ours.  I generated these using GCTA software and loaded the top 5 PCs into R using the read.table() function.  I loaded the top 5, but for continental ancestry, I've found that the top 3 are usually enough to separate groups.  The values look something like this:  
    new_ruid        pc1            pc2               pc3              pc4               pc5
1    11596  4.10996e-03 -0.002883830  0.003100840 -0.00638232  0.00709780
2     5415  3.22958e-03 -0.000299851 -0.005358910  0.00660643  0.00430520
3    11597 -4.35116e-03  0.013282400  0.006398130  0.01721600 -0.02275470
4     5416  4.01592e-03  0.001408180  0.005077310  0.00159497  0.00394816
5     3111  3.04779e-03 -0.002079510 -0.000127967 -0.00420436  0.01257460
6    11598  6.15318e-06 -0.000279919  0.001060880  0.00606267  0.00954331
I loaded this into a dataframe called pca, so I can plot the first two PCs using this command:
plot(pca$pc1, pca$pc2)

We might also want to look at the next two PCs:
plot(pca$pc2, pca$pc3)

 Its probably best to look at all of them together:

So this is where my mind plays tricks on me.  I can't make much sense out of these plots -- there should be four ethnic groups represented, but its hard to see who goes where.  To look at all of these dimensions simultaneously, we need a 3D plot.  Now 3D plots (especially 3D scatterplots) aren't highly regarded -- in fact I hear that some poor soul at the University of Washington gets laughed at for showing his 3D plots  -- but in this case I found them quite useful.

Using a library called rgl, I generated a 3D scatterplot like so:

Now, using the mouse I could rotate and play with the cloud of data points, and it became more clear how the ethnic groups sorted out.  Just to double check my intuition, I ran a model-based clustering algorithm (mclust) on the data.  Different parameters obviously produce different cluster patterns, but I found that using an "ellipsoidal model with equal variances" and a cluster size of 4 identified the groups I thought should be there based on the overlay with the HapMap samples.

fit <- Mclust(pca[2:4], G=4, modelNames = "EEV")
plot3d(pca[2:4], col = fit$classification) 
Basically, the red sphere corresponds to the European descent group, the green indicates the admixed African American group, the black group corresponds to the Hispanic group, and the blue identifying the Asian descent group.  We are still a bit confused as to why the Asian descent samples don't form a more concise cluster -- it may be due to relatively poor performance of these AIMs in Asian descent groups.   Whatever the case, you might notice several individuals falling either outside a clear cluster or at the interface between two groups.  The ethnic assignment for these individuals is questionable, but the clustering algorithm gives us a very nice measure of cluster assignment uncertainty.  We can plot this like so:
plot(pca[2:3], cex = fit$uncertainty*10)
I had to scale the uncertainty factor by 10 to make the questionable points more visible in this plot, shown as the hollow circles.  We will likely drop these samples from any stratified analyses.  We can export the cluster assignment by accessing the fit$classification column, and we have our samples assigned to an ethnic group.


  1. Great! This just gave me new ideas for the visualization of my data. But how did you get these 3d graphs moving?

  2. Will wrote this post - I'm guessing he wrote a script to generate PNGs over many different angles, then used ImageMagick/montage to combine them all together into an animated gif. That's how I'd do it at least. Will?

  3. The rgl package is rather extensive. I used a function called movie3d (with another function called rotate3d) to generate a series of images. If have the imagemagick tool Stephen mentioned installed, it will automagickly generate an animation for you. Otherwise you can use the "convert" command from imagemagick to create the animation from a stack of images.

  4. Could you post the code used to generate the animation?


  5. Why don't you colour your data points in the scatterplot matrices? Instead of using pairs, give the splom function a go (it is part of the lattice package); than you should also be able to colour your labels. I don't really see that the 3D makes too much of a difference until you coloured your points. Also, keep in mind that for papers you can't have a nice rotating image so you're still stuck trying to show something that is 3 dimensional in 2D when you publish (so you'll always occlude something); the splom doesn't occlude anything since it is already in 2D.

  6. Hi,
    I was wondering if somebody could help me. I'm just new to R and bioinformatics and I'm still finding it hard to work everything out. I'm currently doing a GWAS study and having done an MDS-plot I noticed there appears to be a large proportion of cryptic relatedness/population stratification. I wanted to investigate this further and so I have tried to follow the excellent example provided in this blog. I have conducted a PCA analysis using the GCTA software and my first question is, how do I load my data into the dataframe called pca, and read in just the the top 5 PCs. Secondly I would really appreciate if somebody could please provide a step by step guide on how the 3D plot was generated and how I would go about coloring my case and control populations. I apologize if these seem rather mundane questions but being just new to this field it can be very confusing.. I would greatly appreciate any help that can be offered.



  7. Erik,

    I would start by breezing through Peter Dalgaard's book Introductory Statistics with R. It will give you some pretty good help with the basics of R.

  8. excellent, post, thanks a lot!

    I deal myself with PCA and PLS (chemo-, rather than bio-informatics) and will try to contribute something back to the community soon...


Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.