Thursday, October 27, 2011

A New Dimension to Principal Components Analysis



In general, the standard practice for correcting for population stratification in genetic studies is to use principal components analysis (PCA) to categorize samples along different ethnic axes.  Price et al. published on this in 2006, and since then PCA plots are a common component of many published GWAS studies.  One key advantage to using PCA for ethnicity is that each sample is given coordinates in a multidimensional space corresponding to the varying components of their ethnic ancestry.  Using either full GWAS data or a set of ancestral informative markers (AIMs), PCA can be easily conducted using available software packages like EIGENSOFT or GCTA. HapMap samples are sometimes included in the PCA analysis to provide a frame of reference for the ethnic groups.  

Once computed, each sample will have values that correspond to a position in the new coordinate system that effectively clusters samples together by ethnic similarity.  The results of this analysis are usually plotted/visualized to identify ethnic outliers or to simply examine the structure of the data.  A common problem however is that it may take more than the first two principal components to identify groups.  

To illustrate, I will plot some PCs generated based on 125 AIMs markers for a recent study of ours.  I generated these using GCTA software and loaded the top 5 PCs into R using the read.table() function.  I loaded the top 5, but for continental ancestry, I've found that the top 3 are usually enough to separate groups.  The values look something like this:  
    new_ruid        pc1            pc2               pc3              pc4               pc5
1    11596  4.10996e-03 -0.002883830  0.003100840 -0.00638232  0.00709780
2     5415  3.22958e-03 -0.000299851 -0.005358910  0.00660643  0.00430520
3    11597 -4.35116e-03  0.013282400  0.006398130  0.01721600 -0.02275470
4     5416  4.01592e-03  0.001408180  0.005077310  0.00159497  0.00394816
5     3111  3.04779e-03 -0.002079510 -0.000127967 -0.00420436  0.01257460
6    11598  6.15318e-06 -0.000279919  0.001060880  0.00606267  0.00954331
I loaded this into a dataframe called pca, so I can plot the first two PCs using this command:
plot(pca$pc1, pca$pc2)

We might also want to look at the next two PCs:
plot(pca$pc2, pca$pc3)

 Its probably best to look at all of them together:
pairs(pca[2:4])


So this is where my mind plays tricks on me.  I can't make much sense out of these plots -- there should be four ethnic groups represented, but its hard to see who goes where.  To look at all of these dimensions simultaneously, we need a 3D plot.  Now 3D plots (especially 3D scatterplots) aren't highly regarded -- in fact I hear that some poor soul at the University of Washington gets laughed at for showing his 3D plots  -- but in this case I found them quite useful.

Using a library called rgl, I generated a 3D scatterplot like so:
 plot3d(pca[2:4])


Now, using the mouse I could rotate and play with the cloud of data points, and it became more clear how the ethnic groups sorted out.  Just to double check my intuition, I ran a model-based clustering algorithm (mclust) on the data.  Different parameters obviously produce different cluster patterns, but I found that using an "ellipsoidal model with equal variances" and a cluster size of 4 identified the groups I thought should be there based on the overlay with the HapMap samples.

fit <- Mclust(pca[2:4], G=4, modelNames = "EEV")
plot3d(pca[2:4], col = fit$classification) 
Basically, the red sphere corresponds to the European descent group, the green indicates the admixed African American group, the black group corresponds to the Hispanic group, and the blue identifying the Asian descent group.  We are still a bit confused as to why the Asian descent samples don't form a more concise cluster -- it may be due to relatively poor performance of these AIMs in Asian descent groups.   Whatever the case, you might notice several individuals falling either outside a clear cluster or at the interface between two groups.  The ethnic assignment for these individuals is questionable, but the clustering algorithm gives us a very nice measure of cluster assignment uncertainty.  We can plot this like so:
plot(pca[2:3], cex = fit$uncertainty*10)
I had to scale the uncertainty factor by 10 to make the questionable points more visible in this plot, shown as the hollow circles.  We will likely drop these samples from any stratified analyses.  We can export the cluster assignment by accessing the fit$classification column, and we have our samples assigned to an ethnic group.



5 comments:

  1. Great! This just gave me new ideas for the visualization of my data. But how did you get these 3d graphs moving?

    ReplyDelete
  2. Will wrote this post - I'm guessing he wrote a script to generate PNGs over many different angles, then used ImageMagick/montage to combine them all together into an animated gif. That's how I'd do it at least. Will?

    ReplyDelete
  3. The rgl package is rather extensive. I used a function called movie3d (with another function called rotate3d) to generate a series of images. If have the imagemagick tool Stephen mentioned installed, it will automagickly generate an animation for you. Otherwise you can use the "convert" command from imagemagick to create the animation from a stack of images.

    ReplyDelete
  4. Could you post the code used to generate the animation?

    Thanks

    ReplyDelete
  5. Why don't you colour your data points in the scatterplot matrices? Instead of using pairs, give the splom function a go (it is part of the lattice package); than you should also be able to colour your labels. I don't really see that the 3D makes too much of a difference until you coloured your points. Also, keep in mind that for papers you can't have a nice rotating image so you're still stuck trying to show something that is 3 dimensional in 2D when you publish (so you'll always occlude something); the splom doesn't occlude anything since it is already in 2D.

    ReplyDelete

 
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at GettingGeneticsDone.blogspot.com.