Thursday, October 27, 2011

A New Dimension to Principal Components Analysis



In general, the standard practice for correcting for population stratification in genetic studies is to use principal components analysis (PCA) to categorize samples along different ethnic axes.  Price et al. published on this in 2006, and since then PCA plots are a common component of many published GWAS studies.  One key advantage to using PCA for ethnicity is that each sample is given coordinates in a multidimensional space corresponding to the varying components of their ethnic ancestry.  Using either full GWAS data or a set of ancestral informative markers (AIMs), PCA can be easily conducted using available software packages like EIGENSOFT or GCTA. HapMap samples are sometimes included in the PCA analysis to provide a frame of reference for the ethnic groups.  

Once computed, each sample will have values that correspond to a position in the new coordinate system that effectively clusters samples together by ethnic similarity.  The results of this analysis are usually plotted/visualized to identify ethnic outliers or to simply examine the structure of the data.  A common problem however is that it may take more than the first two principal components to identify groups.  

To illustrate, I will plot some PCs generated based on 125 AIMs markers for a recent study of ours.  I generated these using GCTA software and loaded the top 5 PCs into R using the read.table() function.  I loaded the top 5, but for continental ancestry, I've found that the top 3 are usually enough to separate groups.  The values look something like this:  
    new_ruid        pc1            pc2               pc3              pc4               pc5
1    11596  4.10996e-03 -0.002883830  0.003100840 -0.00638232  0.00709780
2     5415  3.22958e-03 -0.000299851 -0.005358910  0.00660643  0.00430520
3    11597 -4.35116e-03  0.013282400  0.006398130  0.01721600 -0.02275470
4     5416  4.01592e-03  0.001408180  0.005077310  0.00159497  0.00394816
5     3111  3.04779e-03 -0.002079510 -0.000127967 -0.00420436  0.01257460
6    11598  6.15318e-06 -0.000279919  0.001060880  0.00606267  0.00954331
I loaded this into a dataframe called pca, so I can plot the first two PCs using this command:
plot(pca$pc1, pca$pc2)

We might also want to look at the next two PCs:
plot(pca$pc2, pca$pc3)

 Its probably best to look at all of them together:
pairs(pca[2:4])


So this is where my mind plays tricks on me.  I can't make much sense out of these plots -- there should be four ethnic groups represented, but its hard to see who goes where.  To look at all of these dimensions simultaneously, we need a 3D plot.  Now 3D plots (especially 3D scatterplots) aren't highly regarded -- in fact I hear that some poor soul at the University of Washington gets laughed at for showing his 3D plots  -- but in this case I found them quite useful.

Using a library called rgl, I generated a 3D scatterplot like so:
 plot3d(pca[2:4])


Now, using the mouse I could rotate and play with the cloud of data points, and it became more clear how the ethnic groups sorted out.  Just to double check my intuition, I ran a model-based clustering algorithm (mclust) on the data.  Different parameters obviously produce different cluster patterns, but I found that using an "ellipsoidal model with equal variances" and a cluster size of 4 identified the groups I thought should be there based on the overlay with the HapMap samples.

fit <- Mclust(pca[2:4], G=4, modelNames = "EEV")
plot3d(pca[2:4], col = fit$classification) 
Basically, the red sphere corresponds to the European descent group, the green indicates the admixed African American group, the black group corresponds to the Hispanic group, and the blue identifying the Asian descent group.  We are still a bit confused as to why the Asian descent samples don't form a more concise cluster -- it may be due to relatively poor performance of these AIMs in Asian descent groups.   Whatever the case, you might notice several individuals falling either outside a clear cluster or at the interface between two groups.  The ethnic assignment for these individuals is questionable, but the clustering algorithm gives us a very nice measure of cluster assignment uncertainty.  We can plot this like so:
plot(pca[2:3], cex = fit$uncertainty*10)
I had to scale the uncertainty factor by 10 to make the questionable points more visible in this plot, shown as the hollow circles.  We will likely drop these samples from any stratified analyses.  We can export the cluster assignment by accessing the fit$classification column, and we have our samples assigned to an ethnic group.



Tuesday, October 18, 2011

My thoughts on ICHG 2011

I’m a bit exhausted from a week of excellent science at ICHG. First, let me say that Montreal is a truly remarkable city with fantastic food and a fascinating blend of architectural styles, all making the meeting a fun place to be…. Now on to the genomics – I’ll recap a few of the most exciting sessions I attended. You can find a live-stream of tweets from the meeting by searching the #ICHG2011 and #ICHG hashtags.


On Wednesday, Marylyn Ritchie(@MarylynRitchie) and Nancy Cox organized “Beyond Genome-wide association studies”.  Nancy Cox presented some ideas on how to integrate multiple “intermediate” associations for SNPs, such as expression QTLs and newly discovered protein QTLs (More on pQTLs later). This approach which she called a Functional Unit Analysis would group signals together based on the genes they influence. Nicholas Shork presented some nice examples of pros and cons of sequence level annotation algorithms.  Trey Idekker gave a very nice talk illustrating some of the properties of epistasis in yeast protein interaction networks. One of the more striking points he made was that epistasis tends to occur between full protein complexes rather than within elements of the complexes themselves. Marylyn Ritchie presented the ideas behind her ATHENA software for machine learning analysis of genetic data, and Manuel Mattesian from Tim Becker’s group presented the methods in their INTERSNP software for doing large-scale interaction analysis. What was most impressive with this session is that there were clear attempts to incorporate underlying biological complexity into data analysis.

On Thursday, I attended the second Statistical Genetics section called “Expanding Genome-wide Association Studies”, organized by Saurabh Ghosh and Daniel Shriner. Having recently attended IGES, I feel pretty “up” on newer analysis techniques, but this session had a few talks that sparked my interest. The first three talks were related to haplotype phasing and the issues surrounding computational accuracy and speed. The basic goal of all these methods is to efficiently estimate genotypes for a common set of loci for all samples of a study using a set of reference haplotypes, usually from the HapMap or 1000 genomes data. Despite these advances, it seems like phasing haplotypes for thousands of samples is still a massive undertaking that requires a high-performance computing cluster. There were several talks about ongoing epidemiological studies, including the Kaiser Permanente UCSF cohort. Neil Risch presented an elegant study design implementing four custom GWAS chips for the four targeted populations. Looks like the data hasn't started to flow from this yet, but when it does we’re sure to learn about lots of interesting ethnic-specific disease effects. My good friend and colleague Dana Crawford presented an in silico GWAS study of hypothyroidism. In her best NPR voice, Dana showed how electronic medical records with GWAS data in the EMERGE network can be re-used to construct entirely new studies nested within the data collected for other specific disease purposes. Her excellent Post-Doc, Logan Dumitrescu presented several gene-environment interactions between Lipid levels and vitamin A and E from Dana’s EAGLE study. Finally Paul O’Reilly presented a cool new way to look at multiple phenotypes by essentially flipping a typical regression equation around, estimating coefficients that relate each phenotype in a study to a single SNP genotype as an outcome. This rather clever approach called MultiPhen is similar to log-linear models I’ve seen used for transmission-based analysis, and allows you to model the “interaction” among phenotypes in much the same way you would look at SNP interactions.

 By far the most interesting talks of the meeting (for me) were in the Genomics section on Gene Expression, organized by Tomi Pastinen and Mark Corbett. Chris Mason started the session off with a fantastic demonstration of the power of RNA-seq. Examining transcriptomes of 14 non-human primate species, they validated many of the computational predictions in the AceView gene build, and illustrated that most “exome” sequencing is probably examining less than half of all transcribed sequences. Rupali Patwardhan talked about a system for examining the impact of promoter and enhancer mutations in whole mice, essentially using mutagenesis screens to localize these regions. Ron Hause presented work on the protein QTLs that Nancy Cox alluded to earlier in the conference. Using a high-throughput form of western blots, they systematically examined levels for over 400 proteins in the Yoruba HapMap cell lines. They also illustrate that only about 50% of eQTLs identified in these lines actually alter protein levels. Stephen Montgomery spoke about the impact of rare genetic variants within a transcript on transcript levels. Essentially he showed an epistatic effect on expression, where transcripts with deleterious alleles are less likely to be expressed – an intuitive and fascinating finding, especially for those considering rare-variant analysis. Athma Pai presented a new QTL that influences mRNA decay rates. By measuring multiple time points using RNA-seq, she found individual-level variants that alter decay, which she calls dQTLs. Veronique Adoue looked at cis-eQTLs relative to transcription factor binding sites using ChIP, and Alfonso Buil showed how genetic variants influence gene expression networks (or correlation among gene expression) across tissue types.

 I must say despite all the awesome work presented in this session, Michael Snyder stole the show with his talk on the “Snyderome” – his own personal –omics profile collected over 21 months. His whole-genome was sequenced by Complete Genomics, and processed using Rong Chen and Atul Butte’s risk-o-gram to quantify his disease risk. His profile predicted increased risk of T2D, so he began collecting glucose measures and low and behold, he saw a sustained spike in blood glucose levels following a few days following a common cold. His interpretation was that an environmental stress knocked him into a pseudo-diabetic state, and his transcriptome and proteome results corroborated this idea. Granted, this is an N of 1, and there is still lots of work to be done before this type of analysis revolutionizes medicine, but the take home message is salient – multiple -omics are better than one, and everyone’s manifestation of a complex disease is different. This was truly thought-provoking work, and it nicely closed an entire session devoted to understanding the intermediate impact of genetic variants to better understand disease complexity.

This is just my take of a really great meeting -- I'm sure I missed lots of excellent talks.  If you saw something good please leave a comment and share!

Saturday, October 8, 2011

Find me at ICHG!

This week, I'm off to Montreal for the International Congress on Human Genetics and I hope to see you there!

If you are attending and are already a part of the Twitterverse, bring a tablet or phone and tweet away about the meeting using the official hashtag, #ICHG2011. If you are new to twitter, go sign up! Using nearly any twitter application, you can search for tweets that contain the #ICHG2011 hashtag and follow the thoughts of your fellow conference goers.

Using Twitter at an academic conference is a fascinating experience! Not only can you get fantastic information about what is being presented at the multitude of sessions, you get lots of opinion on what is going on in the field, and sometimes practically useful tips, like the location of the nearest Starbucks.

If you see me wandering aimlessly around the conference center, please come say hello! Its always great to make new academic friends.

See you there!

Will
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.