Thursday, June 24, 2010

Using Expression Data to Mine the "Gray Zone" of GWAS

Researchers in the ENGAGE consortium used a clever technique to leverage genome-wide expression data to select or prioritize genes for GWAS analysis. The investigators published the novel candidate genes for obesity in this month's PLoS Genetics, but I think the method they used here is more interesting.

If you're studying obesity and you find that expression of some gene correlates with BMI, you have a problem: you don't know whether the correlation indicates a causal relationship or whether the changes in gene expression were simply reactive to changes in body composition. This is the case when looking at unrelated individuals - some correlations will be reactive, others potentially causal. However, if you're looking only within pairs of identical (MZ) twins, you know that all the correlations you see are reactive, because MZ twins are genetically identical. The authors here took an interesting approach: they prioritized for GWAS analysis the genes whose expression was correlated with BMI in the unrelated individuals only, and not in the MZ twins.
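To make the filtering logic concrete, here's a minimal sketch in Python. It is not the authors' actual procedure (the paper's statistics aren't described in this post). I'm assuming a plain Pearson correlation, a made-up cutoff of 0.5, and that the twin data are within-pair differences in expression and BMI. All names and numbers below are hypothetical toy values.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def classify_gene(expr_unrel, bmi_unrel, d_expr_mz, d_bmi_mz, cutoff=0.5):
    """Label a transcript from its two correlations (cutoff is hypothetical)."""
    r_unrel = abs(pearson(expr_unrel, bmi_unrel))
    r_mz = abs(pearson(d_expr_mz, d_bmi_mz))
    if r_unrel < cutoff:
        return "unclassified"  # no expression-BMI signal to explain
    # Correlated within MZ pairs too -> cannot be due to sequence variation
    return "reactive" if r_mz >= cutoff else "candidate causal"

# Toy data: expression and BMI in six unrelated individuals
expr_unrel = [1, 2, 3, 4, 5, 6]
bmi_unrel = [20, 22, 21, 25, 27, 28]

# Toy data: within-pair differences (twin 1 minus twin 2) in six MZ pairs
d_bmi_mz = [1.0, 2.0, -1.5, -0.5, 0.5, -1.5]
d_expr_gene_a = [0.5, -0.3, 0.2, -0.4, 0.1, -0.1]  # unrelated to BMI difference
d_expr_gene_b = [0.5, 1.0, -0.7, -0.3, 0.2, -0.7]  # tracks BMI difference

label_a = classify_gene(expr_unrel, bmi_unrel, d_expr_gene_a, d_bmi_mz)
label_b = classify_gene(expr_unrel, bmi_unrel, d_expr_gene_b, d_bmi_mz)
print("gene A:", label_a)
print("gene B:", label_b)
```

Only genes that land in the "candidate causal" bin would be carried forward to the GWAS lookup. A real analysis would use proper significance tests rather than a fixed correlation cutoff.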

Following up these "causal" genes in a GWAS analysis, the authors found that the p-value distribution was highly biased away from the null - in other words, more of these genes were associated than you'd expect by chance. The genes dubbed reactive were biased toward the null, i.e. fewer variants in these genes were associated with the phenotype.

While not everyone has easy access to whole-genome expression data on MZ twins before doing a GWAS, I wonder if the idea can be extended to siblings or even more distant relatives, perhaps using the kinship coefficient as a measure of relatedness between two individuals to "nudge" the transcript in question more towards causal versus reactive. Anyhow, check out the paper linked below; it's a very clever idea.

(On a slightly related note, check out this interesting discussion about open access publishing a la PLoS versus traditional scientific publishing)

PLoS Genetics: Use of Genome-Wide Expression Data to Mine the “Gray Zone” of GWA Studies Leads to Novel Candidate Obesity Genes

Wednesday, June 16, 2010

Ten Reasons Why Grad Students Should Blog

NYU PhD student Drew Conway has compiled a very nice list of 10 reasons why grad students should blog. I've been writing GGD for a little over a year now and it's been a great way to extend my own network past the Vanderbilt walls, participate in lively discussions with other scientists oceans away, and to write stuff that people actually read and find useful. Especially for grad students (and postdocs as well), one of the most important points in Drew's post is on using a blog to establish an identity:
If you are in graduate school to be the “best kept secret in academia,” you are making a fatal mistake. As with any other job market, getting the proverbial foot in the door for a job talk at a university is a critical first step. As a graduate student it can be incredibly difficult to navigate the sea of senior faculty, their research agendas, and how that fits into your career goals. Having a blog provides you an independent beacon upon which you can broadcast your own ideas.
Another reason I can add to the list is related to #4 - extending your network. It's very gratifying to go to a meeting or conference and meet someone who regularly reads your blog. In a way they already know who you are, and you immediately have a starting point to launch a conversation. Check out the full list at Drew's blog, Zero Intelligence Agents, at the link below.

Ten Reasons Why Grad Students Should Blog

Monday, June 14, 2010

The Sweeping Assumptions of GWAS

As a graduate student a few years ago, I learned about (and in some cases witnessed) the various phases, fads, and revolutions in the field of human genetics. The mid-to-late 1990s saw a shift from family-based linkage analysis to a plethora of small candidate gene studies. The early 2000s saw the completion of the human genome project, the development of the HapMap project, and the birth of genome-wide association studies. And very shortly, I believe we will transition to either partial or whole-genome sequencing as the study design of choice. Lots of factors motivate these shifts, including cost-effectiveness, sample availability, pressure to innovate, and simply the "bandwagon" effect.

It often seems, however, that the last factor to be considered is the hypothesis of the underlying disease model. Much like choosing a statistical test, each study design is coupled with a specific hypothesis and corresponding assumptions that are tested. Linkage (co-segregation of a genomic region with disease), candidate gene (association of a specific allele within a gene of interest), GWAS (association of common variants), and sequencing (identification of low frequency alleles) carry with them a null hypothesis that can be rejected when the study is sufficiently powered.

In 2000, when the march toward GWAS began, Terwilliger and Göring presented arguments against the common disease/common variant hypothesis, and they recently published an updated perspective on this argument.

Terwilliger JD, Göring HH. Update to Terwilliger and Göring's "Gene mapping in the 20th and 21st centuries" (2000): gene mapping when rare variants are common and common variants are rare. Hum Biol. 2009 Dec;81(5-6):729-33.


I have to admit that GWAS bashing is always a fun read, and the authors go the extra mile by citing references to all those who are now suddenly adopting their view in light of new sequencing technologies. I would however like to point out that the authors could easily find themselves cited in a future publication that denotes the folly of whole-genome sequencing… After all, there are so many possible explanations for the missing heritability of common diseases – why should we expect the multiple rare-variant/common disease hypothesis to be the holy grail?

Besides, EVERYONE knows it's all due to methylation. :)

Friday, June 11, 2010

Seminar: T-Test on Fold Changes

I've had friends in biochem "wet" labs who've asked me to do some simple statistics on some of their results. This looks like an interesting seminar to attend if you've ever thought about doing a t-test on fold changes in some outcome measure between treatment and control groups, a pretty common outcome in biochemical assays. If the speaker provides slides electronically I'll happily post them here after the seminar.

Department of Biostatistics Seminar/Workshop Series:

T-Test on Fold Changes

Tatsuki Koyama, PhD
Assistant Professor of Biostatistics, Cancer Biostatistics Center, Vanderbilt-Ingram Cancer Center
Wednesday, June 16, 1:30-2:30pm, MRBIII Conference Room 1220

Basic science experiments often use a separate control group for each treatment group. Typically, the treatment group outcomes are scaled by the average of the corresponding control group outcomes. Despite its overwhelming popularity, this "fold change" method has serious statistical problems resulting in reduced validity. When the implicit variability of the control group outcomes is ignored, a large type I error inflation can result. Likewise, this scaling induces correlation and can substantially inflate the type I error when this correlation is ignored. We present simulations showing that this inflation results in type I error rates as high as 50% in everyday settings. We propose some computational and analytical approaches for dealing with this problem, and we present some practical recommendations for experimental designs with small sample sizes. Intended audience: Clinical and basic science researchers and statisticians.
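
To get a feel for the problem the abstract describes, here's a small simulation sketch. This is my own toy setup, not the speaker's simulations: I'm assuming n = 3 per group, normally distributed outcomes with the same mean and SD in both groups (so the null is true), and a "naive" analysis that divides each treated value by the control mean and runs a one-sample t-test of the fold changes against 1. The comparison is an ordinary pooled two-sample t-test on the raw values.

```python
import math
import random

N = 3                 # observations per group (assumed)
T_CRIT_DF2 = 4.303    # two-sided 5% critical value, t with N - 1 = 2 df
T_CRIT_DF4 = 2.776    # two-sided 5% critical value, t with 2N - 2 = 4 df

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def simulate(n_sims=20000, mu=10.0, sigma=2.0, seed=1):
    """Return (naive, pooled) rejection rates when there is no true effect."""
    rng = random.Random(seed)
    naive_hits = 0
    pooled_hits = 0
    for _ in range(n_sims):
        control = [rng.gauss(mu, sigma) for _ in range(N)]
        treated = [rng.gauss(mu, sigma) for _ in range(N)]
        cbar = mean(control)

        # Naive: treat the control mean as a fixed constant and
        # test whether the mean fold change differs from 1.
        fold = [x / cbar for x in treated]
        t_naive = (mean(fold) - 1.0) / math.sqrt(sample_var(fold) / N)
        if abs(t_naive) > T_CRIT_DF2:
            naive_hits += 1

        # Pooled two-sample t-test on the raw values, which
        # accounts for the variability of the control group.
        sp2 = (sample_var(treated) + sample_var(control)) / 2.0
        t_pooled = (mean(treated) - cbar) / math.sqrt(sp2 * 2.0 / N)
        if abs(t_pooled) > T_CRIT_DF4:
            pooled_hits += 1
    return naive_hits / n_sims, pooled_hits / n_sims

naive_rate, pooled_rate = simulate()
print("naive fold-change test type I error:", naive_rate)
print("pooled two-sample test type I error:", pooled_rate)
```

The pooled test should sit close to the nominal 5%, while the naive fold-change test rejects more often because the noise in the control mean is ignored. This toy setting only shows the direction of the effect; the abstract's much larger inflation presumably comes from other designs.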

Wednesday, June 9, 2010

Efficient Mixed-Model Association eXpedited (EMMAX) to Simultaneously Account for Relatedness and Stratification in Genome-Wide Association Studies

A few months ago I covered an algorithm called EMMA (Efficient Mixed-Model Association), implemented in R, that simultaneously corrects for both population stratification and relatedness in an association study. This method/software is very useful because most methods that account for relatedness in an association study assume a genetically (ethnically) homogeneous population, while methods that detect and correct for population stratification typically assume individuals are unrelated. The EMMA algorithm simultaneously accounts for both types of population structure by using a linear mixed model with an empirically estimated relatedness matrix to model the correlation between phenotypes of sample subjects.

The original EMMA algorithm, however, is computationally infeasible for datasets with thousands of individuals because the variance component parameters are estimated separately for each marker. On the authors' large GWAS dataset this takes about 10 minutes per marker, so a full scan would take over 6 years on a single processor. A new implementation of the algorithm called EMMAX (Efficient Mixed-Model Association eXpedited) makes the simplifying assumption that, because the effect of any given SNP on the trait is typically small, the variance parameters only need to be estimated once for the entire dataset rather than once for each marker.
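
For readers who prefer to see the model written out, here is a sketch of the mixed model in generic notation. The symbols are my own choice, not necessarily the paper's: y is the phenotype vector, X_j holds the covariates plus the genotype at marker j, K is the empirically estimated relatedness matrix, and the two variance components are sigma_g^2 and sigma_e^2. I'm also assuming the "once for the entire dataset" fit is done on a model without any SNP term.

```latex
% Linear mixed model for marker j
y = X_j \beta_j + u + \varepsilon, \qquad
u \sim N(0, \sigma_g^2 K), \quad
\varepsilon \sim N(0, \sigma_e^2 I)

% Implied phenotypic covariance
V = \sigma_g^2 K + \sigma_e^2 I

% EMMA:  re-estimate (\sigma_g^2, \sigma_e^2) for every marker j.
% EMMAX: estimate (\hat\sigma_g^2, \hat\sigma_e^2) once, form
%        \hat V = \hat\sigma_g^2 K + \hat\sigma_e^2 I,
%        and reuse it in a generalized least squares fit per marker:
\hat\beta_j = \left( X_j^{\top} \hat V^{-1} X_j \right)^{-1} X_j^{\top} \hat V^{-1} y
```

The saving comes from the fact that the expensive step, fitting the variance components, no longer sits inside the loop over markers.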

In the paper, the authors take the Northern Finland Birth Cohort and estimate genomic control inflation factors (lambda) for three sets of test statistics: uncorrected, adjusted for the top 100 principal components using Eigenstrat, and corrected for structure using the EMMAX algorithm. The inflation factors were closest to 1 for the EMMAX-corrected tests. Further, whereas genomic control simply adjusts all test statistics downward without changing their rank, the EMMAX method does change the ranks of the test statistics across SNPs.
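
As a quick illustration of the rank point, here's a sketch of the usual genomic control calculation in Python. I'm assuming 1-df chi-square test statistics and the standard median-based estimator of the inflation factor (conventionally written lambda); the statistics in the list are invented toy values, not results from the paper.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549  # median of a chi-square distribution with 1 df

def gc_lambda(chi2_stats):
    """Median-based genomic control inflation factor."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def gc_correct(chi2_stats):
    """Divide every statistic by the same inflation factor."""
    lam = gc_lambda(chi2_stats)
    return [s / lam for s in chi2_stats]

def rank_order(values):
    """Indices of the values sorted from smallest to largest."""
    return sorted(range(len(values)), key=values.__getitem__)

# Hypothetical 1-df chi-square statistics for seven SNPs
toy_stats = [0.1, 0.6, 0.9, 1.4, 2.5, 12.0, 0.3]

lam = gc_lambda(toy_stats)
corrected = gc_correct(toy_stats)
same_ranks = rank_order(toy_stats) == rank_order(corrected)

print("inflation factor before correction:", lam)
print("inflation factor after correction:", gc_lambda(corrected))
print("SNP ranking unchanged:", same_ranks)
```

Because every statistic is divided by the same constant, the ordering of SNPs cannot change. A model-based correction like EMMAX adjusts each SNP differently, which is why the ranks can move.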

A beta version of EMMAX is available online, with a complete version to be released soon. Conveniently, the software is able to take a PLINK transposed ped file and covariate files as input (tped and tfam documentation here).

Nature Genetics Technical Report - Variance component model to account for sample structure in genome-wide association studies

Monday, June 7, 2010

Goncalo Abecasis "Sequencing 1000s of Human Genomes" Faculty Candidate Seminar

Vanderbilt Center for Human Genetics Research faculty candidate Goncalo Abecasis will be interviewing for a faculty position here this week. Come check out his seminar - "Sequencing Thousands of Human Genomes" - Friday June 11th, 2-3pm in 512 Light Hall.
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.