Thursday, April 29, 2010

More on the McClellan / King GWAS essay

First, if you haven't read the comments on my previous post on this paper, go take a look. Thanks to everyone for sharing your thoughts and pointing out some of my own oversights regarding this paper.

There was one issue in particular that deserves more attention than just another comment thread. McClellan and King draw special attention to a study by Kai Wang et al. (2009), Common genetic variants on 5p14.1 associate with autism spectrum disorders. Nature 459:528. A big thanks to Kai Wang for pointing out this particularly egregious misrepresentation by McClellan and King, and the emphasis I added in my own all-too-cursory review. McClellan and King discuss rs4307059, reported by Wang et al. to be associated with autism, as a “particularly dramatic example of the perils of cryptic population stratification”, reasoning that the substructure is a result of large frequency differences across Europe and the SNP's fixation in Africa. In fact, the frequency of this SNP is fairly consistent across large cohorts of European ancestry: European Americans (MAF=39%), WTCCC (MAF=38%), POPRES British (MAF=39%), POPRES Spanish (MAF=37%). The extreme estimates (.21-.77) come from extremely small samples (n=7 in Tuscany, MAF=75%, and n=15 in the Orcadian sample, MAF=25%). These sample sizes are far too small to estimate allele frequencies with any stability. In fact, you can see the allele frequency distribution across 51 populations here, which shows that it's quite similar across most of Europe:

Further, using the full Fst data set (which can be downloaded directly at this link), if you sort all Illumina SNPs by how much their allele frequencies vary across populations (more precisely, by Fst), rs4307059 lies right in the middle. In other words, it is fairly normal for a SNP with similar MAF to show this much variation in allele frequency among subpopulations in Europe or in HapMap.
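To make the sample-size point concrete, here is a minimal Python sketch of how wide a 95% confidence interval on an allele frequency is for a tiny sample versus a large cohort. It uses the Wilson score interval for a binomial proportion and treats each chromosome as an independent draw. The allele counts are illustrative assumptions, not numbers from Wang et al. or from McClellan and King: 11 of 14 chromosomes stands in for a sample of 7 people, and 780 of 2000 chromosomes stands in for a large cohort at roughly 39%.

```python
import math


def wilson_interval(allele_count, n_chromosomes, z=1.96):
    """Wilson score interval for an allele frequency.

    allele_count  : number of chromosomes carrying the allele
    n_chromosomes : total chromosomes sampled (2 x individuals for autosomes)
    z             : normal quantile; 1.96 gives roughly 95% coverage
    """
    p_hat = allele_count / n_chromosomes
    denom = 1 + z ** 2 / n_chromosomes
    center = (p_hat + z ** 2 / (2 * n_chromosomes)) / denom
    half_width = (
        z
        * math.sqrt(
            p_hat * (1 - p_hat) / n_chromosomes
            + z ** 2 / (4 * n_chromosomes ** 2)
        )
        / denom
    )
    return center - half_width, center + half_width


# Hypothetical counts, chosen only to illustrate the effect of sample size.
samples = {
    "tiny sample (7 people, 14 chromosomes)": (11, 14),
    "large cohort (1000 people, 2000 chromosomes)": (780, 2000),
}

for label, (count, n) in samples.items():
    low, high = wilson_interval(count, n)
    print(f"{label}: estimate={count / n:.2f}, "
          f"95% CI=({low:.2f}, {high:.2f}), width={high - low:.2f}")
```

With counts like these, the interval for the 14-chromosome sample spans a large fraction of the possible frequency range, while the interval for the large cohort is only a few percentage points wide. Any comparison of allele frequencies across populations should report that uncertainty alongside the point estimate.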

There are a few other issues pointed out in the comment thread that deserve attention. McClellan and King asserted, and I emphasized, that most GWAS hits do not replicate. While it's definitely true that nonreplication was a huge issue in genetic association studies in the past and in the early days of GWAS, most GWAS hits that are genome-wide significant (e.g. p<1e-8) DO replicate. Studies done with family designs, whose results can't be explained away by population stratification, add further evidence that many of these associations are genuine. And simply because a SNP lies outside a region with known biological function doesn't mean we should wave it off so easily. There's a nice discussion of this over at Gene Expression.


  1. Thanks for opening a new thread for discussion. I realized that the Toscani population is actually part of HapMap3, so the allele frequency can be inferred from there (n=102, still small but good enough). I assumed that "Toscani in Italia" in HapMap is similar to "Tuscan Italy" in HGDP. The MAF (C allele) is indeed 41% in the HapMap sample (202 chromosomes, HapMap 3 release 3), which is fairly similar to European Americans and not even remotely close to the 77% number inferred from n=7 by McClellan et al.

    This exercise represents a particularly dramatic example of how small sample sizes in whole-genome data can lead to biased estimates and conclusions. It also teaches us that rigorous scientists should always quantify the uncertainty of their results, or at least report a sample size or power. Failing to do that, we would be merely fooling ourselves and the readers outside of the community.

  2. I have really enjoyed reading all of the commentary about this article provided on this blog and other linked blogs. Overall - I feel the authors fail to recognize the process that is science ... sure we might not know the functionality right now - but many of the replicated GWAS signals are clearly very robust associations and deserve to be explored further. For example, it wasn't that long ago that the ENCODE project provided evidence of the pervasively transcribed genome, nearly eliminating the term "junk DNA" from having any meaning. Who knows what we might learn in a few years after pursuing the GWAS hits?

    While a thought provoking article, I think it is filled with justification and support for their own underlying research agenda.

  3. I agree there are interesting problems with the paper...however there is also a "too big to fail/bubble" aspect to GWAS and many other systems biology approaches. At a recent conference in Oxford one of the science/phil folks raised the point: "one thing that systems approaches definitely do is concentrate resources at large centres" to pick on the picked on...what does a community of fairly narrowly trained folks do...if the one approach they know well falls out of fashion. A lot of 80s era organismal/biochemist types might have some wisdom to share...short answer..the U.S. solution is "off with their heads"..not fun.

  4. What does a community of fairly narrowly trained folks do...if the one approach they know well falls out of fashion. The U.S. solution is "off with their heads."

    Or, start sequencing. The last bastion of the sinner.

  5. How can I plot this figure using R (allele frequencies on a world map)?

  6. It seems like it is much more dense in Europe than in Africa unless the chart shown above is not to scale.



Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.