DNA genotyping and sequencing are getting cheaper every day. As Oxford Nanopore CTO Clive Brown recently discussed at Genomes Unzipped, when the cost of a full DNA sequence begins to fall below $1000, the value of having that information far outweighs the cost of data generation.
Participant collection and ascertainment, however, aren't getting cheaper any time soon, spurring a burgeoning interest in using DNA biobanks and electronic medical records (EMRs) for genomics research (reviewed here). In fact, this is exactly the focus of the eMERGE network - a consortium of five sites with biobanks linked to electronic medical records for genetic research. The first order of business for the eMERGE network was assessment: can DNA biobanks + EMRs be used for genetic research? This question was answered with a demonstration project using Vanderbilt University's BioVU biobank+EMR. Here, 21 previously reported associations with five complex diseases were tested against electronically abstracted phenotypes from BioVU's EMR. This forest plot shows that of the 21 previously reported associations (red), five replicated at a nominal significance threshold, and for the rest, the calculated odds ratios (blue) trended in the same direction as the reported association.
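To make the replication criterion concrete, here is a minimal Python sketch of how an odds ratio and its 95% confidence interval can be computed from a 2x2 table of counts (the Woolf log method), and how "replicated at nominal significance" differs from "trended in the same direction". This is my own illustration, not the method used in the eMERGE analysis, which would typically fit a regression model with covariates. The function names and all counts are made up for the example.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf 95% CI from a 2x2 table.

    a = cases carrying the risk allele,    b = cases not carrying it
    c = controls carrying the risk allele, d = controls not carrying it
    (hypothetical layout chosen for this illustration)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

def same_direction(reported_or, observed_or):
    """True if both odds ratios fall on the same side of 1."""
    return (reported_or > 1) == (observed_or > 1)

def replicates(reported_or, a, b, c, d):
    """Nominal replication: the CI excludes 1 and the direction matches."""
    observed_or, lower, upper = odds_ratio_ci(a, b, c, d)
    excludes_one = lower > 1 or upper < 1
    return excludes_one and same_direction(reported_or, observed_or)

# Made-up counts: same observed OR, two different sample sizes.
small = (30, 70, 20, 80)
large = (300, 700, 200, 800)

for label, table in (("small", small), ("large", large)):
    or_, lo, hi = odds_ratio_ci(*table)
    print(f"{label}: OR={or_:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), "
          f"replicates={replicates(1.5, *table)}")
```

With the smaller table the interval straddles 1, so the association only trends in the reported direction. The larger table has the same point estimate but a tighter interval that excludes 1, so it would count as a nominal replication.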
While electronic phenotyping is much cheaper than ascertainment in the traditional sense, it can still be costly and labor intensive, involving iterative cycles of manual medical record chart review followed by refinement of natural language processing algorithms. In many instances, self-report can be comparably accurate, and much easier to obtain (for example, compare the eMERGE network's hypothyroidism phenotyping algorithm with simply asking the question: "Have you ever been diagnosed with Hashimoto's Thyroiditis?").
This is the approach to genetic research 23andMe is taking. Joyce Tung presented some preliminary results at last year's ASHG conference, and now you can read the preprint of the research paper online at Nature Precedings - "Efficient Replication of Over 180 Genetic Associations with Self-Reported Medical Data". In this paper the authors amassed self-reported data from >20,000 genotyped 23andMe customers on 50 medical phenotypes and attempted to replicate known findings from the GWAS catalog. Using only self-reported data, the authors were able to replicate, at a nominal significance level, 75% of the previously reported associations that they were powered to detect. Figure 1 from the preprint is similar to the figure above, where blue X's are prior GWAS hits and black dots and lines represent their ORs and 95% CIs:
One might ask whether there is any confounding due to the fact that 23andMe customers can view trait/disease risks before answering questions (see this paper and this discussion by Luke Jostins). The authors investigated this and did not observe a consistent or significant effect of seeing genetic risk results before answering questions. There's also a good discussion regarding SNPs that failed to replicate, and a general discussion of using self-report from a recontactable genotyped cohort for genetic studies.
Other than 23andMe's customer base, I can't think of any other genotyped, recontactable cohort that has an incentive to fill out research surveys for this kind of study. But this team has shown here and in the past that this model for performing genetic research works and is substantially cheaper than traditional ascertainment or even mining electronic medical records. I look forward to seeing more research results from this group!
Efficient Replication of Over 180 Genetic Associations with Self-Reported Medical Data
As I understand it, the replication data reach significance when the CIs for the blue dots do not include OR=1. If so, there are 7 markers significant in the replication. Is that right?