Wednesday, June 27, 2012

Browsing dbGAP Results

Thanks to the excellent work of Lucia Hindorff and colleagues at NHGRI, the GWAS catalog provides a great reference for the cumulative results of GWAS for various phenotypes.  Anyone familiar with GWAS also likely knows about dbGaP – the NCBI repository for genotype-phenotype relationships – and the wealth of data it contains.  While dbGaP is often thought of as a way to get access to existing genotype data, analysis results are often deposited into dbGaP as well.  Individual-level data (like genotypes) are generally considered “controlled access”, requiring special permission to retrieve or use.  Summary-level data, such as association p-values, are a bit more accessible.  There are two tools available from the dbGaP website: the Association Results Browser and the Phenotype-GenotypeIntegrator (PheGenI).  These tools provide a search interface for examining previous GWAS associations. 

The Association Results Browser provides a simple table listing of associations, searchable by SNP, gene, or phenotype.  It contains the information from the NHGRI GWAS catalog, as well as additional associations from dbGaP deposited studies.  I’ve shown an example below for multiple sclerosis.  You can restrict the search to the dbGaP-specific results by changing the “Source” selection.  If you are looking for the impact of a SNP, this is a nice supplement to the catalog.  Clicking on a p-value brings up the GaP browser, which provides a more graphical (but perhaps less useful) view of the data.

The PheGenI tool provides similar search functionality, but attempts to provide phenotype categories rather than more specific phenotype associations.  Essentially, phenotype descriptions are linked to MeSH terms to provide categories such as “Chemicals and Drugs”, or “Hemic and Lymphatic Diseases”.  PheGenI seems most useful if searching from the phenotype perspective, while the association browser seems better for SNP or Gene searches.  All these tools are under active development, and I look forward to seeing their future versions.

Thursday, June 21, 2012

Identifying Pathogens in Sequencing Data

I just read an interesting paper on pathogen discovery using next-generation sequencing data, recommended to me by Nick Loman.

A previously described algorithm (PathSeq, Kostic et al) for discovering microbes by deep-sequencing human tissue uses computational subtraction, whereby the initial collection of reads is depleted of human DNA by consecutive alignment to the human reference using MAQ and BLAST.
The PathSeq method: computational subtraction by depleting complete read set of all reads mapping to the human reference.

The method described here, Rapid Identification of Nonhuman Sequences (RINS), uses and intersection-based workflow rather than computational subtraction. RINS first maps to a user-supplied custom reference (e.g. a collection of all known viruses and bacteria), thereby drastically lowering computational requirements and increasing sensitivity. Contigs are then assembled de novo, and the original reads are then mapped back onto assembled contigs, which increases specificity.
The RINS method - uses intersection rather than subtraction to identify non-human reads.

The authors of the RINS (intersection) paper noted similar sensitivity and specificity to the PathSeq (subtraction) method in a fraction of the time (2 hours on a desktop machine versus 13 hours on the cloud).

Rapid Identification of Nonhuman Sequences in High Throughput Sequencing Data Sets

Monday, June 11, 2012

The HaploREG Database for Functional Annotation of SNPs

The ENCODE project continues to generate massive numbers of data points on how genes are regulated.  This data will be of incredible use for understanding the role of genetic variation, both for altering low-level cellular phenotypes (like gene expression or splicing), but also for complex disease phenotypes.  While it is all deposited into the UCSC browser, ENCODE data is not always the easiest to access or manipulate. 

To make epigenomic tracks from the ENCODE project more accessible for interpretation in the context of new or existing GWAS hits, Luke Ward and Manolis Kellis at the BROAD Institute have developed a database called HaploREG.  HaploREG uses LD and SNP information from the 1000 Genomes project to map known genetic variants onto ENCODE data, providing a potential mechanism for SNP influence.  HaploREG will annotate SNPs with evolutionary constraint measures, predicted chromatin states, and how SNPs alter the Positional Weight Matrices of known transcription factors.

Here's a screenshot from SNP associated with HDL cholesterol levels showing summary information for several SNPs in LD at R2>0.9 in CEU. Clicking each SNP link provides further info.

In addition to providing annotations of user-submitted SNPs, HaploREG also provides cross-references from the NHGRI GWAS Catalog, allowing users to explore the mechanisms behind disease associated SNPs.  Check out the site here: and explore the functionality of any SNPs you might find associated in your work.  The more functional information we can include in our manuscripts, the more likely they are to be tested in a model system.

HaploReg: Functional Annotation of SNPs
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.