
Tuesday, April 22, 2014

Russ Altman's Translational Bioinformatics Year in Review

A few weeks ago, the 2014 AMIA Translational Bioinformatics Meeting (TBI) was held in beautiful San Francisco.  This meeting is full of great science that spans the divide between molecular and clinical research, and a true highlight is the closing keynote, traditionally given by Russ Altman.  Each year, he highlights the most exciting articles/publications in translational bioinformatics and presents them all in a tidy, entertaining talk.  You can find his blog here, along with a link to his slide deck.

To make accessing the papers a bit easier, here is a list of links.

Controversies:

Clinical Genomics:

Drugs:

Genetic Basis of Disease:

Emerging Data Sources:

Mice:

The Scientific Process:

Odds & Ends:

Wednesday, November 20, 2013

Using Database Joins to Compare Results Sets

One of the most powerful tools you can learn to use in genomics research is a relational database system, such as MySQL.  These systems are fairly easy to set up and use, and give users the ability to organize and manipulate data and statistical results with simple commands.  As a graduate student (during the height of GWAS), this single skill quickly turned me into an “expert”.  By storing the SNP lists for common GWAS platforms and some simple annotations from the UCSC and ENSEMBL databases, I could quickly provide lists of SNPs within a gene or collection of genes, or pull a list of SNPs that overlap two genotyping platforms.  We even developed database modules (called LD-Spline) that allowed us to easily define LD blocks within a database query.

Once you learn the basics of defining tables and loading data, you can start to join tables together, matching them on a common field.  This is where the true power of a database system lies.  Suppose you have two sets of results from a PLINK analysis, one from a discovery dataset and another from a replication.  Rather than clumsily matching two sets of results within a spreadsheet application, a few simple queries within MySQL will tell you which SNPs are in common between the two sets, which were not found in the replication set, which SNPs were significant in the first set but not the second, etc.  

The concept that makes these operations work is the idea of a primary key.  A primary key is some field of a dataset that uniquely identifies each row of the table/dataset.  In the above example of PLINK results, a good primary key might be the RS number of the SNP.  You can also uniquely identify rows based on two columns, a concept known as a composite key – for example, the chromosome AND position of a SNP.  Establishing a primary key allows MySQL to keep data stored in a sorted order and allows the matching operations for table joins to be performed much faster. 

Having this sorted order from a primary key prevents MySQL from having to scan an entire table to find a specific value.  Much like the index of a book, a primary key lets MySQL find a value within a table very quickly.  If a table is small, having a primary key is not as critical; the computer can quickly scan the entire contents of the table for any query.  If the table is large, however, a full scan of the entire table could be a costly operation, and the number of table scans required increases when doing a join.  For example, if we join tables for our discovery and replication results sets, the database system will take the RS number for each entry from the discovery table and attempt to find a matching RS number in the replication table.  If the replication table has the RS number as a primary key, the database system can very quickly find this entry. There is a fantastic post on the various types of database joins here.

Let's start by creating our database tables.  A typical PLINK association output contains 12 columns (CHR, SNP, BP, A1, TEST, NMISS, OR, SE, L95, U95, STAT, P).  In the tables below, we've established the SNP column as the primary key.  Recall that the primary key must uniquely identify each row of the table, but PLINK will sometimes report multiple TEST rows per SNP.  If this is the case, we need to either establish a composite key using PRIMARY KEY (`snp`,`test`), or simply eliminate the extra rows from the data file using an AWK command, as shown below.
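For example, to keep just the header line and the additive-model rows (TEST is column 5 in the output listed above; the file name is an example):

awk 'NR == 1 || $5 == "ADD"' input-file.assoc.logistic > discovery-add.assoc.logistic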

CREATE TABLE `discovery` (
        `chr` varchar(2),
        `snp` varchar(32),
        `bp` int,
        `a1` varchar(1),
        `test` varchar(3),
        `nmiss` int,
        `or` float,
        `se` float,
        `l95` float,
        `u95` float,
        `stat` float,
        `p` float,
        PRIMARY KEY (`snp`)
);

CREATE TABLE `replication` (
        `chr` varchar(2),
        `snp` varchar(32),
        `bp` int,
        `a1` varchar(1),
        `test` varchar(3),
        `nmiss` int,
        `or` float,
        `se` float,
        `l95` float,
        `u95` float,
        `stat` float,
        `p` float,
        PRIMARY KEY (`snp`)
);

Before loading our data into these tables, a little pre-processing is helpful.  To ensure that results are easy to read on the screen, PLINK developers used leading spaces in the column format for many PLINK outputs.  These make loading the results into a database difficult.  We can resolve this by running a simple SED command:
sed -r -e 's/\s+/\t/g' -e 's/^\t//' input-file.assoc.logistic > discovery.load
This converts each run of whitespace to a single tab, eliminates the leading tab, and writes the results to a new file, discovery.load.  Now let's load this file into our table, and repeat the procedure for our replication file.

LOAD DATA LOCAL INFILE '{PathToFile}/discovery.load' INTO TABLE 
discovery FIELDS TERMINATED BY '\t' IGNORE 1 LINES;
Now we should have two MySQL database tables with the discovery and replication results sets loaded into them.  We can view their contents with a simple select statement.  Then, finally, we can join these two tables to easily compare the results from the discovery and replication analyses.
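For example, a quick sanity check on the loaded tables might look like this (LIMIT keeps the output short):

SELECT COUNT(*) FROM discovery;
SELECT COUNT(*) FROM replication;
SELECT * FROM discovery LIMIT 5;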


SELECT * FROM discovery INNER JOIN replication ON 
discovery.snp = replication.snp;
The syntax is simple: select a set of fields -- in this case all of them (represented by the *) -- from the first table (discovery), matching each row from this table to a row in the second table (replication) where the discovery SNP equals the replication SNP.  MySQL also supports table aliases, which can make these queries a bit easier to write.  An alias is simply a label specified after a table name that can be used in the rest of the query in place of the full table name.  For example, in the query below, we use a for the discovery table and b for the replication table.

SELECT * FROM discovery a INNER JOIN replication b ON 
a.snp = b.snp;
With practice and additional data, join operations can be used to annotate results by gene or region, and to match these to results from other studies, such as the NHGRI GWAS catalog.
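Going back to the comparisons mentioned at the top of this post, a LEFT JOIN finds discovery SNPs with no match in the replication set, and a WHERE clause on the two p-value columns finds SNPs significant in one set but not the other.  A sketch, with arbitrary example thresholds:

SELECT a.snp, a.p FROM discovery a LEFT JOIN replication b
ON a.snp = b.snp WHERE b.snp IS NULL;

SELECT a.snp, a.p AS p_discovery, b.p AS p_replication
FROM discovery a INNER JOIN replication b ON a.snp = b.snp
WHERE a.p < 0.05 AND b.p >= 0.05;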


Wednesday, November 6, 2013

A Mitochondrial Manhattan Plot

Manhattan plots have become the standard way to visualize results of genetic association studies, allowing the viewer to instantly see significant results in the rough context of their genomic position.  Manhattan plots are typically shown on a linear X-axis (although the circos package can be used for radial plots), consistent with the linear representation of the genome in online genome browsers.  Genetic studies often overlook the other genome within all our cells: the mitochondrial genome. This circular molecule has been shown to be associated (albeit inconsistently) with many disease traits, and functional variants from this genome are now included in most genotyping platforms.

Thanks to the clever work of several graduate students in the Vanderbilt Center for Human Genetics Research (most notably Ben Grady), mitochondrial genetic associations can be visualized in the context of key regions of the mitochondrial genome using a radial plot in the R package ggplot2.

To make this plot, download the code and run the script (alternatively open the script in R and run interactively):

On the command line:

git clone git@github.com:stephenturner/solarplot.git
cd solarplot
R CMD BATCH solarplot.R


GitHub: Mitochondrial solar plot code and data

Thursday, May 30, 2013

PLATO, an Alternative to PLINK

Almost since the beginning of genome-wide association studies, the PLINK software package (developed by Shaun Purcell’s group at the Broad Institute and MGH) has been the standard for manipulating the large-scale data produced by these studies.  Over the course of its development, numerous features and options were added to enhance its capabilities, but it is best known for the core functionality of performing quality control and standard association tests.

Nearly 10 years ago (around the time PLINK was just getting started), the CHGR Computational Genomics Core (CGC) at Vanderbilt University started work on a similar framework for implementing genotype QC and association tests.  This project, called PLATO, has stayed active primarily to provide functionality and control that (for one reason or another) is unavailable in PLINK.  We have found it especially useful for processing ADME and DMET panel data – it supports QC and association tests of multi-allelic variants.    

PLATO runs via a command-line interface, but accepts a batch file that allows users to specify an order of operations for QC filtering steps.  When running multiple QC steps in a single run of PLINK, the order of application is hard-coded and not well documented.  As a result, users wanting this level of control must run a sequence of PLINK commands, generating new data files at each step, which leads to longer compute times and greater disk usage.  PLATO also has a variety of data reformatting options for other genetic analysis programs, making it easy to run EIGENSTRAT, for example.

The detail of QC output from each of the filtering steps is much greater in PLATO, allowing output per group (founders only, parents only, etc.), and giving more details on why samples fail sex checks, Hardy-Weinberg checks, and Mendelian consistency checks, facilitating deeper investigation of these errors.  And with family data, disabling samples due to poor genotype quality retains pedigree information useful for phasing and transmission tests. Full documentation and download links can be found here (https://chgr.mc.vanderbilt.edu/plato).  Special thanks to Yuki Bradford in the CGC for her thoughts on this post.

Monday, December 31, 2012

Getting Genetics Done 2012 In Review

Here are links to all of this year's posts (excluding seminar/webinar announcements), with the most visited posts in bold italic. As always, you can follow me on Twitter for more frequent updates. Happy new year!

New Year's Resolution: Learn How to Code

Annotating limma Results with Gene Names for Affy Microarrays

Your Publications (with PMCID) as a PubMed Query

Pathway Analysis for High-Throughput Genomics Studies

find | xargs ... Like a Boss

Redesign by Subtraction

Video Tip: Convert Gene IDs with Biomart

RNA-Seq Methods & March Twitter Roundup

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

Video Tip: Use Ensembl BioMart to Quickly Get Ortholog Information

Stepping Outside My Open-Source Comfort Zone: A First Look at Golden Helix SVS

How to Stay Current in Bioinformatics/Genomics

The HaploREG Database for Functional Annotation of SNPs

Identifying Pathogens in Sequencing Data

Browsing dbGAP Results

Fix Overplotting with Colored Contour Lines

Plotting the Frequency of Twitter Hashtag Usage Over Time with R and ggplot2

Cscan: Finding Gene Expression Regulators with ENCODE ChIP-Seq Data

More on Exploring Correlations in R

DESeq vs edgeR Comparison

Learn R and Python, and Have Fun Doing It

STAR: ultrafast universal RNA-seq aligner

RegulomeDB: Identify DNA Features and Regulatory Elements in Non-Coding Regions

Copy Text to the Local Clipboard from a Remote SSH Session

Differential Isoform Expression With RNA-Seq: Are We Really There Yet?

Wednesday, June 27, 2012

Browsing dbGAP Results


Thanks to the excellent work of Lucia Hindorff and colleagues at NHGRI, the GWAS catalog provides a great reference for the cumulative results of GWAS for various phenotypes.  Anyone familiar with GWAS also likely knows about dbGaP – the NCBI repository for genotype-phenotype relationships – and the wealth of data it contains.  While dbGaP is often thought of as a way to get access to existing genotype data, analysis results are often deposited into dbGaP as well.  Individual-level data (like genotypes) are generally considered “controlled access”, requiring special permission to retrieve or use.  Summary-level data, such as association p-values, are a bit more accessible.  There are two tools available from the dbGaP website: the Association Results Browser and the Phenotype-Genotype Integrator (PheGenI).  These tools provide a search interface for examining previous GWAS associations.

The Association Results Browser provides a simple table listing of associations, searchable by SNP, gene, or phenotype.  It contains the information from the NHGRI GWAS catalog, as well as additional associations from dbGaP-deposited studies.  I’ve shown an example below for multiple sclerosis.  You can restrict the search to the dbGaP-specific results by changing the “Source” selection.  If you are looking for the impact of a SNP, this is a nice supplement to the catalog.  Clicking on a p-value brings up the GaP browser, which provides a more graphical (but perhaps less useful) view of the data.



The PheGenI tool provides similar search functionality, but attempts to provide phenotype categories rather than more specific phenotype associations.  Essentially, phenotype descriptions are linked to MeSH terms to provide categories such as “Chemicals and Drugs”, or “Hemic and Lymphatic Diseases”.  PheGenI seems most useful if searching from the phenotype perspective, while the association browser seems better for SNP or Gene searches.  All these tools are under active development, and I look forward to seeing their future versions.

Thursday, October 27, 2011

A New Dimension to Principal Components Analysis



In general, the standard practice for correcting for population stratification in genetic studies is to use principal components analysis (PCA) to categorize samples along different ethnic axes.  Price et al. published on this in 2006, and since then PCA plots have been a common component of many published GWAS studies.  One key advantage of using PCA for ethnicity is that each sample is given coordinates in a multidimensional space corresponding to the varying components of their ethnic ancestry.  Using either full GWAS data or a set of ancestry informative markers (AIMs), PCA can be easily conducted using available software packages like EIGENSOFT or GCTA. HapMap samples are sometimes included in the PCA analysis to provide a frame of reference for the ethnic groups.
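As a rough sketch, generating PCs with GCTA takes two commands -- one to build the relationship matrix and one to extract components (flag names as I recall them from the GCTA documentation; file names are examples):

gcta64 --bfile mydata --autosome --make-grm --out mydata_grm
gcta64 --grm mydata_grm --pca 5 --out mydata_pca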

Once computed, each sample will have values that correspond to a position in the new coordinate system, effectively clustering samples together by ethnic similarity.  The results of this analysis are usually plotted/visualized to identify ethnic outliers or to simply examine the structure of the data.  A common problem, however, is that it may take more than the first two principal components to identify groups.

To illustrate, I will plot some PCs generated from 125 AIMs for a recent study of ours.  I generated these using the GCTA software and loaded the top 5 PCs into R using the read.table() function.  I loaded the top 5, but for continental ancestry, I've found that the top 3 are usually enough to separate groups.  The values look something like this:
    new_ruid        pc1            pc2               pc3              pc4               pc5
1    11596  4.10996e-03 -0.002883830  0.003100840 -0.00638232  0.00709780
2     5415  3.22958e-03 -0.000299851 -0.005358910  0.00660643  0.00430520
3    11597 -4.35116e-03  0.013282400  0.006398130  0.01721600 -0.02275470
4     5416  4.01592e-03  0.001408180  0.005077310  0.00159497  0.00394816
5     3111  3.04779e-03 -0.002079510 -0.000127967 -0.00420436  0.01257460
6    11598  6.15318e-06 -0.000279919  0.001060880  0.00606267  0.00954331
I loaded this into a dataframe called pca, so I can plot the first two PCs using this command:
plot(pca$pc1, pca$pc2)

We might also want to look at the next two PCs:
plot(pca$pc2, pca$pc3)

It's probably best to look at all of them together:
pairs(pca[2:4])


So this is where my mind plays tricks on me.  I can't make much sense out of these plots -- there should be four ethnic groups represented, but it's hard to see who goes where.  To look at all of these dimensions simultaneously, we need a 3D plot.  Now 3D plots (especially 3D scatterplots) aren't highly regarded -- in fact I hear that some poor soul at the University of Washington gets laughed at for showing his 3D plots -- but in this case I found them quite useful.

Using a library called rgl, I generated a 3D scatterplot like so:
library(rgl)
plot3d(pca[2:4])


Now, using the mouse I could rotate and play with the cloud of data points, and it became more clear how the ethnic groups sorted out.  Just to double check my intuition, I ran a model-based clustering algorithm (mclust) on the data.  Different parameters obviously produce different cluster patterns, but I found that using an "ellipsoidal model with equal variances" and a cluster size of 4 identified the groups I thought should be there based on the overlay with the HapMap samples.

library(mclust)
fit <- Mclust(pca[2:4], G=4, modelNames = "EEV")
plot3d(pca[2:4], col = fit$classification)
Basically, the red sphere corresponds to the European descent group, the green indicates the admixed African American group, the black corresponds to the Hispanic group, and the blue identifies the Asian descent group.  We are still a bit confused as to why the Asian descent samples don't form a more compact cluster -- it may be due to relatively poor performance of these AIMs in Asian descent groups.   Whatever the case, you might notice several individuals falling either outside a clear cluster or at the interface between two groups.  The ethnic assignment for these individuals is questionable, but the clustering algorithm gives us a very nice measure of cluster assignment uncertainty.  We can plot this like so:
plot(pca[2:3], cex = fit$uncertainty*10)
I had to scale the uncertainty value by 10 to make the questionable points more visible in this plot, shown as the hollow circles.  We will likely drop these samples from any stratified analyses.  We can export the cluster assignments by accessing the fit$classification element, and we have our samples assigned to an ethnic group.
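To write the assignments out (a minimal sketch; the ID column name matches the data frame shown above):

pca$cluster <- fit$classification
pca$uncertainty <- fit$uncertainty
write.table(pca[, c("new_ruid", "cluster", "uncertainty")], file="cluster_assignments.txt", sep="\t", quote=FALSE, row.names=FALSE)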



Tuesday, October 18, 2011

My thoughts on ICHG 2011

I’m a bit exhausted from a week of excellent science at ICHG. First, let me say that Montreal is a truly remarkable city with fantastic food and a fascinating blend of architectural styles, all making the meeting a fun place to be…. Now on to the genomics – I’ll recap a few of the most exciting sessions I attended. You can find a live-stream of tweets from the meeting by searching the #ICHG2011 and #ICHG hashtags.


On Wednesday, Marylyn Ritchie (@MarylynRitchie) and Nancy Cox organized “Beyond Genome-wide association studies”.  Nancy Cox presented some ideas on how to integrate multiple “intermediate” associations for SNPs, such as expression QTLs and newly discovered protein QTLs (more on pQTLs later). This approach, which she called a Functional Unit Analysis, would group signals together based on the genes they influence. Nicholas Schork presented some nice examples of the pros and cons of sequence-level annotation algorithms.  Trey Ideker gave a very nice talk illustrating some of the properties of epistasis in yeast protein interaction networks. One of the more striking points he made was that epistasis tends to occur between full protein complexes rather than within elements of the complexes themselves. Marylyn Ritchie presented the ideas behind her ATHENA software for machine learning analysis of genetic data, and Manuel Mattheisen from Tim Becker’s group presented the methods in their INTERSNP software for doing large-scale interaction analysis. What was most impressive about this session is that there were clear attempts to incorporate underlying biological complexity into data analysis.

On Thursday, I attended the second Statistical Genetics session, “Expanding Genome-wide Association Studies”, organized by Saurabh Ghosh and Daniel Shriner. Having recently attended IGES, I feel pretty “up” on newer analysis techniques, but this session had a few talks that sparked my interest. The first three talks were related to haplotype phasing and the issues surrounding computational accuracy and speed. The basic goal of all these methods is to efficiently estimate genotypes for a common set of loci for all samples of a study using a set of reference haplotypes, usually from the HapMap or 1000 Genomes data. Despite these advances, it seems like phasing haplotypes for thousands of samples is still a massive undertaking that requires a high-performance computing cluster. There were several talks about ongoing epidemiological studies, including the Kaiser Permanente UCSF cohort. Neil Risch presented an elegant study design implementing four custom GWAS chips for the four targeted populations. Looks like the data hasn't started to flow from this yet, but when it does we’re sure to learn about lots of interesting ethnic-specific disease effects. My good friend and colleague Dana Crawford presented an in silico GWAS study of hypothyroidism. In her best NPR voice, Dana showed how electronic medical records with GWAS data in the eMERGE network can be re-used to construct entirely new studies nested within the data collected for other specific disease purposes. Her excellent postdoc, Logan Dumitrescu, presented several gene-environment interactions between lipid levels and vitamins A and E from Dana’s EAGLE study. Finally, Paul O’Reilly presented a cool new way to look at multiple phenotypes by essentially flipping a typical regression equation around, estimating coefficients that relate each phenotype in a study to a single SNP genotype as an outcome. This rather clever approach, called MultiPhen, is similar to log-linear models I’ve seen used for transmission-based analysis, and allows you to model the “interaction” among phenotypes in much the same way you would look at SNP interactions.

By far the most interesting talks of the meeting (for me) were in the Genomics section on Gene Expression, organized by Tomi Pastinen and Mark Corbett. Chris Mason started the session off with a fantastic demonstration of the power of RNA-seq. Examining transcriptomes of 14 non-human primate species, his group validated many of the computational predictions in the AceView gene build, and illustrated that most “exome” sequencing is probably examining less than half of all transcribed sequences. Rupali Patwardhan talked about a system for examining the impact of promoter and enhancer mutations in whole mice, essentially using mutagenesis screens to localize these regions. Ron Hause presented work on the protein QTLs that Nancy Cox alluded to earlier in the conference. Using a high-throughput form of Western blots, they systematically examined levels of over 400 proteins in the Yoruba HapMap cell lines. They also illustrated that only about 50% of eQTLs identified in these lines actually alter protein levels. Stephen Montgomery spoke about the impact of rare genetic variants within a transcript on transcript levels. Essentially he showed an epistatic effect on expression, where transcripts with deleterious alleles are less likely to be expressed – an intuitive and fascinating finding, especially for those considering rare-variant analysis. Athma Pai presented a new QTL that influences mRNA decay rates. By measuring multiple time points using RNA-seq, she found individual-level variants that alter decay, which she calls dQTLs. Veronique Adoue looked at cis-eQTLs relative to transcription factor binding sites using ChIP, and Alfonso Buil showed how genetic variants influence gene expression networks (or correlation among gene expression) across tissue types.

I must say that despite all the awesome work presented in this session, Michael Snyder stole the show with his talk on the “Snyderome” – his own personal -omics profile collected over 21 months. His whole genome was sequenced by Complete Genomics and processed using Rong Chen and Atul Butte’s risk-o-gram to quantify his disease risk. His profile predicted increased risk of T2D, so he began collecting glucose measures and, lo and behold, he saw a sustained spike in blood glucose levels a few days after a common cold. His interpretation was that an environmental stress knocked him into a pseudo-diabetic state, and his transcriptome and proteome results corroborated this idea. Granted, this is an N of 1, and there is still lots of work to be done before this type of analysis revolutionizes medicine, but the take-home message is salient – multiple -omics are better than one, and everyone’s manifestation of a complex disease is different. This was truly thought-provoking work, and it nicely closed an entire session devoted to understanding the intermediate impact of genetic variants to better understand disease complexity.

This is just my take of a really great meeting -- I'm sure I missed lots of excellent talks.  If you saw something good please leave a comment and share!

Monday, August 22, 2011

Estimating Trait Heritability from GWAS Data

Peter Visscher and colleagues have recently published a flurry of papers employing a new software package called GCTA to estimate the heritability of traits using GWAS data (GCTA stands for Genome-wide Complex Trait Analysis -- clever acronymity!). The tool, supported (and presumably coded) by Jian Yang, is remarkably easy to use, based in part on the familiar PLINK command-line interface. The GCTA homepage provides an excellent walk-through of the available options.

The basic idea is to use GWAS data to estimate the degree of "genetic sharing" or relatedness among the samples, computing what the authors call a genetic relationship matrix (GRM). The degree of genetic sharing among samples is then related to the amount of phenotypic sharing using restricted maximum likelihood analysis (REML). The result is an estimate of the variance explained by the SNPs used to generate the GRM. Full details of the stats along with all the gory matrix notation can be found in their software publication.
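In practice this boils down to two commands -- a minimal sketch, with flag names as I recall them from the GCTA documentation and example file names:

gcta64 --bfile mydata --autosome --make-grm --out mydata_grm
gcta64 --grm mydata_grm --pheno phenotypes.txt --reml --out mydata_reml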

The approach has been applied to several disorders studied by the WTCCC and to a recent study of human height. Interestingly, the developers have also used the approach to partition the trait variance across chromosomes, resulting in something similar to population-based variance-components linkage analysis. The approach works for both quantitative and dichotomous traits; however, the authors warn that variance estimates of dichotomous trait liability are influenced by genotyping artifacts.

The package also includes several other handy features, including a relatively easy way to estimate principal components for population structure correction, a GWAS simulation tool, and a regression-based LD mapping tool. Download and play -- a binary is available for Linux, MacOS, and DOS/Windows.

Tuesday, June 7, 2011

Replication of >180 Genetic Associations with Self-Report Medical Data

DNA genotyping and sequencing are getting cheaper every day. As Oxford Nanopore CTO Clive Brown recently discussed at Genomes Unzipped, when the cost of a full DNA sequence begins to fall below $1000, the value of having that information far outweighs the cost of data generation.

Participant collection and ascertainment, however, isn't getting cheaper any time soon, spurring a burgeoning interest in using DNA biobanks and electronic medical records (EMRs) for genomics research (reviewed here). In fact, this is exactly the focus of the eMERGE network - a consortium of five sites with biobanks linked to electronic medical records for genetic research. The first order of business for the eMERGE network was assessment - can DNA biobanks + EMRs be used for genetic research? This question was answered with a demonstration project using Vanderbilt University's BioVU biobank+EMR. Here, 21 previously reported associations to five complex diseases were tested for association to electronically abstracted phenotypes from BioVU's EMR. This forest plot shows that for the 21 previously reported associations (red), five replicated at a nominal significance threshold, and for the rest, the calculated odds ratios (blue) trended in the same direction as the reported association.



While electronic phenotyping is much cheaper than ascertainment in the traditional sense, it can still be costly and labor intensive, involving iterative cycles of manual medical record chart review followed by refinement of natural language processing algorithms. In many instances, self-report can be comparably accurate, and much easier to obtain (for example, compare the eMERGE network's hypothyroidism phenotyping algorithm with simply asking the question: "Have you ever been diagnosed with Hashimoto's Thyroiditis?").

This is the approach to genetic research 23andMe is taking. Joyce Tung presented some preliminary results at last year's ASHG conference, and now you can read the preprint of the research paper online at Nature Precedings - "Efficient Replication of Over 180 Genetic Associations with Self-Reported Medical Data". In this paper the authors amassed self-reported data from >20,000 genotyped 23andMe customers on 50 medical phenotypes and attempted to replicate known findings from the GWAS catalog. Using only self-reported data, the authors were able to replicate at a nominal significance level 75% of the previously reported associations that they were powered to detect. Figure 1 from the preprint is similar to the figure above, where blue X's are prior GWAS hits and black dots and lines represent their ORs and 95% CIs:



One might ask whether there is any confounding due to the fact that 23andMe customers can view trait/disease risks before answering questions (see this paper and this discussion by Luke Jostins). The authors investigated this and did not observe a consistent or significant effect of seeing genetic risk results before answering questions.  There's also a good discussion regarding SNPs that failed to replicate, and a general discussion of using self-report from a recontactable genotyped cohort for genetic studies.

Other than 23andMe's customer base, I can't think of another genotyped, recontactable cohort whose members have an incentive to fill out research surveys for this kind of study. But this team has shown, here and in the past, that this model for performing genetic research works and is substantially cheaper than traditional ascertainment or even mining electronic medical records. I look forward to seeing more research results from this group!

Efficient Replication of Over 180 Genetic Associations with Self-Reported Medical Data

Wednesday, June 1, 2011

Resources for Pathway Analysis, Functional Annotation, and Interactome Reconstruction with Omics Data

I just read a helpful paper on pathway analysis and interactome reconstruction:

Tieri, P., Fuente, A. D., Termanini, A., & Franceschi, C. (2011). Integrating Omics Data for Signaling Pathways, Interactome Reconstruction, and Functional Analysis. In Bioinformatics for Omics Data, Methods in Molecular Biology, vol. 719. doi: 10.1007/978-1-61779-027-0. (PubMed).

The authors describe how each of these tools can be used in pathway analysis and functional annotation, and give an example of using several of these resources to map the interactome of the transcription factor NF-kappa-B.

While a more extensive list of hundreds of tools and databases for biological pathway analysis can be found at Pathguide, this looks like a good starting point.

 
APID Agile Protein Interaction DataAnalyzer – http://bioinfow.dep.usal.es/apid/index.htm
Ariadne Genomics Pathway Studio – http://www.ariadnegenomics.com/products/pathway-studio
BIND Biomolecular Interaction Network Database – http://www.bind.ca
BioGRID The Biological General Repository for Interaction Datasets – http://www.thebiogrid.org
BiologicalNetworks – http://biologicalnetworks.net
CellDesigner – http://www.celldesigner.org
DIP Database of Interacting Proteins – http://dip.doe-mbi.ucla.edu/dip/Main.cgi
GenMAPP Gene Map Annotator and Pathway Profiler – http://www.genmapp.org
HPRD Human Protein Reference Database – http://www.hprd.org
HUBBA Hub objects analyzer – http://hub.iis.sinica.edu.tw/Hubba
Ingenuity Systems – http://www.ingenuity.com
KEGG Kyoto Encyclopedia of Genes and Genomes – http://www.genome.jp/kegg
MINT the Molecular INTeraction database – http://mint.bio.uniroma2.it/mint
NCI-Nature Pathway Interaction Database – http://pid.nci.nih.gov
Pathguide: the pathway resource list – http://www.pathguide.org
Pathway Commons – http://www.pathwaycommons.org
R Project for Statistical Computing – http://www.r-project.org
SBW Systems Biology Workbench – http://sbw.sourceforge.net
TRANSFAC & TRANSPATH – http://www.gene-regulation.com
TRED Transcriptional Regulatory Element Database – http://rulai.cshl.edu/cgi-bin/TRED

Integrating Omics data for signaling pathways, interactome reconstruction, and functional analysis.

Thursday, May 19, 2011

More Command-Line Text Munging Utilities

In a previous post I linked to gcol as a quick and intuitive alternative to awk. I just stumbled across yet another set of handy text file manipulation utilities from the creators of the BEAGLE software for GWAS data imputation and analysis. In addition to several command line utilities for converting and formatting BEAGLE files, there are several tools for doing basic text processing tasks on the command line:

  • changecolumn.jar - replace values in a column of an input file.
  • changeline.jar - replace values in a line of an input file.
  • cut.jar - extract columns from a file.
  • filtercolumns.jar - filters columns of input data according to the values in a line.
  • filterlines.jar - filters lines of input data according to the values in a column.
  • paste.jar - pastes together files that have shared initial columns followed by data columns.
  • transpose.jar - transposes rows and columns of an input file. 
Much of what these tools do can probably be emulated with some creative use of Unix commands and pipes. But since these are all Java archives, they should work on any platform, not just Unix/Linux. Hit the link below to see the full list and documentation.
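For example, rough equivalents of cut.jar and filterlines.jar using standard Unix tools might look like this (cut assumes tab-delimited input; file names and the filter value are placeholders):

cut -f1,3 input.txt > columns.txt
awk '$2 == "ADD"' input.txt > filtered.txt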

BEAGLE Utilities for text manipulation

Wednesday, May 4, 2011

PLINK/SEQ for Analyzing Large-Scale Genome Sequencing Data

PLINK/SEQ is an open source C/C++ library for analyzing large-scale genome sequencing data. The library can be accessed via the pseq command line tool, or through an R interface. The project is developed independently of PLINK, but its syntax will be familiar to PLINK users.

PLINK/SEQ boasts an impressive feature set for a project still in the beta testing phase. It supports several data types (multiallelic, phased, imputation probabilities, indels, and structural variants), and can handle datasets much larger than what can fit into memory. PLINK/SEQ also comes bundled with several reference databases of gene transcripts and sequence & variation projects, including dbSNP and 1000 Genomes Project data.

As with PLINK, the documentation is good, and there's a tutorial using 1000 Genomes Project data.

PLINK/SEQ - A library for the analysis of genetic variation data

Monday, April 25, 2011

Annotated Manhattan plots and QQ plots for GWAS using R, Revisited

**** UPDATE, May 15 2014 *****
The functions described here have now been wrapped into an R package. View the updated blog post or see the online package vignette for how to install and use. If you'd still like to use the old code described here, you can access this at version 0.0.0 on GitHub. The code below likely won't work.
*****************************

Last year I showed you how to create manhattan plots, and later how to highlight regions of interest, using ggplot2 in R. The code was slow, required a lot of memory, and was difficult to maintain and modify.

I finally found time to rewrite the code using base graphics rather than ggplot2. The code is now much faster, and if you're familiar with base R's plot options and graphical parameters, most of these can now be passed to the functions to tweak the plots' appearance. The code also behaves differently depending on whether you have results for one or more than one chromosome.

Here's a quick demo.

First, either copy and paste the code from GitHub, or run the following commands in R to download and source the code from GitHub (you can't directly read from https in R, so you have to download the file first, then source it). Note that the command is different on Mac vs. Windows.

Download the function code on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/qqman.r", destfile="./qqman.r", method="curl")

Download the function code on Windows, leave out the method="curl" argument:

download.file("https://raw.github.com/stephenturner/qqman/master/qqman.r", destfile="./qqman.r")

Now, source the script containing the functions.

source("./qqman.r")


Next, load some GWAS results, and take a look at the relevant columns (same as above, download first then read locally from disk). This is standard output from PLINK's --assoc option.

Download example data on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/plink.assoc.txt.gz", destfile="./plink.assoc.txt.gz", method="curl")

Download example data on Windows:

download.file("https://raw.github.com/stephenturner/qqman/master/plink.assoc.txt.gz", destfile="./plink.assoc.txt.gz")

Read in the sample results, and take a look at the first few lines:

results = read.table(gzfile("./plink.assoc.txt.gz"), header=T)

head(subset(results, select=c(SNP, CHR, BP, P)))

The manhattan function assumes you have columns named SNP, CHR, BP, and P, corresponding to the SNP name (rs number), chromosome number, genomic coordinate, and p-value. Missing values (where regression failed to converge, for instance) should be recoded to NA before reading into R; a quick sed or awk command can do this. Here's what the data looks like:

         SNP CHR        BP       P
1 rs10495434   1 235800006 0.62220
2  rs6689417   1  46100028 0.06195
3  rs3897197   1 143700035 0.10700
4  rs2282450   1 202300047 0.47280
5   rs567279   1  66400050      NA
6 rs11208515   1  64900051 0.53430
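As a sketch of that recoding -- assuming missing values appear in your file as a bare ".", which you should verify against your own output -- an awk one-liner replaces any such field with NA:

awk '{ for (i=1; i<=NF; i++) if ($i == ".") $i = "NA"; print }' plink.assoc > plink.assoc.recoded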


Now, create a basic manhattan plot (click the image to enlarge):

manhattan(results)



If you type args(manhattan) you can see the options you can set. Here are a few:

colors: this is a character vector specifying the colors to cycle through for coloring each point. Here's a PDF chart of R's color names.

ymax: this is the y-axis limit. If ymax="max" (default), the y-axis will always be a little bit higher than the most significant -log10(p-value). Otherwise you can set this value yourself.

cex.x.axis: this can be used to shrink the x-axis labels by setting this value less than 1. This is handy if some of the tick labels aren't showing up because the plot region is too small.

*** Update March 9 2012*** cex.x.axis is deprecated. To change the x-axis size, use the default base graphics argument cex.axis.

limitchromosomes: you can limit which chromosomes you want to display. By default this restricts the plot to chromosomes 1-23 (23 being the X chromosome).

suggestiveline and genomewideline: set these to FALSE if you don't want threshold lines, or change the thresholds yourself.

annotate: by default this is undefined. If you supply a character vector of SNP names (e.g. rs numbers), any SNPs in the results data frame that also show up here will be highlighted in green by default. See the example below.

... : The dot-dot-dot means you can pass most other plot or graphical parameters to these functions (e.g. main, cex, pch, etc).

Make a better-looking manhattan plot. Change the plot colors and point shape, and remove the threshold lines:

manhattan(results, colors=c("black","#666666","#CC6600"), pch=20, genomewideline=F, suggestiveline=F)


Now, read in a text file with SNP names that you want to highlight, then make a manhattan plot highlighting those SNPs, and give the plot a title:

Download a SNP list on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/snps.txt", destfile="./snps.txt", method="curl")

Download a SNP list on Windows:

download.file("https://raw.github.com/stephenturner/qqman/master/snps.txt", destfile="./snps.txt")

Read in the SNP list, and create an annotated plot:

snps_to_highlight = scan("./snps.txt", character())

manhattan(results, annotate=snps_to_highlight, pch=20, main="Manhattan Plot")



Next, zoom in and plot only the results for chromosome 11, still highlighting those SNPs. Notice that the x-axis changes from chromosome to genomic coordinate.

manhattan(subset(results, CHR==11), pch=20, annotate=snps_to_highlight, main="Chr11")



Finally, make a quantile-quantile plot of the p-values by passing the qq() function a vector of p-values:

qq(results$P)


Perhaps we should have made the qq-plot first, as it looks like we might have some unaccounted-for population stratification or other bias.

The code should run much faster and use less memory than before. All the old functions that use ggplot2 are still available, now prefixed with "gg." Please feel free to use, modify, and redistribute, but kindly link back to this post. The function source code, data, SNPs of interest, and example code used in this post are all available in the qqman GitHub repository.

Tuesday, April 12, 2011

Using R + Bioconductor to Get Flanking Sequence Given Genomic Coordinates

I'm working on a project using next-gen sequencing to fine-map a genetic association in a gene region. Now that I've sequenced the region in a small sample, I'm picking SNPs to genotype in a larger sample. When designing the genotyping assay the lab will need flanking sequence.

This is easy to get for SNPs in dbSNP, but what about novel SNPs? Specifically, given a list of genomic positions where about half of them are potentially novel SNPs in the population I'm studying, how can I quickly and automatically fetch flanking sequence from the reference sequence? I asked how to do this on previously mentioned Biostar, and got several answers within minutes.

Fortunately this can easily be done using R and Bioconductor. For several reasons BioMart wouldn't do what I needed it to do, and I don't know the UCSC/Ensembl schemas well enough to use a query or the Perl API.

The R code below (modified from Adrian's answer on BioStar) does the trick:
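A minimal sketch of that approach using the BSgenome package -- the genome build (hg19), the 50-bp flank, and the input coordinates are all assumptions for illustration:

library(BSgenome.Hsapiens.UCSC.hg19)

# Hypothetical input: one SNP per row, 1-based genomic coordinates
snps <- data.frame(chrom = c("chr2", "chr17"), position = c(12345678, 23456789))

# Fetch 50 bp of reference sequence on either side of each position
flank <- 50
seqs <- getSeq(Hsapiens, names = snps$chrom,
               start = snps$position - flank,
               end   = snps$position + flank)
as.character(seqs)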

Tuesday, March 29, 2011

Prune GWAS data in R

Hansong Wang, our biostats professor here at the Hawaii Cancer Center, generously gave me some R code that goes through a SNP annotation file (i.e. a mapfile) and selects SNPs that are at least a certain specified distance apart.  You might want to do this if you're picking a subset of SNPs for PCA, for instance.  PLINK has an LD pruning feature, but if you can't load your data into PLINK, this poor man's pruning based on physical distance (not LD) is a quick solution.

Provide the function with a data frame containing column names "chrom" and "position," where the SNPs are ordered by chromosome and position. By default the function selects SNPs that are at least 100kb apart, but you can change this with the optional second argument. The function returns the row indices corresponding to the SNPs you want to keep. Then simply subset your dataset, selecting only those row indices and all columns.
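A minimal sketch of the idea (my own version, not Hansong's original code; it assumes the data frame is sorted by chromosome and position, as described above):

prune.by.distance <- function(map, min.dist = 1e5) {
    keep <- integer(0)
    last.chrom <- NA
    last.pos <- -Inf
    for (i in seq_len(nrow(map))) {
        # Reset the distance tracker at the start of each chromosome
        if (!identical(map$chrom[i], last.chrom)) {
            last.chrom <- map$chrom[i]
            last.pos <- -Inf
        }
        # Keep this SNP only if it's far enough from the last kept SNP
        if (map$position[i] - last.pos >= min.dist) {
            keep <- c(keep, i)
            last.pos <- map$position[i]
        }
    }
    keep
}

# Usage: keep SNPs at least 100kb apart, then subset your data
# keep.idx <- prune.by.distance(mapfile)
# mapfile.pruned <- mapfile[keep.idx, ]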

Friday, March 18, 2011

New GenABEL Website, and more *ABEL software

The *ABEL suite of R packages and software for genetic analysis has grown substantially since the appearance of GenABEL and the previously mentioned ProbABEL R packages. There are now a handful of useful R packages and other software utilities facilitating genome-wide association studies, analysis of imputed data, meta-analysis, efficient data storage, prediction, parallelization, and mixed model methods. The new GenABEL website also has prominent links to extensive documentation in manuals and tutorials. Finally, there's a new forum you can visit if you can't find an answer in the manuals or tutorials.

The GenABEL Project for Statistical Genetics Analysis
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.