Thursday, May 15, 2014

qqman: an R package for creating Q-Q and manhattan plots from GWAS results

Three years ago I wrote a blog post on how to create manhattan plots in R. After hundreds of comments pointing out bugs and other issues, I've finally cleaned up this code and turned it into an R package.

The qqman R package is on CRAN: http://cran.r-project.org/web/packages/qqman/

The source code is on GitHub: https://github.com/stephenturner/qqman

If you'd like to cite the qqman package (appreciated but not required), please cite this pre-print: Turner, S.D. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. biorXiv DOI: 10.1101/005165 (2014).

Something wrong? Please file bug reports, feature requests, or anything else related to the code as an issue on GitHub rather than commenting here. Also, please post the code you're using and some example data causing a failure in a publicly accessible place, such as a GitHub gist (no registration required). It's difficult to troubleshoot if I can't see the data where the code is failing. Want to contribute? Awesome! Send me a pull request.

Note: This release is substantially simplified for the sake of maintainability and creating an R package. The old code that allows confidence intervals on the Q-Q plot and allows more flexible annotation and highlighting is still available at the version 0.0.0 release on GitHub.

Here's a shout-out to all the blog commenters on the previous post for pointing out bugs and other issues, and a special thanks to Dan Capurso and Tim Knutsen for useful contributions and bugfixes.


qqman package tutorial

First things first. Install the package (do this only once), then load the package (every time you start a new R session)
# only once:
install.packages("qqman")

# each time:
library(qqman)
You can access this help any time from within R by accessing the vignette:
vignette("qqman")
The qqman package includes functions for creating manhattan plots and q-q plots from GWAS results. The gwasResults data.frame included with the package has simulated results for 16,470 SNPs on 22 chromosomes. Take a look at the data:
str(gwasResults)
'data.frame':   16470 obs. of  4 variables:
 $ SNP: chr  "rs1" "rs2" "rs3" "rs4" ...
 $ CHR: int  1 1 1 1 1 1 1 1 1 1 ...
 $ BP : int  1 2 3 4 5 6 7 8 9 10 ...
 $ P  : num  0.915 0.937 0.286 0.83 0.642 ...
head(gwasResults)
  SNP CHR BP      P
1 rs1   1  1 0.9148
2 rs2   1  2 0.9371
3 rs3   1  3 0.2861
4 rs4   1  4 0.8304
5 rs5   1  5 0.6417
6 rs6   1  6 0.5191
tail(gwasResults)
          SNP CHR  BP      P
16465 rs16465  22 530 0.5644
16466 rs16466  22 531 0.1383
16467 rs16467  22 532 0.3937
16468 rs16468  22 533 0.1779
16469 rs16469  22 534 0.2393
16470 rs16470  22 535 0.2630
How many SNPs on each chromosome?
as.data.frame(table(gwasResults$CHR))
   Var1 Freq
1     1 1500
2     2 1191
3     3 1040
4     4  945
5     5  877
6     6  825
7     7  784
8     8  750
9     9  721
10   10  696
11   11  674
12   12  655
13   13  638
14   14  622
15   15  608
16   16  595
17   17  583
18   18  572
19   19  562
20   20  553
21   21  544
22   22  535

Creating manhattan plots

Now, let's make a basic manhattan plot.
manhattan(gwasResults)

We can also pass in other graphical parameters. Let's add a title (main=), reduce the point size to 50%(cex=), and reduce the font size of the axis labels to 80% (cex.axis=):
manhattan(gwasResults, main = "Manhattan Plot", cex = 0.5, cex.axis = 0.8)

Let's change the colors and increase the maximum y-axis:
manhattan(gwasResults, col = c("blue4", "orange3"), ymax = 12)

Let's remove the suggestive and genome-wide significance lines:
manhattan(gwasResults, suggestiveline = F, genomewideline = F)

Let's look at a single chromosome:
manhattan(subset(gwasResults, CHR == 1))

Let's highlight some SNPs of interest on chromosome 3. The 100 SNPs we're highlighting here are in a character vector called snpsOfInterest. You'll get a warning if you try to highlight SNPs that don't exist.
str(snpsOfInterest)
 chr [1:100] "rs3001" "rs3002" "rs3003" "rs3004" "rs3005" ...
manhattan(gwasResults, highlight = snpsOfInterest)

We can combine highlighting and limiting to a single chromosome:
manhattan(subset(gwasResults, CHR == 3), highlight = snpsOfInterest, main = "Chr 3")

A few notes on creating manhattan plots:
  • Run str(gwasResults). Notice that the gwasResults data.frame has SNP, chromosome, position, and p-value columns named SNP, CHR, BP, and P. If you're creating a manhattan plot and your column names are different, you'll have to pass the column names to the chr=, bp=, p=, and snp= arguments. See help(manhattan) for details.
  • The chromosome column must be numeric. If you have “X,” “Y,” or “MT” chromosomes, you'll need to rename these 23, 24, 25, etc.
  • If you'd like to change the color of the highlight or the suggestive/genomewide lines, you'll need to modify the source code. Search for col="blue", col="red", or col="green3" to modify the suggestive line, genomewide line, and highlight colors, respectively.

Creating Q-Q plots

Creating Q-Q plots is straightforward - simply supply a vector of p-values to the qq() function. You can optionally provide a title.
qq(gwasResults$P, main = "Q-Q plot of GWAS p-values")


Tuesday, May 6, 2014

Mycoplasma Contamination in Cell-Line Based Experiments

For a few years now, my EvoSTAR colleague, Bill Langdon, has been exploring the degree to which Mycoplasma bacteria have contaminated experimental systems and even "infected" online databases with the contents of their genomes.  He and his colleagues have previously shown that Mycoplasma genome sequences have previously been mislabeled as human sequences in several online resources (GenBank, dbEST, and RefSeq).

Early microarray designs were based largely on ESTs from these resources, and as a result, the Affymetrix HG-U133 plus 2.0 array contains probes for Mycoplasma sequences.  Details for these probes can be found here.  Exploiting these probes, Bill and colleagues have also examined the Gene Expression Omnibus for evidence of Mycoplasma contamination, and found around 30 studies (roughly 1% of GEO) that show high expression for this probe, the vast majority of which were from cell cultures.

By their proclivity to infect human experimental cell lines, Bill has playfully described Mycoplasma as having evolved the ability to transmit their genes into online databases.



Continuing this pursuit, Bill recently published an article in BMC BioData Mining illustrating Mycoplasma contamination of the 1000 Genomes Project.  It is unclear what the implications of this contamination are for the integrity of 1000 Genomes Data, as the majority of identified Mycoplasma reads to not map to the human reference genome.  This work should however serve as a bellwether to those performing experiments, or using experimental data from treated cell lines.  In these situations, any contamination might severely taint experimental results.

Tuesday, April 22, 2014

Russ Altman's Translational Bioinformatics Year in Review

A few weeks ago the 2014 AMIA Translational Bioinformatics Meeting (TBI) was held in beautiful San Francisco.  This meeting is full of great science that spans the divide between molecular and clinical research, but a true highlight of this meeting is the closing keynote, traditionally given by Russ Altman.  Each year, he highlights the most exciting articles/publications in translational bioinformatics, and presents them all in a tidy, entertaining talk.  You can find his blog here, along with a link to his slide deck.  

To make accessing the papers a bit easier, here is a list of links.

Controversies:



Clinical Genomics:



Drugs:



Genetic Basis of Disease:



Emerging Data Sources:



Mice:



The Scientific Process:



Odds & Ends:






Tuesday, April 8, 2014

Unsuck your writing

I recently found this little gem of a web app that analyzes the clarity of your writing. Hemingway highlights long, complex, and hard to read sentences. It also highlights complex words where a simple one would do, and highlights adverbs, suggesting you use a stronger verb instead. It highlights passive voice (bad!), and tells you the minimum reading grade level necessary to understand your writing.

When I pasted in some text from an abstract I submitted to ASHG years ago it showed me just how terrible and difficult to understand my scientific writing really is. My abstract text, which should have been hard-hitting and easy to understand at a glance, required a minimum grade 20 reading level. The majority of my 14 sentences were very hard to read and littered with too many adverbs, complicated words, and several uses of passive voice. (I still got a talk out of the submission, so maybe we as scientists enjoy reading tortuous verbiage...).



It looks like a desktop version is in the works, but the web app seemed to work fine, even for a 100,000-word manuscript I tried.

http://www.hemingwayapp.com/

Thursday, March 20, 2014

Visualize coverage for targeted NGS (exome) experiments

I'm calling variants from exome sequencing data and I need to evaluate the efficiency of the capture and the coverage along the target regions.

This sounds like a great use case for bedtools, your swiss-army knife for genomic arithmetic and interval manipulation. I'm lucky enough to be able to walk down the hall and bug Aaron Quinlan, creator of bedtools, whenever I have a "how do I do X with bedtools?" question (which happens frequently).

As open-source bioinformatics software documentation goes, bedtools' documentation is top-notch. In addition, Aaron recently pointed out a work-in-progress bedtools cookbook that he's putting together, giving code examples for both typical and clever uses of bedtools.

Getting back to my exome data, one way to visualize this is to plot the cumulative distribution describing the fraction of targeted bases that were covered by >10 reads, >20 reads, >80 reads, etc. For example, covering 90% of the target region at 20X coverage may be one metric to assess your ability to reliably detect heterozygotes. Luckily for me, there's a bedtools protocol for that.

The basic idea is that for each sample, you're using bedtools coverage to read in both a bam file containing your read alignments and a bed file containing your target capture regions (for example, you can download NimbleGen's V3 exome capture regions here). The -hist option outputs a histogram of coverage for each feature in the BED file as well as a summary histogram for all of the features in the BED file. Below I'm using GNU parallel to run this on all 6 of my samples, piping the output of bedtools to grep out only the lines starting with "all."



Now that I have text files with coverage histograms for all the regions in the capture target, I can now plot this using R.



You can see that sample #2 had some problems. Relative to the rest of the samples you see it take a quick nose-dive on the left side of the plot relative to the others. Running this through Picard showed that 86% of the reads from this sample were duplicates.



Thanks, Aaron, for the tips.

Bedtools protocols: https://github.com/arq5x/bedtools-protocols

Wednesday, March 12, 2014

Software Carpentry at UVA, Redux


Software Carpentry is an international collaboration backed by Mozilla and the Sloan Foundation comprising a team of volunteers that teach computational competence and basic programming skills to scientists. In addition to a suite of online lessons, Software Carpentry also runs two-day on-site bootcamps to teach researchers skills such as using the Unix shell, programming in Python or R, using Git and GitHub for version control, managing data with SQL, and general programming best practices.

It was just over a year ago when I organized UVA's first bootcamp. Last year we reached our 50-person registration limit and had nearly 100 people on the wait list in less than two days. With support from the the Center for Public Health Genomics, the Health Sciences Library, and the Library's Research Data Services, we were able to host another two-day bootcamp earlier this week (we maxed out our registration limit this year as well). A few months ago I started Software Carpentry's training program, which teaches scientists how to teach other scientists how to program. It was my pleasure to be an instructor at this year's bootcamp along with Erik Bray and Mike Hansen.



Erik kicked off day one with a short introduction to what Software Carpentry is all about as well as setting the stage for the rest of the bootcamp -- as more fields of research become increasingly more data rich, computational skills become ever more critical.

I started the morning's lessons on using the Unix shell to get more stuff done in less time. Although there were still a few setup hiccups, things went a lot smoother this year because we provided a virtual machine with all of the necessary tools pre-installed.

We spent the rest of the morning and early afternoon going over version control with Git and collaboration using GitHub. I started out with the very basics -- the hows and whys of using version control, staging, committing, branching, merging, and conflict resolution. After lunch Erik and I did a live demonstration of two different modes of collaboration using GitHub. In the first, I pushed to a repo on GitHub and gave Erik full permissions to access and push to this repo. Here, we pushed and pulled to and from the same repo, and demonstrated what to do in case of a merge conflict. In the second demonstration we used the fork and pull model of collaboration: I created a new repo, Erik forked this, made some edits (using GitHub's web-based editor for simplicity), and submitted a pull request. After the demo, we had participants go through the same exercise -- creating their own repos with feedback about the course so far, and submitting pull requests to each other.

With the remaining hours in the afternoon, Erik introduced Python using the IPython notebook. Since most people were using the virtual machine we provided (or had already installed Anaconda), we had very minimal Python/IPython/numpy version and setup issues that may have otherwise plagued the entire bootcamp (most participants were using Windows laptops). By the end of the introductory python session, participants were using Python and NumPy to simulate logistic population growth with intermittent catastrophic population crashes, and using matplotlib to visualize the results.



Next, Mike introduced the pandas data analysis library for Python, also using an IPython notebook for teaching. In this session, participants used pandas to import and analyze year's worth of weather data from Weather Underground. Participants imported a CSV file, cleaned up the data, parsed dates written as text to create python datetime objects, used the apply function to perform bulk operations on the data, learned how to handle missing values, and synthesized many of the individual components taught in this and the previous session to partition out and perform summary operations on subsets of the data that matched particular criteria of interest (e.g., "how many days did it rain in November when the minimum temperature ranged from 20 to 32 degrees?").

Erik wrapped up the bootcamp with a session on testing code. Erik introduced the concept of testing by demonstrating the behavior of a function without revealing the source code behind it. Participants were asked to figure out what the function did by writing various tests with different input. Finally, participants worked in pairs to implement the function such that all the previously written tests would not raise any assertion errors.

Overall, our second Software Carpentry bootcamp was a qualitative success. The fact that we maxed out registration and filled a wait list within hours two years in a row demonstrates the overwhelming need for such a curriculum for scientists. Science across nearly every discipline is becoming ever more quantitative; researchers are realizing that to be successful, not only do you need to be a good scientist, a great writer, an eloquent speaker, a skilled graphic designer, a clever marketer, an efficient project manager, etc., but that you'll also need to know some programming and statistics also. This week represented the largest Software Carpentry event ever, with simultaneous bootcamps at the University of Virginia, Purdue, New York University, UC Berkeley, and the University of Washington. I can only imagine this trend will continue for the foreseeable future.

Thursday, February 20, 2014

Data Analysis for Genomics MOOC

Last month I told you about Coursera's specializations in data science, systems biology, and computing. Today I was reading Jeff Leek's blog post defending p-values and found a link to HarvardX's Data Analysis for Genomics course, taught by Rafael Irizarry and Mike Love. Here's the course description:

Data Analysis for Genomics will teach students how to harness the wealth of genomics data arising from new technologies, such as microarrays and next generation sequencing, in order to answer biological questions, both for basic cell biology and clinical applications.

The purpose of this course is to enable students to analyze and interpret data generated by modern genomics technology, specifically microarray data and next generation sequencing data. We will focus on applications common in public health and biomedical research: measuring gene expression differences between populations, associated genomic variants to disease, measuring epigenetic marks such as DNA methylation, and transcription factor binding sites.

The course covers the necessary statistical concepts needed to properly design experiments and analyze the high dimensional data produced by these technologies. These include estimation, hypothesis testing, multiple comparison corrections, modeling, linear models, principal component analysis, clustering, nonparametric and Bayesian techniques. Along the way, students will learn to analyze data using the R programming language and several packages from the Bioconductor project.

Currently, biomedical research groups around the world are producing more data than they can handle. The training and skills acquired by taking this course will be of significant practical use for these groups. The learning that will take place in this course will allow for greater success in making biological discoveries and improving individual and population health.


If you've ever wanted to get started with data analysis in genomics and you'd learn R along the way, this looks like a great place to start. The course is set to start April 7, 2014.

HarvardX: Data Analysis for Genomics

Tuesday, February 11, 2014

There is no Such Thing as Biomedical "Big Data"

At the moment, the world is obsessed with “Big Data” yet it sometimes seems that people who use this phrase don’t have a good grasp of its meaning.  Like most good buzz-words, “Big Data” sparks the idea of something grand and complicated, while sounding ordinary enough that listeners feel like they have an intuitive understanding of the concept.  However "Big Data" actually carries a specific technical meaning which is getting lost as the term becomes more popular.    

The phrase's predecessor, "Data Mining" was equally misunderstood.  Originally called "database mining" (a subsequently trademarked term), the term "Data Mining" became common during the 1990s as many businesses rapidly adopted the use of relational database management systems (RDMS) such as Oracle.  RDMS store, optimize, and manage large amounts of data on physical disks for the purpose of rapid search, retrieval and update.  These large collections of data enabled businesses to extract new knowledge useful to their business practices by examining patterns within their data.  Data mining refers to a collection of algorithms that attempt to extract knowledge (in the form of rules or associations) from large amounts of data by processing it in place on the disk, either within the RDMS or within large flat files.  This is an important distinction, as the optimization and speed of algorithms that access data from the disk can be quite different from those which examine data within active memory.

A great example of data mining in practice is the use of frequent itemset mining to target customers with coupons and other discounts.  Ever wonder why you get a yogurt coupon at the register when you check out?  That’s because across thousands of other customers, a subgroup of people with shopping habits similar to yours consistently buys yogurt, and perhaps with a little prompting the grocery vendor can get you to consistently buy yogurt too.

As random access memory (RAM) prices dropped (and virtual memory procedures within operating systems improved), it became much more feasible to process even extremely large datasets within active memory, reducing the need for algorithmic refinements necessary for disk-based processing. But by then, "data mining" (marketed as a way to increase business profits) had become such a popular buzz-word, it was used to refer to any type of data analysis. 

 
In fact, it has been tacked on to numerous books and publications about machine learning methods purely for marketing purposes.  Supposedly even some publishers have modified the titles of machine learning books to include the phrase “data mining” with the hope that it will improve sales.  As a result, the colloquial meaning of data mining has become "a vaguely defined way to discover patterns in data".  To me, this is a tragedy because we have lost a degree of specificity in our language simply because “data mining” sounds cool and profitable.   

Zoom forward to the present day and we see history repeating with the phrase “big data”.  With the increasing popularity of cell phone technologies and the internet, the last decade has seen a dramatic growth in data generated by commercial transactions and online websites.   Many of these large companies (think Google's Search Indices or Facebook and LinkedIn network data) generate data on such a large scale that it cannot be managed within traditional database systems.  These groups have instead turned to large computing clusters that distribute the data over many, many separate machines and file systems.  Partitioning data in this way requires a new class of algorithms that can take advantage of the fact that individual processing units (or nodes of a computing cluster) house their own subset of the overall data.  This is a fundamental paradigm of “Big Data” algorithms, making it distinct from other machine learning and data mining techniques.

A great example of a “Big Data” programming model is the MapReduce framework developed by Google.  The basic idea is that any data manipulation step has a Map function that can be distributed over many, many nodes on a computing cluster that then filter and sort their own portion of the data.  A Reduce function is then performed that combines the selected and sorted data entries into a summary value.  This model is implemented by the popular Big Data system Apache Hadoop.

All of the fuss over “Big Data” is driven by these massive producers of data (on the order of hundreds of terabytes a day), yet the ideas behind “Big Data” are being applied on much smaller datasets even when they are not necessary.  In fact, a rather amusing read from Microsoft Research describes the overhype of “Big Data” algorithms and the surprisingly few analytic operations that truly need these approaches.  The hype is alive and well in the medical and biological research community as well.  In fact, there is an NIH initiative to fund “Big Data to Knowledge”.  I'm the first to cheer for projects dedicated to large-scale data analysis, but by nearly any definition, right now there is no such thing as biomedical Big Data.     

There are certainly processes in biomedical research that produce large amounts of data – first among them is next generation sequencing technology.  In sequencing studies, the raw data from sequencers is aligned and processed to extract the meaningful information (i.e. SNP and CNV calls).  After processing, a full human genome will nearly fit on a floppy disk, which hardly qualifies as "Big Data".  While there may be some interest in storing the raw underlying data (sequence reads), it may prove much more cost effective to simply regenerate the data.  Based on an excellent analysis by Glenn Lockwood, storing four weeks worth of HiSeq X10 raw data may cost nearly $10,000 a month.  If instead we store derived features from the raw data, data storage and manipulation is on the order of typical imputed GWAS.  There will undoubtedly be a desire to reprocess this raw sequence data with new algorithms, but unless storage prices drop rapidly, regenerating the data will be more cost-effective than storage.  Therefore, in my opinion right now the closest thing to qualifying as Big Data would be large multi-center electronic medical record systems, yet even these are typically managed by large-scale relational database systems.

So in practice, our grants will be filled with mentions of Big Data, Web 3.0, "thinking outside the box", value added, hype/innovation, but in reality biomedical sciences are nowhere near approaching the scale needed for real Big Data approaches.    

Thanks to Alex Fish for her thoughtful edits.
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.