Thursday, October 31, 2013

Real-time streaming differential RNA-seq analysis with eXpress

RNA-seq has been performed routinely for at least five years, yet there is no consensus on the best methodology for analyzing the data. For example, Eduardo Eyras's group recently posted a pre-print on methods to study splicing from RNA-seq, which included this great figure:



This illustrates the problem clearly: setting aside workflows that simply count reads mapping to genes or exons and test for differences with a negative binomial model, there are thousands of potential paths through an isoform-resolution RNA-seq analysis.

While the gene/exon-count based approach is simpler and arguably more powerful and better characterized, it has numerous potential problems, as outlined in the recent cuffdiff 2 paper:

Reproduced from the cuffdiff 2 paper under fair use.

I'm not going to go into the merits of feature-count methods versus transcript deconvolution methods - that discussion is best settled by others.

But if you have the coverage needed for isoform reconstruction, perhaps the most commonly trodden path through a transcript-resolution differential expression analysis is using TopHat for alignment, Cufflinks for assembly, and Cuffdiff for quantitation and differential expression analysis, as described in this recent protocol.
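
For orientation, a minimal sketch of that workflow for two conditions might look like the following. File names, index names, and thread counts here are placeholders; the protocol linked above is the authoritative reference:

    # Align reads from each sample to the genome (Bowtie index "genome")
    tophat -p 8 -G genes.gtf -o th_ctrl genome ctrl_1.fq ctrl_2.fq
    tophat -p 8 -G genes.gtf -o th_case genome case_1.fq case_2.fq

    # Assemble transcripts separately for each sample
    cufflinks -p 8 -o cl_ctrl th_ctrl/accepted_hits.bam
    cufflinks -p 8 -o cl_case th_case/accepted_hits.bam

    # Merge the per-sample assemblies into a single transcriptome
    ls cl_*/transcripts.gtf > assemblies.txt
    cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt

    # Quantify and test for differential expression between conditions
    cuffdiff -o diff_out -b genome.fa -p 8 -L ctrl,case merged_asm/merged.gtf th_ctrl/accepted_hits.bam th_case/accepted_hits.bam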

However, if you've ever used Cufflinks with lots of samples or lots of reads, you've noticed the exponential increase in computational resources necessary to run the analysis. This problem, along with the performance of an alternative approach, is illustrated in Fig. 2b of the recent publication about a new tool from Lior Pachter's lab, the un-Google-ably named eXpress, for streaming real-time fragment assignment of RNA-seq data:

Reproduced from the eXpress paper under fair use.

I won't attempt to explain here how eXpress works other than to tell you it's EM-based, and to direct you to the paper for a fuller explanation. As you can see from the figure above, the resources required grow linearly with more samples and higher coverage, and eXpress consumes only slightly more RAM than the UNIX wc (word count) command, all while maintaining accuracy comparable to or slightly better than existing state-of-the-art methods like Cufflinks and RSEM.
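
To give a flavor of the streaming mode, here is a minimal sketch of piping multi-mapped Bowtie alignments straight into eXpress. The index and file names are made up, and the Bowtie options follow the recommendations in the eXpress documentation as I understand them:

    # Build a Bowtie index on the transcript sequences (not the genome)
    bowtie-build transcripts.fa transcripts

    # Report all alignments (-a) and stream SAM output straight into eXpress,
    # which assigns ambiguous fragments probabilistically as they arrive
    bowtie -aS -X 800 --offrate 1 transcripts -1 reads_1.fq -2 reads_2.fq | express transcripts.fa

    # Estimated abundances are written to results.xprs in the working directory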

So, what's the hold-up? Why isn't everyone using eXpress for differential gene/transcript expression with RNA-seq data? Personal preferences and allegiances aside, part of the reason might be that eXpress's estimated fragment abundances are not true counts, and are not optimally treated as such. Strictly speaking, you can't simply take the transcript abundances you get out of eXpress, throw them into a count-based test like those implemented in edgeR or DESeq, and expect the results to be robust and accurate. What's lacking is a mathematical framework (and a user-friendly R or other software package) for statistical analysis of differential abundance across multiple samples, using transcript abundances as estimated by eXpress.

I ran into Lior at the CSHL Genome Informatics meeting this morning and pressed him on when we might see an R package for statistically analyzing eXpress-estimated isoform abundances; I was told we would see something within a month. I'm going to hold you to that, Lior, so keep an eye on Bioconductor and Lior's blog for the much-needed and long-awaited statistical framework and R package to do this analysis.

#GI2013 folks, I'll see you at the poster session and reception. And to everyone else, as always, keep calm and sequence on.

Monday, October 28, 2013

Analysis of #ASHG2013 Tweets

I archived and analyzed all Tweets with the hashtag #ASHG2013 using my previously mentioned code.

Number of Tweets by date shows Wednesday was the most Tweeted day:

The top used hashtags other than #ASHG2013:

The most prolific users:

And what Twitter analysis would be complete without the widely loved, and more widely hated, word cloud:

Edit 8:24am: I've gotten notes that some Tweets were not captured in this archive. This year's ASHG was very actively Tweeted, and unfortunately there are API limits restricting how many results the t Twitter command-line client can return. A search during a particularly active time of day might have truncated some results. Please feel free to send me a pull request if you think there's something I can do to improve the automated search code!
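
For the curious, the archiving step boils down to something like the sketch below, run periodically from cron. The exact flags in my repo may differ, and the CSV handling here is simplified:

    # Grab recent Tweets tagged #ASHG2013 and append them to a running archive
    t search all "#ASHG2013" --csv --number 1000 >> ashg2013_tweets.csv

    # Drop duplicates captured by overlapping searches (first field is the Tweet ID)
    sort -t, -k1,1 -u ashg2013_tweets.csv > ashg2013_tweets_dedup.csv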

Tuesday, October 22, 2013

PubMed Commons: One post-publication peer review forum to rule them all?


Several post-publication peer review forums, such as Faculty of 1000 and PubPeer, already exist to facilitate discussion of papers after publication. F1000 only allows a small number of "faculty" to comment on articles, and reading the commentary requires a paid subscription. PubPeer and similar startup services lack the critical mass of participants needed to make such a community truly useful. And while Twitter and the blogosphere are great for discussing scientific research, commentary there is fragmented, scattered across many blogs, Google+, and Twitter feeds (not to mention that the discussion content is owned by Twitter, Google, and other dot-com services with no guarantee of permanence).

Enter PubMed Commons.

PubMed Commons is a system that enables researchers to share their opinions about scientific publications. Researchers can comment on any publication indexed by PubMed, and read the comments of others. PubMed Commons is a forum for open and constructive criticism and discussion of scientific issues. It will thrive with high quality interchange from the scientific community. PubMed Commons is currently in a closed pilot testing phase, which means that only invited participants can add and view comments in PubMed.

PubMed Commons is currently in an invitation-only pilot phase. But if it catches on, it could easily become the post-publication peer review forum that solves many of the problems mentioned above. PubMed Commons will be open to anyone who registers with a MyNCBI ID, meaning discussion will not be limited to a select few "faculty". Discussion will be collated and hosted by NCBI/NLM, which I like to think has a longer half-life than Google's latest foray into social networking or other dot-com ventures. Most importantly, PubMed use is ubiquitous: whether you search PubMed directly or land on an abstract from Google Scholar, you'll almost always link out to a paper from a PubMed abstract. This means the momentum and critical mass needed to make a forum like this actually useful already exist.

The platform for publishing comments looks pretty simple (and even supports Markdown syntax!):


Another critical feature is the ability to invite the authors of the paper to join the discussion:


Right now PubMed Commons is invitation-only, but I'm optimistic about the public launch and hope to see this catch on.

PubMed Commons: http://www.ncbi.nlm.nih.gov/pubmedcommons/

Monday, October 21, 2013

Useful Unix/Linux One-Liners for Bioinformatics

Much of the work that bioinformaticians do is munging and wrangling massive amounts of text. While there are some "standardized" file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times when knowing a little bit of Unix/Linux is extremely helpful, namely awk, sed, cut, grep, GNU parallel, and others.

This is by no means an exhaustive catalog, but I've put together a short list of examples using various Unix/Linux utilities for text manipulation, from the very basic (e.g., sum a column) to the very advanced (munge a FASTQ file and print the total number of reads, the number of unique reads, the percentage of unique reads, the most abundant sequence, and its frequency). Most of these examples (with the exception of the seqtk examples) use built-in utilities installed on nearly every Linux system. They are a combination of tactics I use every day and examples culled from the sources listed at the top of the page.
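
Here's a taste: one example from each end of the spectrum (the second assumes a standard four-line-per-record FASTQ file):

    # Sum the values in the first column of a file
    awk '{sum += $1} END {print sum}' file.txt

    # Print the total number of reads, number of unique reads, percent unique,
    # most abundant sequence, and its count and percent of total
    cat reads.fq | awk '((NR-2)%4==0){read=$1; total++; count[read]++} END{for(read in count){if(!max||count[read]>max){max=count[read]; maxRead=read}; if(count[read]==1){unique++}}; print total, unique, unique*100/total, maxRead, count[maxRead], count[maxRead]*100/total}'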



The list is available as a README in this GitHub repo. The list is a start: I would love suggestions for other things to include. To make a suggestion, leave a comment here, or better, open an issue, or better still, send me a pull request.

Useful one-liners for bioinformatics: https://github.com/stephenturner/oneliners

Alternatively, download a PDF here.

Thursday, October 10, 2013

De Novo Transcriptome Assembly with Trinity: Protocol and Videos

One of the clearest advantages RNA-seq has over array-based technology for studying gene expression is that it requires neither a reference genome nor a pre-existing oligo array. De novo transcriptome assembly allows you to study non-model organisms, cancer cells, or environmental metatranscriptomes. One of the challenges with de novo transcriptome assembly, above and beyond all the challenges associated with genome assembly, is the highly variable abundance (and thus uneven sequencing depth) of different transcripts in a cell.

Several tools have been developed for de novo transcriptome assembly. One of the most widely used is Trinity, developed at the Broad Institute. Trinity is free and open-source, and a recent Nature Protocols article walks through using it for de novo RNA-seq assembly:

Haas, Brian J., et al. "De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis." Nature Protocols 8.8 (2013): 1494-1512.
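
For a quick sense of what running Trinity looks like, here is a minimal sketch along the lines of the protocol, assuming paired-end FASTQ input; the file names and the memory/CPU settings are placeholders:

    # Run Trinity on paired-end reads; --JM sets memory for the Jellyfish
    # k-mer counting step and --CPU the number of threads
    Trinity.pl --seqType fq --JM 20G --left reads.left.fq --right reads.right.fq --CPU 6

    # For strand-specific libraries (see the last video below), add the library
    # type, e.g. --SS_lib_type RF for dUTP-based protocols

    # Assembled transcripts are written to trinity_out_dir/Trinity.fasta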

In addition, Trinity's creator, Brian Haas, has published four videos on YouTube on de novo transcriptome assembly using Trinity (via RNA-Seq Blog):

Introduction to De Novo RNA-Seq Assembly using Trinity



The General Approach to De novo RNA-Seq Assembly Using De Bruijn Graphs



Trinity - How it works



Strand-specific RNA-Seq is Preferred



Finally, if you're at UVA, we'll be hosting a transcriptome assembly workshop here in November, and registration will be opening soon.