Thursday, October 31, 2013

Real-time streaming differential RNA-seq analysis with eXpress

RNA-seq has been performed routinely for at least 5+ years, yet there is no consensus on the best methodology for analyzing this data. For example, Eduardo Eyras's group recently posted a pre-print on methods to study splicing from RNA-seq, where this great figure was shown:

This illustrates the problem clearly: ignoring alternative workflows that simply count reads mapping to genes or exons and doing a negative binomial-based test, there are thousands of potential paths through an isoform-resolution RNA-seq analysis.

While the gene/exon-count based approach is simpler and arguably more powerful and well-characterized, there are numerous potential problems with this approach, as outlined in the recent cuffdiff 2 paper:

Reproduced from the cuffdiff2 paper under fair use.

I'm not going to go into the merits of feature-count methods versus transcript deconvolution methods - that discussion is best settled by others.

But if you have the coverage needed for isoform reconstruction, perhaps the most commonly-trodden path through the transcript-resolution differential expression analysis is using TopHat for alignment, Cufflinks for assembly, and Cuffdiff for quantitation and differential expression analysis, as described in this recent protocol.

However, if you've ever used cufflinks with lots of samples or lots of reads, you've noted the exponential increase in computational resources necessary to run the analysis. This problem, as well as the performance of an alternative approach, is illustrated in Fig. 2b in the recent publication about a new tool from Lior Pachter's lab, the un-Google-ably named eXpress, for streaming real-time fragment assignment of RNA-seq data:

Reproduced from the eXpress paper under Fair Use.

I won't attempt to explain here how eXpress works other than to tell you it's EM-based, and to direct you to the paper for a fuller explanation. As you can see from the figure above, the resource requirement with more samples and higher coverage increases linearly, consuming only slightly more RAM than the UNIX wc (word count) command, all the while maintaining accuracy comparable to or slightly better than existing state-of-the-art methods like Cufflinks or RSEM.

So, what's the hold-up? Why isn't everyone using eXpress for differential gene/transcript expression with RNA-seq data? Personal preferences and allegiances aside, part of the reason might be because eXpress's estimated fragment abundances are not true counts, and are not optimally treated as such. Strictly speaking, you can't simply take the transcript abundances you get out of eXpress and throw them into a count-based test like those implemented in edgeR or DESeq, and expect those results to be robust and accurate optimal. What's lacking is a mathematical framework (and a user-friendly R or other software package) for conducting statistical analysis of differential abundance across multiple samples of transcript abundances as estimated by eXpress.

I ran into Lior at the CSHL Genome Informatics meeting this morning, and pressed him on when we might see an R package for statistically analyzing eXpress-estimated isoform abundances, and I was told we would see something within a month. I'm going to hold you to that, Lior, so keep an eye out on Bioconductor and Lior's blog for the much needed and long-awaited statistical framework and R package to do this analysis.

#GI2013 folks, I'll see you at the poster session and reception. And to everyone else, as always, keep calm and sequence on.


  1. Great post Stephen - thanks for pointing to eXpress. with the new Proton landing in my lab in the next week; I going to be doing a fair amount of RNAseq over the next few months.

  2. We're very excited about eXpress, because of its efficiency, ease of use, and ability to smoothly handle allele-specific count estimation. I don't really understand the DE statistical issue you allude to, but their paper does recommend putting their "effective count" into a program like edgeR. I imagine that would be just fine, since the DE models come with a huge grain of salt already.

    One thing I would point out about eXpress, which I have posted to their google discussion group and have been corresponding with Adam about, is that currently its results are rather sensitive to the order of the transcripts in the fasta file, at least for transcripts which have a duplicate or near-duplicate (either naturally, or because one has constructed a diploid set for getting allele-specific counts). I hope to see a bug fix or something on this in the near future because we really want to use this tool.

    Another tool which seems to get little press is STAR. In our testing it seems to do a fine job replacing bowtie2, and in about 1/20 of the time. I might mention also that the eXpress authors recommend very relaxed settings for the aligner; using these settings, I could not even get bowtie2 to finish at all on our ~100k transcript set. STAR, meanwhile, runs in under an hour, taking around 15G RAM (we dialed down the index size settings slightly). We completed 80 samples using STAR+eXpress in 36 hours, about 1 billion reads altogether.

    1. Thanks WNelson. I don't remember exactly where I heard it - it may have been in Lior's talk when he came to UVA - but I've definitely heard it isn't appropriate to use eXpress abundances in count-based statistics. (Whether you wind up with the same or very similar results in the end -- that remains to be seen!).

      I have also been very happy with STAR for some time now. There was a poster on it here at CSHL genome informatics, wrt circular RNA finding.

    2. Update, eXpress did have a rather significant bug, which has been fixed in the recent 1.5 release.

  3. We've also have good experience with eXpress. The quantification, if run in streaming mode with the alignment, takes hardly any more time than the read alignment itself. The 'effective counts' Pearson correlate extremely highly with RSEM's 'expected counts'.

    For DE, we use EBSeq, in part because at least qualitatively it maintains eXpress' emphasis on isoform-resolution. The eXpress documentation explicitly says that, while there's currently no DE tool that takes full advantage of the eXpress model, the rounded effective counts are fine to use with existing DE tools. We find that the EBSeq posterior probability of DE again correlates extremely highly when run on eXpress and RSEM output.

    WNelson's suggestion of STAR looks extremely promising.

    1. Regarding EBSeq, do you think it has an advantage over edgeR in the case where one has sufficient biological replicates? Noting that both tools use a negative binomial model, and edgeR fits the dispersion based on the replicate counts, which seems like it would automatically pick up any isoform-specific variability...

    2. I don't think there's a clear advantage in accuracy, but we use it because it gives us the ability to specify arbitrary experimental designs. With edgeR and DESeq, we're stuck with pairwise comparisons or multiple testing correction, where we tend to lose a lot of power. Also EBSeq (and baySeq, but it's much slower) give a Bayesian probability of differential expression, which can be fed into subsequent probabilistic reasoning models like belief networks. Can't do that with p-values.

    3. Interesting, what kind of experimental designs are you doing? It's not something we have considered at all.


Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.