Monday, December 17, 2012

Differential Isoform Expression With RNA-Seq: Are We Really There Yet?

In case you missed it, a new paper was published in Nature Biotechnology on a method for detecting isoform-level differential expression with RNA-seq data:

Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature Biotechnology (2012).


RNA-seq enables transcript-level resolution of gene expression, but there is no proven methodology for simultaneously accounting for biological variability across replicates and uncertainty in mapping fragments to isoforms. One of the most commonly used workflows is to map reads with a tool like Tophat or STAR, use a tool like HTSeq to count the number of reads overlapping a gene, then use a negative-binomial count-based approach such as edgeR or DESeq to assess differential expression at the gene level.  
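Conceptually, that standard gene-level workflow reduces to tallying read-to-gene assignments into a count table that the statistical model then sees. A minimal Python sketch of the counting step (hypothetical gene names; the real overlap resolution and negative binomial testing are what HTSeq and edgeR/DESeq do):

```python
from collections import Counter

# Hypothetical read-to-gene assignments, as an HTSeq-style counter
# would produce from a BAM file (reads overlapping no gene are dropped).
sample1_hits = ["geneA", "geneA", "geneB", "geneA", "geneB", "geneC"]
sample2_hits = ["geneA", "geneB", "geneB", "geneB", "geneC", "geneC"]

counts1 = Counter(sample1_hits)
counts2 = Counter(sample2_hits)

# A count-based method like edgeR/DESeq models these per-gene tallies
# with a negative binomial; here we just show the count table it sees.
for gene in sorted(set(counts1) | set(counts2)):
    print(gene, counts1[gene], counts2[gene])
```

Note that this tallies at the gene level only; information about which isoform each read came from is discarded at this step, which is exactly the problem the paper raises.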

Figure 1 in the paper illustrates the problem with existing approaches, which only count the number of fragments originating from either the entire gene or constitutive exons only. 

Excerpt from figure 1 from the Cuffdiff 2 paper.

In the top row, a change in gene expression is undetectable by counting reads mapping to any exon, and is underestimated if counting only constitutive exons. In the middle row, an apparent change would be detected, but in the wrong direction if using a count-based method alone rather than accounting for which transcript a read comes from and how long that transcript is. How often situations like the middle row occur in reality is anyone's guess.
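The middle-row scenario is easy to reproduce with toy numbers. In this hypothetical sketch (isoform lengths and expression levels are made up), fragment counts scale with molar expression times transcript length, so a shift toward a longer isoform can make gene-level counts move in the wrong direction:

```python
# Hypothetical two-isoform gene: lengths in kb, expression in arbitrary
# molar units. Fragment counts scale with expression * length.
LEN_SHORT, LEN_LONG = 1.0, 4.0

def gene_counts(expr_short, expr_long):
    return expr_short * LEN_SHORT + expr_long * LEN_LONG

# Condition A: mostly the short isoform. Condition B: less total
# expression overall, but shifted to the long isoform.
expr_a = (10.0, 1.0)   # total molar expression = 11
expr_b = (1.0, 6.0)    # total molar expression = 7 (a real decrease)

counts_a = gene_counts(*expr_a)  # 10*1 + 1*4 = 14
counts_b = gene_counts(*expr_b)  # 1*1 + 6*4 = 25

# Counting fragments over the whole gene says expression went UP,
# while true transcript output went DOWN.
assert sum(expr_b) < sum(expr_a) and counts_b > counts_a
```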


The method presented in this paper, implemented in the cuffdiff 2 program in the Cufflinks software package, claims to address both of these problems simultaneously by modeling variability in the number of fragments generated by each transcript across biological replicates with a beta negative binomial mixture distribution, which accounts for both sources of variability in a transcript's measured expression level. This so-called transcript deconvolution is neither computationally trivial nor easy to explain, but failure to account for the uncertainty (measurement error) in which transcript a fragment originated from can result in a high false-positive rate, especially when there is significant differential regulation of isoforms. Compared to existing methods, the procedure described here claims equivalent sensitivity with a much lower false-positive rate when there is substantial isoform-level variability in gene expression between conditions.
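The layering of noise sources can be illustrated with a toy simulation (this is not Cuffdiff 2's actual estimator, and all parameters are arbitrary): Poisson counting noise, then gamma-Poisson (negative binomial) biological variability, then a beta-distributed layer standing in for fragment-assignment uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Pure counting (shot) noise: Poisson, variance == mean.
pois = rng.poisson(50, n)

# Biological variability across replicates: negative binomial with
# mean r*(1-p)/p = 50, so variance now exceeds the mean.
r, p = 50, 0.5
nb = rng.negative_binomial(r, p, n)

# A second layer of uncertainty (e.g. in assigning fragments to
# transcripts), modeled by letting p itself vary: a beta negative binomial.
p_draws = rng.beta(20, 20, n)        # centered on 0.5
bnb = rng.negative_binomial(r, p_draws)

# Each added source of variability inflates the variance at a similar
# mean, which is what the model has to capture to control false positives.
assert pois.var() < nb.var() < bnb.var()
```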


Importantly, the manuscript also points out weaknesses in several undocumented "alternative" workflows that are often discussed on forums like SEQanswers and anecdotally at meetings. These alternative workflows are variations on a theme: combining transcript-level fragment count estimates (e.g., estimates from Cufflinks, eXpress, or RSEM mapping to a transcriptome) with downstream count-based analysis tools like edgeR or DESeq (both R/Bioconductor packages). The paper points out that none of these tools were meant to be used this way, and that doing so violates the assumptions of the statistics underlying both procedures. However, the authors concede that the variance modeling strategies of edgeR and DESeq are robust, and thus assessed the performance of these "alternative" workflows. The results of those experiments show that the algorithm presented in this paper, cuffdiff 2, outperforms the hybrid Cufflinks/RSEM + edgeR/DESeq workflows (see supplementary figure 77; yes, 77!).


In theory (and in the simulation studies presented here, see further comments below), the methodology presented here seems to outperform any other competing workflow. So why isn't everyone using it, and why is there so much grumbling about it on forums and at meetings? For many (myself included), the biggest issue is one of reproducibility. There are many discussions about cufflinks/cuffdiff providing drastically different results from one version to the next (see here, here, here, here, and here, for a start). The core I run operates in a production environment where everything I do must be absolutely transparent and reproducible. Reporting drastically different results to my collaborators whenever I update the tools I'm using is very alarming to a biologist, and reduces their confidence in the service I provide and the tools I use.

Furthermore, a recent methods paper compared their tool, DEXSeq, to several different versions of cuffdiff. Here, the authors performed two comparisons: a "proper" comparison, where replicates of treatments (T1-T3) were compared to replicates of controls (C1-C4), and a "mock" comparison, where controls (e.g. C1+C3) were compared to other controls (C2+C4). The most haunting result is shown below, where the "proper" comparison finds relatively few differentially expressed genes, while the "mock" comparison of controls versus other controls finds many, many more differentially expressed genes, and an increasing number with newer versions of cufflinks:

Table S1 from the DEXSeq paper.
This comparison predates the release of Cuffdiff 2, so perhaps this alarming trend ceases with the newer release of Cufflinks. However, it is worth noting that the data shown here are from a real dataset, while all the comparisons in the new Cuffdiff 2 paper were done with simulations. Having done some method development myself, I realize how easy it is to construct a simulation scenario to support nearly any claim you'd like to make.


Most RNA-seq folks would say that the field has a good handle on differential expression at the gene level, while differential expression at isoform-level resolution is still under development. I would tend to agree with this statement, but if cases as presented in Figure 1 of this paper are biologically important and widespread (they very well may be), then perhaps we have some re-thinking to do, even with what we thought were "simple" analyses at the gene level. 

What's your workflow for RNA-seq analysis? Discuss.


  1. Maybe I missed it, but I don't recall a comparison between cuffdiff 2 and the original cuffdiff in the paper. I would think that this is an "obvious" comparison. After all, if cuffdiff 2 warrants a Nature Biotech paper, then surely it is fundamentally different than the previously published cuffdiff?

    As new as cuffdiff 2 is, there are no published papers that have used it, only the original cuffdiff. The advances in cuffdiff 2, and the issues they address, therefore call into question a large body of papers based on the original cuffdiff...

    1. Good observation. That may be why a comparison with the original algorithm isn't presented here. If everything using cuffdiff 1.x.x is flawed in some way (as is implicitly suggested by the comparison presented by Anders et al. in the DEXSeq paper)...

  2. Good post!

    An aspect that was not mentioned in the paper was that edgeR and DESeq allow for complex factorial designs, which can be very useful for correcting for experimental batch effects, individual-specific differences etc., while Cuffdiff2 doesn't. This may in some cases be more important than the violations of assumptions that occur when doing e.g. DESeq on eXpress-derived isoform counts (by the way, they don't tell us what these violations are and how serious ... I am not knowledgeable enough to tell.) Therefore I would not throw out edgeR/DESeq from my toolbox even if Cuffdiff2 does work as advertised (which remains to be seen). I've started to run Cuffdiff2 and edgeR side by side on some projects and will try to get a feeling for how they work.

    1. Great point Mikael. Both the DESeq and edgeR vignettes give great examples of how your results are much better when you include potentially confounding factors in your design matrix - something that isn't possible (yet at least) in the cuffdiff framework. Post here again when you've done some of those comparisons - I'd love to see your thoughts.

    2. I am surprised the reviewers let them get away with that comparison. I would have liked to see how they compared with just the raw count data, which is how they are designed to work. And anyway, I think the reads that align uniquely to a single transcript contain most of the useful information.

      In any case, I think it is worth remembering that the information sharing models that these programs are using are complex and introduce a lot of biases into the results. For example, the way EdgeR and DESeq calculate variance make it more likely to call high variance genes differentially expressed. You might get a little more power than from, e.g. a good old fashioned t-test when you have a low number of replicates, but then when you try to use your gene lists to do some biology you have to scratch your head a lot to figure out whether the results are from the biology or your statistics, and almost all of your downstream analyses, like a proper gene ontology analysis, become very difficult to do properly.

  3. These arguments over best method are a lot easier if you have a reference with some sort of well designed spike-ins. Can't we create a set of experiments with total human RNA and ectopic RNA spike-ins of differing isoforms at known concentrations?

    Then use our methods to first fit the spike-in data back to an ectopic reference, and then the rest of the data with the usual ref and tests.

    Hasn't this been done?

    1. Great point. I suppose spike-ins (real or synthetic) aren't as simple for RNA-seq as they were for microarrays. Simulating or spiking in data where the ground truth is known, allowing validation of the entire RNA-seq process from alignment --> transcript deconvolution --> differential testing... doing this realistically and fairly is not trivial, but it needs to be done nonetheless.

    2. I don't think it was ever that much more difficult (though spike-ins for differential transcripts are a bit of work). It was just not a priority for manufacturers in the same way it was for Affy/Nimblegen. People bought the propaganda about the accuracy/precision of the technology and moved on to other things without taking care of the basics. We just won't get anywhere until someone returns to some reality-based checks.

    3. Ripley: How many drops is this for you, Lieutenant?
      Gorman: Thirty eight... simulated.
      Vasquez: How many *combat* drops?
      Gorman: Uh, two. Including this one.
      Drake: Shit.
      Hudson: Oh, man...

    4. I am working with datasets that are fairly well-designed for answering these kinds of questions. I would be open to some computational collaboration, too, as I'm finding it difficult/impossible to get the various tools to function without wasting too much time.

  4. As said on Twitter earlier this morning, I did the comparison presented in the DEXSeq paper for Cuffdiff2 (back in May, version 2.0.0). For C1C3 vs C2C4, I obtained 146 genes called significantly differentially expressed. For C1C4 vs C2C3, I obtained 3 DE genes.
    If I remember correctly, C1 and C3 are single-end reads and C2 and C4 are paired-end. Not sure if that plays a role in the difference here.
    For T1-T3 vs C1-C4, I have 62 DE genes.

    I don't have the values for the other comparison (T1T2 vs C2C3).

    I feel that this big issue presented in the DEXSeq paper is better treated with Cuffdiff2, although not perfectly.

    One improvement in Cuffdiff2 is also that it outputs some estimation of "read counts". Can that be used in DESeq/edgeR?

    1. Thanks Nicolas. The reason you're seeing many more in C1C3-C2C4 is probably because of the library type (I believe they account for this in the DEXSeq vignette - something that you can't do with cufflinks).

      Re the read count estimation - they do address this "alternative" workflow in suppl fig 77, but see Michele's comment above.

    2. I haven't found the time yet to redo our comparison with cuffdiff2, so thanks for that. Do you also have the values for two treatments versus two controls?

      BTW, the single-end libraries were C1 and C2, the paired-end ones C3 and C4. We paired the controls into mock groups, such that we always compared one SE and one PE library against the other SE and the other PE one. This is why we did not do C1C2 vs C3C4: seeing many hits there would have been expected, as Stephen points out.

  5. This is a longer explanation:

    If I understand these papers correctly my thoughts are that it is probably not a good idea to use counts from other sources in these programs.

    EdgeR and DESeq use a model where they separate out the shot noise (aka counting noise, sampling noise, Poisson noise) that arises from the count nature of the data from the variance introduced by other types of noise (technical variance and biological variance). Then the extra variance is modeled either as uniform (as in the case of EdgeR, where all biological/technical variances for all transcripts are set to the same overdispersion relative to Poisson) or as quasi-correlated with read depth for DESeq.

    If you use counts with complex models from other sources you are introducing another, unmodelled source of variance into the count metric that arises from the estimate of where the reads that mapped to multiple isoforms really belong. This comes from the uncertainty of whether you are allocating the read to the right transcript.

    I believe that this source of variance has to be correlated with the Poisson noise in the uniquely mapped reads no matter what approach you use, because an assignment of a read that maps to multiple transcripts can really only be based on the information held in the reads that align uniquely to a single transcript. Thus, while counts go up with these reassigned reads, the uncertainty in the measurements does not go down in the same way it would if you increased your counts through deeper sequencing. If that reallocation information is based on low numbers of uniquely aligned reads, then the estimated final counts for the transcript, including the reallocated multiply-mapped reads, will have more uncertainty. Thus, the Poisson uncertainty in the uniquely aligned reads has to somehow propagate through to the final measurements, and the statistical approaches don't expect this.

    So this may or may not screw up EdgeR and DESeq's variance calculations, and then their p-values.

    My guess would be that because DESeq assumes that there is a correlation between read depth and extra-Poisson noise it might do a little better with these reads. EdgeR assumes no correlation (I think DESeq is a really great program but agree with EdgeR on this point based on the data that I've looked at). I would guess EdgeR would probably have a bias against calling transcripts with high read counts differentially expressed because it is observing more variance than it expects in the low count transcripts.

    The p-values might still reflect the false positive rate correctly but that doesn't mean the results you get are the results you expect.

    But you'd have to try it with simulations or better yet real data to know whether they break.

    The results probably also depend on the individual experiment's true underlying distribution of variance, so there might not be a single "optimal" approach.

    Anyway, realistically, you need an awful lot of reads to call differential transcript expression which is likely to be the limiting factor, and not the statistical approach.

    That's my opinion anyway.
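The point above about unmodelled assignment variance can be made concrete with a toy simulation (my sketch, not edgeR or DESeq code; all parameters invented). We estimate the extra-Poisson dispersion phi from the NB mean-variance relation var = mu + phi*mu^2, before and after adding a probabilistic read-reallocation step:

```python
import numpy as np

rng = np.random.default_rng(1)
n_reps = 100_000
mu = 200

def dispersion(counts):
    # Method-of-moments estimate of phi in var = mu + phi * mu^2,
    # the NB mean-variance relationship count-based methods assume.
    m, v = counts.mean(), counts.var()
    return (v - m) / m**2

# "True" counts: pure Poisson shot noise, so phi should be near 0.
clean = rng.poisson(mu, n_reps)

# Now let ~30% of this transcript's reads come from a multi-mapping
# pool shared with another isoform, reallocated here with an
# uncertain (Beta-distributed) probability -- a stand-in for
# probabilistic transcript assignment.
unique = rng.poisson(mu * 0.7, n_reps)
multi = rng.poisson(mu * 0.3 / 0.5, n_reps)   # shared pool, ~half ours
assign_p = rng.beta(10, 10, n_reps)           # uncertain split around 0.5
noisy = unique + rng.binomial(multi, assign_p)

# The assignment step adds variance the count model never sees,
# so the estimated dispersion is inflated at the same mean.
assert dispersion(noisy) > dispersion(clean) + 0.001
```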

  6. Nice post. I agree with your closing thought that we are not yet in a position to fully understand differential expression at the isoform level. I have not read the paper yet, but it looks like one of its main points was to show that cuffdiff 2 performs better than raw count-based methods.

    I am wondering whether the paper compared cuffdiff2 to EM-based methods like RSEM & IsoEM, which use the known isoforms to estimate isoform/gene expression levels.

    I would think that for most RNA-seq analyses interested in known isoforms, the EM-based approaches would perform equally well or better than cuffdiff 2. Here the assumption is that cuffdiff 2 gives more weight to the data (as it infers isoforms from the data) rather than to our prior knowledge about isoforms. And this assumption may not work well for low- to average-coverage RNA-seq data.

  7. I'm not sure I would agree that the non-isoform DE problem has been usefully solved yet, given the drastically different numbers produced by the different programs, or the same program with different parameters. We've tested EdgeR, baySeq, and DegSeq, plus the normalizations in EDASeq, and quite often have seen one method give say 1E-50 while another method finds no DE. With EdgeR, the results depend sensitively on the dispersion parameter, which supposedly is being estimated from replicates; however, the sensitive dependence raises a question in my mind as to how accurate the estimation really is (or indeed whether the negative binomial model fits the data well enough to enable such accuracy). Honestly, it's not clear to me that the various software packages are accomplishing more than one could by using intuitive thresholds, e.g. 2x fold-change with at least 50 reads, things like that. The p-values produced by the methods seem close to meaningless.

    1. In my experience the programs that I have looked at have actually all performed really well in that they call a certain number of genes differentially expressed and the number of false positives is consistent with the p-value.

      In that context, I would still expect to see extreme inconsistencies between results of the different programs of the kind that you are describing. But I think this happens because RNA Seq experiments tend to be chronically and extremely underpowered (too few replicates).

      For example, in a typical experiment with 3 reps of each condition, I would expect to call only about 10% of the genes with a true 2X fold change differentially expressed. So two different programs could both be working correctly but still call very different gene sets.

      I wrote a blog post here that explains that idea in more words and pictures:

      IMHO, I don't think the DE question is something that really needs as much attention as the question of designing adequately powered experiments. That is, I think the DE question is solved-ish, at least for well-behaving libraries, because it really is just asking whether the difference in the means is bigger than the variance, and we know how to do that.

      I took a crack at answering the power question with this application:

      The paper that goes with the application explains the statistical model. That got accepted in Bioinformatics yesterday so it should be out soon.

      But I would expect in experiments where there are enough well-controlled replicates that there will be greater consistency between the different methods.

    2. Thanks for sharing a link to your RNA-seq power calculations. Definitely post a comment here when the paper is published.
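For comparison, the "intuitive thresholds" mentioned earlier in this thread (2x fold-change with at least 50 reads) take only a few lines to state precisely. This is a hypothetical sketch with made-up gene names, using a pseudocount to avoid division by zero:

```python
def passes_threshold(count_a, count_b, min_fold=2.0, min_reads=50, pseudo=1.0):
    """Intuitive DE filter: >= min_fold change in either direction,
    and at least min_reads in the higher-expressed condition."""
    fold = (count_b + pseudo) / (count_a + pseudo)
    enough = max(count_a, count_b) >= min_reads
    return enough and (fold >= min_fold or fold <= 1.0 / min_fold)

# A hypothetical count table: gene -> (condition A, condition B)
table = {"geneA": (100, 420), "geneB": (10, 22), "geneC": (300, 310)}
called = [g for g, (a, b) in table.items() if passes_threshold(a, b)]
assert called == ["geneA"]  # geneB: too few reads; geneC: change too small
```

This is roughly the baseline any statistical method has to beat; it has no notion of replicate variance, which is precisely what the NB-based methods add.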

  8. This paper was recently featured in GenomeWeb (free "premium" access with a .edu address).

  9. Check out this thread on Seqanswers discussing all the different tools for transcriptome reconstruction and differential splicing, and leave a response if you've used any.


    This long list doesn't even consider tools like DESeq and edgeR using one of the alternative workflows mentioned above.

  10. I prefer the approach of DEXseq, it seems so much more straightforward to think about these things at an exon level rather than at a full transcript level, at least until longer paired end reads are the norm.

    1. I also like the DEXSeq approach (and really like the output it generates). This also seems like what MISO is really doing, though I haven't really tried it out yet.

  11. It looks like it is totally overlooked that there are two types of uncertainty when doing DEG analysis in eukaryotes (that is, alternative splicing means there is more than one isoform/transcript per gene).

    If one uses edgeR/DESeq/... methods, then the lists of DE genes will also contain genes which actually are not differentially expressed but are differently alternatively spliced (that is, in condition 1 isoform A of gene X is expressed and in condition 2 isoform B of the same gene X is expressed, where, for example, isoform A is much shorter than isoform B). In real life, that means that if one goes to the wet lab to validate those edgeR/DESeq DEG lists using qPCR, one will be very surprised to find that they do not validate and are actually alternatively spliced genes.

    Alternative splicing events are quite common. For example, it is a known fact that the AR gene in prostate cancer expresses a different, shorter isoform when treated with anti-AR drugs! Here is more info about this:
    Z. Guo. A New Trick of an Old Molecule: Androgen Receptor Splice Variants Taking the Stage?! Int J Biol Sci. 2011; 7(6): 815–822.

    Most likely edgeR/DESeq/... and other count-based methods would find in this prostate cancer case that the AR gene is DE when in fact it is not (prostate cancer really needs the AR gene to be expressed)!
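The AR scenario can be put in toy numbers. In this hypothetical sketch (lengths and expression levels are made up, not taken from the cited paper), total molar expression is identical between conditions, yet gene-level fragment counts change 5-fold purely because of the isoform switch:

```python
# Hypothetical gene with a full-length and a truncated isoform
# (lengths in kb). Fragment counts scale with expression * length.
LEN_FULL, LEN_TRUNC = 10.0, 2.0

def gene_counts(expr_full, expr_trunc):
    return expr_full * LEN_FULL + expr_trunc * LEN_TRUNC

# Untreated: full-length isoform only. Treated: same total molar
# expression, but switched entirely to the truncated isoform.
untreated = gene_counts(5.0, 0.0)   # 5*10 + 0*2 = 50 fragments
treated = gene_counts(0.0, 5.0)     # 0*10 + 5*2 = 10 fragments

# Total expression is unchanged (the gene is still "on"), but a
# gene-level count comparison sees a 5-fold "knockdown" -- a false
# positive that per-isoform qPCR would fail to validate as gene-level DE.
assert untreated / treated == 5.0
```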



Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.