Tuesday, September 18, 2012

DESeq vs edgeR Comparison

Update (Dec 18, 2012): Please see this related post I wrote about differential isoform expression analysis with Cuffdiff 2.

DESeq and edgeR are two methods and R packages for analyzing quantitative readouts (in the form of counts) from high-throughput experiments such as RNA-seq or ChIP-seq. After alignment, reads are assigned to a feature, where each feature represents a target transcript, in the case of RNA-Seq, or a binding region, in the case of ChIP-Seq. An important summary statistic is the count of the number of reads in a feature (for RNA-Seq, this read count is a good approximation of transcript abundance).

Methods used to analyze array-based data assume a normally distributed, continuous response variable. However, response variables for digital methods like RNA-seq and ChIP-seq are discrete counts. Thus, both DESeq and edgeR methods are based on the negative binomial distribution.

I see these two tools often used interchangeably, and I wanted to take a look at how they stack up to one another in terms of performance, ease of use, and speed. This isn't meant to be a comprehensive evaluation or "bake-off" between the two methods. This would require complex simulations, parameter sweeps, and evaluation with multiple well-characterized real RNA-seq datasets. Further, this is only a start - a full evaluation would need to be much more comprehensive.

Here, I used the newest versions of both edgeR and DESeq, using the well-characterized Pasilla dataset, available in the pasilla Bioconductor package. The dataset is from an experiment in Drosophila investigating the effect of RNAi knockdown of the splicing factor, pasilla. I used the GLM functionality of both packages, as recommended by the vignettes, for dealing with a multifactorial experiment (condition: treated vs. untreated; library type: single-end and paired-end).



Both packages provide built-in functions for assessing overall similarity between samples using either PCA (DESeq) or MDS (edgeR), although these methods operate on the same underlying data and could easily be switched.

PCA plot on variance stabilized data from DESeq:

MDS plot from edgeR:


Per gene dispersion estimates from DESeq:

Biological coefficient of variation versus abundance (edgeR):


Now, let's see how many statistically significant (FDR<0.05) results each method returns:



In this simple example, DESeq finds 820 genes significantly differentially expressed at FDR<0.05, while edgeR is finds these 820 and an additional 371. Let's take a look at the detected fold changes from both methods:

Here, if genes were found differentially expressed by edgeR only, they're colored red; if found by both, colored green. What's striking here is that for a handful of genes, DESeq is (1) reporting massive fold changes, and (2) not calling them statistically significant. What's going on here?

It turns out that these genes have extremely low counts (usually one or two counts in only one or two samples). The DESeq vignette goes through the logic of independent filtering, showing that the likelihood of a gene being significantly differentially expressed is related to how strongly it's expressed, and advocates for discarding extremely lowly expressed genes, because differential expression is likely not statistically detectable.

Count-based filtering can be achieved two ways. The DESeq vignette demonstrates how to filter based on quantiles, while I used the filtering method demonstrated in the edgeR vignette - removing genes without at least 2 counts per million in at least two samples. This filtering code is commented out above - uncomment to filter.

After filtering, all of the genes shown above with apparently large fold changes as detected by DESeq are removed prior to filtering, and the fold changes correlate much better between the two methods. edgeR still detects ~50% more differentially expressed genes, and it's unclear to me (1) why this is the case, and (2) if this is necessarily a good thing.


Conclusions:

Unfortunately, I may have oversold the title here - this is such a cursory comparison of the two methods that I would hesitate to draw any conclusions about which method is better than the other. In addition to finding more significantly differentially expressed genes (again, not necessarily a good thing), I can say that edgeR was much faster than DESeq for fitting GLM models, but it took slightly longer to estimate the dispersion. Further without any independent filtering, edgeR gave me moderated fold changes for the extremely lowly expressed genes for which DESeq returned logFCs in the 20-30 range (but these transcripts were so lowly expressed anyway, they should have been filtered out before any evaluation).

If there's one thing that will make me use edgeR over DESeq (until I have time to do a more thorough evaluation), it's the fact that using edgeR seems much more natural than DESeq, especially if you're familiar with the limma package (pretty much the standard for analyzing microarray data and other continuously distributed gene expression data). Setting up the design matrix and specifying contrasts feels natural if you're familiar with using limma. Further, the edgeR user guide weighs in at 67 pages, filled with many case studies that will help you in putting together a design matrix for nearly any experimental design: paired designs, time courses, batch effects, interactions, etc. The DESeq documentation is still fantastic, but could benefit from a few more case studies / examples.

What do you think? Anyone want to fork my R code and help do this comparison more comprehensively (more examples, simulated data, speed benchmarking)? Is the analysis above fair? What do you find more easy to use, or is ease-of-use (and thus, reproducibility) even important when considering data analysis?

20 comments:

  1. Following up this post - I really liked the analysis done in the DEXSeq supplemental comparing DEXSeq to cufflinks 1.1, 1.2, 1.3, etc. (DEXSeq is developed by some of the same folks as DESeq). Here the authors did a "proper" and "mock" comparison - where the mock comparison was between multiple replicates of the same control group. Ideally, a perfect method would find zero differentially expressed genes. I would love to see this kind of analysis done for newer versions of cufflinks (2.x), side-by-side with DESeq, edgeR, DEXSeq, and possibly others.

    ReplyDelete
  2. Some of the functions aren't actually included in DESeq that you're using.
    > plotDispEsts(d, main="DESeq: Per-gene dispersion estimates")
    Error: could not find function "plotDispEsts"

    >print(plotPCA(varianceStabilizingTransformation(d), intgroup=c("condition", "libType")))
    Error in print(plotPCA(varianceStabilizingTransformation(d), intgroup = c("condition", :
    could not find function "plotPCA"

    I'm getting a lot of other errors too.

    > efit <- glmLRT(efit, coef="conditiontreated")
    Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), :
    dims [product 0] do not match the length of object [13]

    > etable <- topTags(efit, n=nrow(e))$table
    Error in abs(object$table$logFC) :
    Non-numeric argument to mathematical function

    ReplyDelete
    Replies
    1. I suppose I should have mentioned that I'm using the devel versions of both packages. I updated the R code with a comment to show this as well as my sessionInfo() output.

      Delete
  3. Great post. Thanks for the code.

    It is a little odd that edgeR called every since DESeq gene. Usually there are some/many genes that are mutually exclusive (see the very recent "The bench scientist's guide to statistical analysis of RNA-Seq data" http://www.biomedcentral.com/content/pdf/1756-0500-5-506.pdf)

    I find DESeq and edgeR equally easy to use but I also find equally difficult to comprehend. I'm not sure either has really distinguished itself. I have read Cuffdiff returns a radically different set of genes (or transcripts).

    I think most people will just choose whatever package and parameters returns the best results for them using a faith-based methodology (i.e. the genes they know in their heart are differentially expressed)

    ReplyDelete
    Replies
    1. I've heard (and seen) the same with Cufflinks (and between different versions of Cufflinks, no less). It gets to a larger reproducibility issue that plagues bioinformatics. With microarrays the community had good methods for benchmarking (spike-ins). But it's difficult to benchmark the entire RNA-seq process, and the fact that there fundamentally different ways of analyzing the data (assembly, feature counting, etc) make it more difficult.

      C Titus Brown recently blogged about 'anecdotal science', and as you mentioned, this is certainly how most folks are doing RNA-seq. As a service provider it's important to have methods and workflows that are both reproducible and thoroughly vetted by the community. We're just not there yet with RNA-seq.

      Delete
  4. There have been several more formal comparisons of DESeq and EdgeR in the literature. DESeq almost always seems to be the more conservative of the two in every comparison. I tend to see this as a good thing. I would rather live with more false negatives than false positives.

    But in comparison to Cufflinks-CuffDiff, they are far more alike than different. The amount of variation from one version of Cufflinks to the next is rather disturbing. Furthermore, if you dig into the descriptions of CuffDiff, the authors actually say that they base their models on DESeq.....I think the problems with Cufflinks-CuffDiff have more to do with how they calculate the FPKM, an approach I have always found questionable and so have avoided. In contrast, while the authors of DESeq continue to add and improve it (and change the defaults) they maintain the original methods as options, whereas with Cufflinks, you have to go back to older versions to replicate results.

    Sorry about the ranting more against Cufflinks than talking about DESeq and EdgeR. I just have developed a profound distrust for that program (love Bowtie and Tophat) but love DESeq....and EdgeR to a lesser extent ;)

    ReplyDelete
  5. Spike-ins....I recall a few papers earlier on doing this in RNA-seq, but it seems to be kind of buried and unused now. Frankly, that is one thing I wish I had done earlier on (will from now on) as spike-ins really would help assess results.

    But then you still have RNA-seq papers regularly being published with no biological replicates, along with development of methods of assessing DE without replicates. This simply shocks me.

    ReplyDelete
  6. Even more comprehensive comparison:

    http://bib.oxfordjournals.org/content/early/2012/09/15/bib.bbs046.long

    ReplyDelete
  7. This might be the wrong place to ask, but does anyone know how to get the genes that contribute most to the axes in the MDS plot in edgeR?

    ReplyDelete
    Replies
    1. Seth - might be better to ask this on SeqAnswers, but this might help. I wrote some code a while back to get the loadings from a principal components analysis on Affy data (square of the PCA loadings is proportional to the contribution of that gene to the component). Here's the code: https://gist.github.com/3986752

      Delete
    2. Thanks for the suggestion. I also posted my question at SeqAnswers. In the meantime I'll work with PCA code you supplied. Many thanks.

      S

      Delete
  8. Has anyone used or tested BitSeq?

    ReplyDelete
  9. Majority of genes are alternatively spliced therefore one gene expresses several isoforms at once.
    See figure http://massgenomics.org/wp-content/uploads/2012/10/gene-number-isoforms.jpg from paper: Gene isoforms expressed by the number annotated, ENCODE Nature 2012, Fig 4A!

    Majority of genes found as differentially expressed by DESeq and edgeR are actually alternatively spliced differentially! Please, note that a differentially expression is different than differentially expressed.

    CuffDiff and BitSeq takes alternatively splicing into consideration!

    ReplyDelete
  10. Regarding the PCA plot on variance stabilized data from DESeq, would it be possible to add the sample names on the graph??

    If it were a another PCA generated by (affycoretools), i would use the text function, example " text( pca$x[,1], pca$x[,2], colnames(d), pos= 2 ) "

    But I cant figure out a way to display sample names in the PCA plot of DESeq.... any ideas?

    Many Thanks
    Zaki

    ReplyDelete
    Replies
    1. I think I found the solution :)

      I just added a few more columns in the metadata and I can manually curate the individual samples from the PCA plot :)...

      What might be interesting (and I haven't figure this out) is... would it be possible to view the PCA from a different dimension?? Eg PC2 & PC3?

      Best
      Zaki

      Delete
    2. Sorry for the spam... but i think I found the solution to my question, which is how to label the sample names as well as viewing different PCA dimension, so just to share

      I was fortunate to find the R code for the plotPCA funtion in DESeq (http://www.bioconductor.org/help/course-materials/2012/BioC2012/DESeq.R)

      plotPCA = function(x, intgroup, ntop=500)
      {
      require("lattice")
      require("genefilter")
      rv = rowVars(exprs(x))
      select = order(rv, decreasing=TRUE)[seq_len(ntop)]
      pca = prcomp(t(exprs(x)[select,]))

      fac = factor(apply(pData(vsdFull)[, intgroup], 1, paste, collapse=" : "))
      colours = brewer.pal(nlevels(fac), "Paired")

      pcafig = xyplot(PC2 ~ PC1, groups=fac, data=as.data.frame(pca$x), pch=16, cex=2,
      aspect = "iso", col=colours,
      main = draw.key(key = list(
      rect = list(col = colours),
      text = list(levels(fac)),
      rep = FALSE)))
      }


      I just modified it into this (after reading a bit about labeling xyplots from this thread (http://stackoverflow.com/questions/6606902/add-labels-on-lattice-xyplot)

      plotPCA = function(x, intgroup, ntop=500)
      {
      require("lattice")
      require("genefilter")
      rv = rowVars(exprs(x))
      select = order(rv, decreasing=TRUE)[seq_len(ntop)]
      pca = prcomp(t(exprs(x)[select,]))
      names= row.names(pData(d)) #added_row
      fac = factor(apply(pData(vsdFull)[, intgroup], 1, paste, collapse=" : "))
      colours = brewer.pal(nlevels(fac), "Paired")

      pcafig = xyplot(PC2 ~ PC1, groups=fac, data=as.data.frame(pca$x), pch=16, cex=2,
      panel=function(x, y, ...) {
      panel.xyplot(x, y, ...);
      ltext(x=x, y=y, labels=names, pos=1, offset=1, cex=0.8)
      } #added_row
      ,
      aspect = "iso", col=colours,
      main = draw.key(key = list(
      rect = list(col = colours),
      text = list(levels(fac)),
      rep = FALSE)))
      }


      Hope it would help anyone with similar questions

      Zaki

      Delete
    3. Zaki: thanks for sharing what you've learned!

      Delete
    4. Hi Zaki,

      I had a similar problem and found tour answer very interesting.

      I don't know how to modify the R code (and if it really is allowed...) so I thought that I could create another function with your code at the beginning of my script. But this failed, I had this error message: "Error in is.data.frame(x) : could not find function "exprs"".

      The only way I found is copying a lot of lines from the page where you found the code of the package. But I am currently using 'DESeq2' so I'm not sure it is a good idea to mix it up.

      Any idea?

      Cheers,
      Mathieu

      Delete
  11. Hello Stephen Turner
    It is very nice posting!! great to learn!
    I have one qustion regarding to the last part to compare the result from EdgeR and DESeq.

    Since, you sort the etable and dtable based on adjusted pvalue. the row order should be different.
    But if you just make table with this command,

    addmargins(table(sig_edgeR=etable$FDR<0.05, sig_DESeq=dtable$padj<0.05))

    do you think it actually find the overlapping one correctly?

    When I tested with this one, I actually compare the overlapped one from EdgeR and DESeq.
    It seems that the sig.genes are different from each other. What I mean is that

    Suppose I have

    > addmargins(table(sig_edgeR=etable$FDR<0.05, sig_DESeq=dtable$padj<0.05))
    sig_DESeq
    sig_edgeR FALSE TRUE Sum
    FALSE 100 0 100
    TRUE 5 6 11
    Sum 105 6 111

    When I looked at those 6 overalpped one, I expect all DEseq 6 significant genes are overlapped with edgeR.
    But it is not!! .Do you have any idea?? DO I miss something?

    THanks in advance

    ReplyDelete

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.