Comments on Getting Genetics Done: RNA-Seq Methods & March Twitter Roundup

count-based methods have weaker statistical models...

2012-06-29T04:51:12.101-05:00

count-based methods have weaker statistical models because they do not consider how the length of a transcript affects DE!!!

First of all I think that the comparison is slight...

2012-06-28T17:56:45.817-05:00

First of all, I think that the comparison is slightly biased because there are three versions of CuffDiff and only one version of DEXSeq (there should be 3 vs 3 versions).

Also, I am bewildered that using raw counts is considered to give more statistical power than an RPKM-like approach, which uses a better statistical model (because it also takes the gene length into consideration).
The way I see it is this:
raw counts => more statistical power but with a weak statistical model which totally ignores the gene length and the alternative splicing events
cuffdiff => weaker statistical power but with a more powerful statistical model which models the transcript length and the alternative splicing events

Packages like edgeR, DESeq, etc. are meant for finding DEGs, but genes which are alternatively spliced will be found as DE even if they are not. For example, take the same gene in two samples: in one sample only two exons are expressed, in the second sample three exons of the same gene are expressed, and the gene is expressed at the same level in both samples. This gene will be found DE by edgeR even though it is not, just because edgeR ignores the length of the transcript. This is partially fixed in DEXSeq but it still looks like a hack (we are doing RNA-seq and not gene-seq or exon-seq). We are measuring RNAs when we are doing RNA-seq! I think that for RNA-seq any statistical model should model the transcripts. CuffDiff is a step in the right direction, and so is the BitSeq package (its authors show that RPKM actually correlates better with ground truth expression than raw counts do; BitSeq works at the transcript level).
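
To make the two-exon vs three-exon example concrete, here is a minimal sketch in Python. It assumes that the expected number of reads from a transcript is proportional to its length times its copy number, and it ignores library-size normalization. The exon length, the reads-per-kilobase constant and the copy number are invented for illustration, not taken from any real dataset or from any of the packages mentioned.

```python
# Toy illustration only: all numbers below are hypothetical.
EXON_LENGTH_BP = 1000          # assume every exon has the same length
READS_PER_KB_PER_COPY = 10     # assumed sequencing yield per kb per transcript copy


def expected_reads(n_exons, copies):
    """Expected read count under a 'reads ~ length x abundance' assumption."""
    length_kb = n_exons * EXON_LENGTH_BP / 1000
    return READS_PER_KB_PER_COPY * length_kb * copies


def per_kb(reads, n_exons):
    """Length-normalized signal (reads per kilobase of expressed transcript)."""
    return reads / (n_exons * EXON_LENGTH_BP / 1000)


# Same gene, same number of transcript copies, different isoform length.
reads_a = expected_reads(n_exons=2, copies=50)   # two-exon isoform
reads_b = expected_reads(n_exons=3, copies=50)   # three-exon isoform

print("raw counts:", reads_a, reads_b)                          # counts differ
print("per kb    :", per_kb(reads_a, 2), per_kb(reads_b, 3))    # normalized values agree
```

Under these assumptions the gene-level raw counts differ by a factor of 1.5 even though the number of transcript copies is identical, while the length-normalized values are equal.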

It's interesting that Cufflinks calls so many ...

2012-06-09T14:01:12.721-05:00

It's interesting that Cufflinks calls so many hundreds of differentially expressed exons in the control-vs-control comparison, when the same version detected only 50 in the real comparison. Does this suggest that cuffdiff is somehow amplifying the noise in the absence of signal?

The fact that Cufflinks reports FPKM values does n...

2012-06-08T16:12:47.518-05:00

The fact that Cufflinks reports FPKM values does not mean that internally it masks the difference between long genes with many reads and short genes with few reads. My understanding is that Cufflinks doesn't just calculate the FPKM of each transcript, but also some measure of the uncertainty of that FPKM. So while it may be true that DEXSeq and other raw-count-based methods may have more statistical power than cuffdiff, I don't think you can automatically attribute any differences in statistical power to the use of counts vs FPKM. Or to put it another way, the transformation from counts to FPKM results in a loss of information (and a loss of statistical power), but the transformation from counts to FPKM estimate with uncertainty does not necessarily do so.
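
A small Python sketch of that last point, under a simple Poisson assumption for read counts (an assumption made here for illustration; it is not a description of what Cufflinks or DEXSeq actually does internally). The two genes and the library size are hypothetical numbers chosen so that both genes have the same FPKM.

```python
import math


def fpkm(count, length_bp, total_reads):
    """Fragments per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (length_bp * total_reads)


def poisson_relative_error(count):
    """Relative standard deviation of a Poisson count: sqrt(n) / n."""
    return 1 / math.sqrt(count)


TOTAL_READS = 10_000_000  # hypothetical library size

# Hypothetical genes: a long gene with many reads, a short gene with few.
long_gene = {"count": 1000, "length_bp": 10_000}
short_gene = {"count": 10, "length_bp": 100}

for name, gene in (("long", long_gene), ("short", short_gene)):
    value = fpkm(gene["count"], gene["length_bp"], TOTAL_READS)
    error = poisson_relative_error(gene["count"])
    print(f"{name}: FPKM = {value:.1f}, relative error = {error:.3f}")
```

Both genes get the same FPKM, but the estimate for the short gene is roughly ten times noisier. A bare FPKM value discards that difference; an FPKM reported together with its uncertainty does not.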

Mikael, Thanks for the very detailed feedback. Re...

2012-04-08T14:22:52.823-05:00

Mikael,

Thanks for the very detailed feedback. Re: your last comment - if you were to implement such a testing framework I think you'd have no problem whatsoever getting this published. I think many in the field are in the same boat. It is similar to the early days of microarrays, with lots of tools but few standards: something like this, to be able to benchmark and compare tools to each other in an unbiased fashion, is sorely needed!

I work as a bioinformatician in a genomics core an...

2012-04-08T09:01:15.606-05:00

I work as a bioinformatician in a genomics core and I agree that it is somewhat terrifying to see results diverge so strongly not only between different software packages, but also between different releases of the same software (or in some cases, different runs of the same release). In many cases, there are good reasons for the discrepancies, but that doesn't help us core bioinformaticians who are more interested than most in providing reliable, reproducible results.

Nowadays, we tend to use baySeq or edgeR for RNA-seq DE analysis, mostly because these allow for flexible specification of the experimental design - you can set up things like biological and technical replicates, experimental batches, treatments etc. as separate factors in the model. Perhaps DESeq allows for this as well nowadays (it used to be that it didn't). We do not use Cuffdiff, both because we have seen several comparisons like the one you posted and because it's unclear how it tests for differential expression, while edgeR, baySeq and DESeq are well documented. Also, these three are count-based methods, which should give more statistical power, as the FPKM (or RPKM) measure masks the difference between long genes with many mapped reads and short genes with few mapped reads. Presumably Cuffdiff works on FPKM. I seem to recall having heard that it uses the counts as well somehow in the testing, but again, it's not documented how this is done.

By the way, you don't *have* to go through the assembly steps to use Cuffdiff, if you run Cufflinks with the -G flag to quantify using an annotation GTF file. Or did I misunderstand your post?

A slightly different issue is that different packages may be good at different kinds of experiments, e.g. the relatively new SAMSeq (a nonparametric method) seems to perform really well when you have many biological replicates, and it's possible that certain methods are better for lower/higher sequencing depth, short RNAs, etc.

In view of all these things, I am thinking of setting up some kind of automated testing system for RNA-seq DE analysis tools. You would be able to submit actual or synthetic data and get them analyzed by all available software packages and versions, and get information about the consistency between the output of different tools. Eventually, given statistics about many projects, this would also give some hints about which packages are better for which types of experiments. What do you think?
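
As one possible way to report "consistency between the output of different tools", here is a minimal Python sketch that compares the sets of genes each tool calls DE using the Jaccard index. The choice of Jaccard overlap is an assumption made for illustration, and the tool names and gene IDs are hypothetical placeholders, not output from any real package.

```python
from itertools import combinations


def jaccard(calls_a, calls_b):
    """Size of the intersection divided by size of the union of two DE gene sets."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    # Two empty call sets are treated as perfectly consistent.
    return len(a & b) / len(union) if union else 1.0


def pairwise_consistency(calls_by_tool):
    """Jaccard index for every unordered pair of tools."""
    return {
        (tool_1, tool_2): jaccard(calls_by_tool[tool_1], calls_by_tool[tool_2])
        for tool_1, tool_2 in combinations(sorted(calls_by_tool), 2)
    }


# Hypothetical DE calls from three hypothetical tools on the same dataset.
de_calls = {
    "tool_a": {"gene1", "gene2", "gene3"},
    "tool_b": {"gene2", "gene3", "gene4"},
    "tool_c": {"gene3"},
}

for pair, score in pairwise_consistency(de_calls).items():
    print(pair, round(score, 3))
```

A score of 1 means two tools call exactly the same genes and 0 means they share none. Collected over many submitted datasets, a table like this could hint at which tools tend to agree for which kinds of experiments.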