Wednesday, April 18, 2012

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

I was reading through a paper on comparative ChIP-Seq when I found this awk gem that lets you get some very basic stats very quickly on next generation sequencing reads. To use, simply cat the fastq file (or gunzip -c) and pipe that to this awk command:


cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'

The output would look something like this for some RNA-seq data downloaded from the Galaxy RNA-seq tutorial:

99115 60567 61.1078 ACCTCAGGA 354 0.357161

This is telling you:
  1. The total number of reads (99,115).
  2. The number of unique reads (60,567).
  3. The frequency of unique reads as a proportion of the total (61%).
  4. The most abundant sequence (useful for finding adapters, linkers, etc).
  5. The number of times that sequence is present (354).
  6. The frequency of that sequence as a proportion of the total number of reads (0.35%).
If you have a handful of fastq files in a directory and you'd like to do this for each of them, you can wrap this in a for loop in bash:

for read in `ls *.fq`; do echo -n "$read "; awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' $read; done

This does the same thing, but adds an extra field at the beginning for the file name. I haven't yet figured out how to wrap this into GNU parallel, but the for loop should do the trick for multiple files.

Check out FASTQC for more extensive quality assessment.

Friday, April 6, 2012

RNA-Seq Methods & March Twitter Roundup

There were lots of interesting developments this month that didn't work their way into a full blog post. Here is an incomplete list of what I've been tweeting about over the last few weeks. But first I want to draw your attention to the latest manuscript for a new bioconductor package for doing RNA-seq in R.

DEXSeq vs Cuffdiff. See the pre-publication manuscript from Simon Anders, Alejandro Reyes, and Wolfgang Huber: "Detecting differential usage of exons from RNA-Seq data.DEXSeq is an R package by the same guys who developed the DESeq R package and the HTSeq python scripts. (Incidentally, both DESeq and DEXSeq are rare examples of bioconductor vignettes which are well developed and are a pleasure to read). I often use cufflinks/cuffdiff in the bioinformatics core was because many other tools and methods only allow you to interrogate differential expression at the gene level. Using cufflinks for transcriptome assembly enables you to interrogate transcript/isoform expression, differential splicing, differential coding output, differential promoter usage, etc. DEXSeq uses similar methodology as DESeq, but can give you exon-level differential expression, without going through all the assembly business that cufflinks does. In one of the supplementary tables in their pre-pub manuscript, they compare several versions of cuffdiff to DEXSeq on two datasets. Both of these datasets had biological replicates for treatment and control conditions. They compared treatment to controls, and found DEXseq gave you more significant hits than cuffdiff. Then they compared controls to other controls (ideally should have zero hits) and found cufflinks had way more hits. See p13, p23, tables S1 and S2.

Proper comparison treatment vs control, # significant hits:
DEXSeq: 159
Cuffdiff 1.1: 145
Cuffdiff 1.2: 69
Cuffdiff 1.3: 50

Mock comparison controls vs controls, # significant hits:
DEXSeq: 8
Cuffdiff 1.1: 314
Cuffdiff 1.2: 650
Cuffdiff 1.3: 639

In the UVA Bioinformatics core we strive for reproducibility, scalability, and transparency using the most robust tools and methodology available. It gives me pause to see such alarmingly different results with each new version and each new protocol of a particular tool. What are your thoughts and experiences with using Cufflinks/Cuffdiff, DESeq/DEXSeq, or the many, many other tools for RNA-Seq (MISO, ExpressionPlot, EdgeR, RSEM, easyRNASeq, etc.)? Please share in the comments.

Everything else:

Webinar from @goldenhelixinc: Learning From Our GWAS Mistakes: From experimental design to scientific method https://t.co/KkxAn18p


Sequencing technology does not eliminate biological variability http://t.co/NI3acZgn


[bump] Questions on cutoff setting of FPKM value & know genes filtering in Cuffmerge result http://t.co/iKMZ7Dsd #bioinformatics



Very cool: DNAse-Seq+RNA-seq used to show DNaseI sensitivity eQTLs are a major determinant of gene expression variation http://t.co/nPo3xHVa

Beware using UCSC GTFs in HTSeq/CovergeBed for counting RNA-seq reads. "transcript_id" is repeated as "gene_id"! https://t.co/ADg1Pi6U

Google Scholar Metrics: top 20 journals in bioinformatics http://t.co/QYTf5pyT

A systematic eQTL study of cis-trans #epistasis in 210 HapMap individuals http://t.co/0YHlFQak

Identification of allele-specific alternative mRNA processing via RNA-seq http://t.co/fig9cLlH #bioinformatics @myen

prepub on arXiv + analysis tutorial/walkthrough + AWS EC2 AMI + git repo + ipython notebook = reproducible research done right http://t.co/GPNmpdJD

Altmetrics in the Wild: Using Social Media to Explore Scholarly Impact http://t.co/8xZuIHw9

NSF-NIH Interagency Initiative: Core Techniques and Technologies for Advancing Big Data Science and Engineering http://t.co/W3LUdCsG

New approach from @MarylynRitchie lab to collapsing/combining: using biological pathways rather than positional info http://t.co/ywWj0MNn

The Transcription Factor Encyclopedia http://t.co/DjJUwR10 Paper: http://t.co/b0J8PXO6

NF-kB: where did it come from and why? http://t.co/NLV1mBd0

Cloud BioLinux: pre-configured and on-demand #bioinformatics computing for the genomics community. @myen http://t.co/3kCE0ktH

SCOTUS remands AMP v Myriad (BRCA) patent case to CAFC to consider in light of prometheus decision http://t.co/7CkTa4l0

57 year experiment, Drosophila kept in dark for 1400 generations, many evolutionary changes (record longest postdoc!) http://t.co/wukq8fAf

IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly http://t.co/FN1sYM8f #bioinformatics

Complex disease genetics is complex. Imagine that. Hirschhorn, Visscher, & the usual consortium suspects: http://t.co/Bwopxlx6

MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets http://t.co/pffuHIlO #bioinformatics

Paper about SEQanswers forum published in #Bioinformatics http://t.co/SUlQ6O8c

num-utils - like awk, grep, sort, cut, etc for numbers http://t.co/bgkB5FMV

Nat Protocols: Differential gene & transcript expression analysis of RNA-seq w/ TopHat & Cufflinks http://t.co/U1ZpSE7V #bioinformatics


Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.