Wednesday, April 18, 2012

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

I was reading through a paper on comparative ChIP-Seq when I found this awk gem that lets you get some very basic stats very quickly on next generation sequencing reads. To use, simply cat the fastq file (or gunzip -c) and pipe that to this awk command:


cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'

The output would look something like this for some RNA-seq data downloaded from the Galaxy RNA-seq tutorial:

99115 60567 61.1078 ACCTCAGGA 354 0.357161

This is telling you:
  1. The total number of reads (99,115).
  2. The number of unique reads (60,567).
  3. The frequency of unique reads as a proportion of the total (61%).
  4. The most abundant sequence (useful for finding adapters, linkers, etc).
  5. The number of times that sequence is present (354).
  6. The frequency of that sequence as a proportion of the total number of reads (0.35%).
If you have a handful of fastq files in a directory and you'd like to do this for each of them, you can wrap this in a for loop in bash:

for read in `ls *.fq`; do echo -n "$read "; awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' $read; done

This does the same thing, but adds an extra field at the beginning for the file name. I haven't yet figured out how to wrap this into GNU parallel, but the for loop should do the trick for multiple files.

Check out FASTQC for more extensive quality assessment.

9 comments:

  1. Put the awk code into my_awkscript and do:

    parallel --tag ./my_awkscript ::: *fq

    ReplyDelete
  2. some good stuff there. thanks for sharing.

    you should check out bioawk: https://github.com/lh3/bioawk
    some usage examples: https://github.com/lh3/bioawk/blob/master/README.bio

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. I know Simon Andrews at Babraham. I can introduce you at ISMB

    ReplyDelete
  5. Great, looking forward to it. Is that you, David Sexton?

    ReplyDelete
  6. Hi Stephen,
    It would be really great if similar awk/shell script could get statistics on SAM/BAM, which would allow users to compare these statistics directly before and after mapping.

    BTW, I was impressed by your blog. It is a nice resource.
    Thanks !


    ReplyDelete
    Replies
    1. Thanks for the comments! SAMStat is a decent tool for this, like FASTQC but for mapping stats

      Delete
  7. What a cool trick! Thank you very much.

    ReplyDelete

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.