Wednesday, April 18, 2012

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

I was reading through a paper on comparative ChIP-Seq when I found this awk gem that lets you get some very basic stats on next-generation sequencing reads very quickly. To use it, simply cat the FASTQ file (or gunzip -c it if it's compressed) and pipe that to the awk command below:


cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
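
If your reads are gzipped, the same awk program works on a gunzip -c (or zcat) stream; the filename below is just a placeholder:

gunzip -c myfile.fq.gz | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'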

The output would look something like this for some RNA-seq data downloaded from the Galaxy RNA-seq tutorial:

99115 60567 61.1078 ACCTCAGGA 354 0.357161

This is telling you:
  1. The total number of reads (99,115).
  2. The number of unique reads, i.e. sequences seen exactly once (60,567).
  3. The percentage of reads that are unique (61%).
  4. The most abundant sequence (here, ACCTCAGGA; useful for spotting adapters, linkers, etc.).
  5. The number of times that sequence is present (354).
  6. The percentage of all reads accounted for by that sequence (0.36%).
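
If the one-liner is hard to read, here is the same logic written out as a commented awk script. This is just a more readable sketch of the command above; save it under any name you like (fqstats.awk is a made-up example) and run it with awk -f fqstats.awk myfile.fq:

# fqstats.awk: total, unique, and most abundant read in a FASTQ file
# Sequence lines in 4-line FASTQ records are lines 2, 6, 10, ..., i.e. (NR-2)%4==0
((NR-2)%4==0) {
    read = $1        # the sequence itself
    total++          # running count of all reads
    count[read]++    # tally per distinct sequence
}
END {
    for (read in count) {
        # track the most abundant sequence
        if (!max || count[read] > max) { max = count[read]; maxRead = read }
        # "unique" reads are sequences seen exactly once
        if (count[read] == 1) { unique++ }
    }
    print total, unique, unique*100/total, maxRead, count[maxRead], count[maxRead]*100/total
}
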
If you have a handful of FASTQ files in a directory and you'd like to do this for each of them, you can wrap this in a bash for loop:

for read in *.fq; do echo -n "$read "; awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' "$read"; done

This does the same thing, but adds an extra field at the beginning for the file name. I haven't yet figured out how to wrap this into GNU parallel, but the for loop should do the trick for multiple files.

Check out FastQC for more extensive quality assessment.

10 comments:

  1. Put the awk code into my_awkscript and do:

    parallel --tag ./my_awkscript ::: *fq
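
    A minimal my_awkscript along these lines should work (a sketch: it just wraps the awk one-liner from the post and reads the FASTQ file passed as its first argument; make it executable with chmod +x my_awkscript first):

    #!/bin/sh
    # my_awkscript: FASTQ read stats for the file given as $1
    awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max){max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' "$1"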

  2. some good stuff there. thanks for sharing.

    you should check out bioawk: https://github.com/lh3/bioawk
    some usage examples: https://github.com/lh3/bioawk/blob/master/README.bio
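
    For instance, bioawk's fastx mode parses each FASTQ record into named fields like $name and $seq, so a rough equivalent of the stats above might look like this (a sketch, assuming bioawk is installed; it also reads gzipped files directly):

    bioawk -c fastx '{total++;count[$seq]++}END{for(read in count){if(!max||count[read]>max){max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' myfile.fq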

  3. This comment has been removed by the author.

  4. I know Simon Andrews at Babraham. I can introduce you at ISMB

  5. Great, looking forward to it. Is that you, David Sexton?

  6. Thanks a lot for sharing this.

  7. Hi Stephen,
    It would be really great if a similar awk/shell script could get statistics on SAM/BAM files, which would allow users to compare these statistics directly before and after mapping.

    BTW, I was impressed by your blog. It is a nice resource.
    Thanks !


    Replies
    1. Thanks for the comments! SAMStat is a decent tool for this, like FastQC but for mapping stats.
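
      For a quick-and-dirty shell version of the same sequence-level stats on a BAM, one option (a sketch, assuming samtools is installed) is to pipe samtools view into the same kind of awk program, using SAM column 10, the read sequence. Note that reverse-strand alignments store the reverse complement of the read, so the counts won't match the FASTQ exactly:

      samtools view myfile.bam | awk '{read=$10;total++;count[read]++}END{for(read in count){if(!max||count[read]>max){max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'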

  8. What a cool trick! Thank you very much.

  9. How can I print the same stats for the top 100 candidate sequences in a given FASTQ file?
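
    One way to do this (a sketch; myfile.fq is a placeholder) is to tally each sequence as above, print a count and the sequence for every distinct read, then sort numerically and keep the top 100:

    awk '((NR-2)%4==0){count[$1]++}END{for(read in count){print count[read], read}}' myfile.fq | sort -k1,1nr | head -100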



Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.