Wednesday, April 18, 2012

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

I was reading through a paper on comparative ChIP-Seq when I found this awk gem that lets you get some very basic stats on next-generation sequencing reads very quickly. To use it, simply cat the FASTQ file (or gunzip -c it if it's compressed) and pipe that to the awk command below:


cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
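
If your reads are gzipped, the same awk program works on a gunzip -c (or zcat) stream; the filename below is just a placeholder:

gunzip -c myfile.fq.gz | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'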

The output would look something like this for some RNA-seq data downloaded from the Galaxy RNA-seq tutorial:

99115 60567 61.1078 ACCTCAGGA 354 0.357161

This is telling you:
  1. The total number of reads (99,115).
  2. The number of unique reads, i.e. sequences seen exactly once (60,567).
  3. The percentage of reads that are unique (61%).
  4. The most abundant sequence (here, ACCTCAGGA; useful for spotting adapters, linkers, etc.).
  5. The number of times that sequence is present (354).
  6. The percentage of all reads accounted for by that sequence (0.36%).
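
If the one-liner is hard to read, here is the same logic written out as a commented awk script. This is just a more readable sketch of the command above; save it under any name you like (fqstats.awk is a made-up example) and run it with awk -f fqstats.awk myfile.fq:

# fqstats.awk: total, unique, and most abundant read in a FASTQ file
# Sequence lines in 4-line FASTQ records are lines 2, 6, 10, ..., i.e. (NR-2)%4==0
((NR-2)%4==0) {
    read = $1        # the sequence itself
    total++          # running count of all reads
    count[read]++    # tally per distinct sequence
}
END {
    for (read in count) {
        # track the most abundant sequence
        if (!max || count[read] > max) { max = count[read]; maxRead = read }
        # "unique" reads are sequences seen exactly once
        if (count[read] == 1) { unique++ }
    }
    print total, unique, unique*100/total, maxRead, count[maxRead], count[maxRead]*100/total
}
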
If you have a handful of FASTQ files in a directory and you'd like to do this for each of them, you can wrap this in a bash for loop:

for read in *.fq; do echo -n "$read "; awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' "$read"; done

This does the same thing, but adds an extra field at the beginning for the file name. I haven't yet figured out how to wrap this into GNU parallel, but the for loop should do the trick for multiple files.

Check out FastQC for more extensive quality assessment.

10 comments:

  1. Put the awk code into my_awkscript and do:

    parallel --tag ./my_awkscript ::: *fq
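
    A minimal my_awkscript along these lines should work (a sketch: it just wraps the awk one-liner from the post and reads the FASTQ file passed as its first argument; make it executable with chmod +x my_awkscript first):

    #!/bin/sh
    # my_awkscript: FASTQ read stats for the file given as $1
    awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max){max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' "$1"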

  2. some good stuff there. thanks for sharing.

    you should check out bioawk: https://github.com/lh3/bioawk
    some usage examples: https://github.com/lh3/bioawk/blob/master/README.bio
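
    For instance, bioawk's fastx mode parses each FASTQ record into named fields like $name and $seq, so a rough equivalent of the stats above might look like this (a sketch, assuming bioawk is installed; it also reads gzipped files directly):

    bioawk -c fastx '{total++;count[$seq]++}END{for(read in count){if(!max||count[read]>max){max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' myfile.fq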

  3. This comment has been removed by the author.

  4. I know Simon Andrews at Babraham. I can introduce you at ISMB

  5. Great, looking forward to it. Is that you, David Sexton?

  6. Thanks a lot for sharing this.

  7. Hi Stephen,
    It would be really great if a similar awk/shell script could get statistics on SAM/BAM files, which would allow users to compare these statistics directly before and after mapping.

    BTW, I was impressed by your blog. It is a nice resource.
    Thanks !


    Replies
    1. Thanks for the comments! SAMStat is a decent tool for this, like FastQC but for mapping stats.
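
      For a quick-and-dirty shell version of the same sequence-level stats on a BAM, one option (a sketch, assuming samtools is installed) is to pipe samtools view into the same kind of awk program, using SAM column 10, the read sequence. Note that reverse-strand alignments store the reverse complement of the read, so the counts won't match the FASTQ exactly:

      samtools view myfile.bam | awk '{read=$10;total++;count[read]++}END{for(read in count){if(!max||count[read]>max){max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'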

  8. What a cool trick! Thank you very much.

  9. How can I print the same stats for the top 100 candidate sequences in a given FASTQ file?
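
    One way to do this (a sketch; myfile.fq is a placeholder) is to tally each sequence as above, print a count and the sequence for every distinct read, then sort numerically and keep the top 100:

    awk '((NR-2)%4==0){count[$1]++}END{for(read in count){print count[read], read}}' myfile.fq | sort -k1,1nr | head -100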



Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.