cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
The output would look something like this for some RNA-seq data downloaded from the Galaxy RNA-seq tutorial:
99115 60567 61.1078 ACCTCAGGA 354 0.357161
This is telling you:
- The total number of reads (99,115).
- The number of unique reads (60,567).
- The frequency of unique reads as a proportion of the total (61%).
- The most abundant sequence (useful for finding adapters, linkers, etc).
- The number of times that sequence is present (354).
- The frequency of that sequence as a proportion of the total number of reads (0.35%).
If you have a handful of fastq files in a directory and you'd like to do this for each of them, you can wrap this in a for loop in bash:
for read in `ls *.fq`; do echo -n "$read "; awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' $read; done
This does the same thing, but adds an extra field at the beginning for the file name. I haven't yet figured out how to wrap this into GNU parallel, but the for loop should do the trick for multiple files.
Check out FASTQC for more extensive quality assessment.
Put the awk code into my_awkscript and do:
ReplyDeleteparallel --tag ./my_awkscript ::: *fq
some good stuff there. thanks for sharing.
ReplyDeleteyou should check out bioawk: https://github.com/lh3/bioawk
some usage examples: https://github.com/lh3/bioawk/blob/master/README.bio
This comment has been removed by the author.
ReplyDeleteI know Simon Andrews at Babraham. I can introduce you at ISMB
ReplyDeleteGreat, looking forward to it. Is that you, David Sexton?
ReplyDeleteThanks a lot for sharing this.
ReplyDeleteHi Stephen,
ReplyDeleteIt would be really great if similar awk/shell script could get statistics on SAM/BAM, which would allow users to compare these statistics directly before and after mapping.
BTW, I was impressed by your blog. It is a nice resource.
Thanks !
Thanks for the comments! SAMStat is a decent tool for this, like FASTQC but for mapping stats
DeleteWhat a cool trick! Thank you very much.
ReplyDeletehow can print the same stats for top 100 candidate sequences in a given fastq file?
ReplyDelete