Getting Genetics Done: Useful Unix/Linux One-Liners for Bioinformatics

Monday, October 21, 2013

Useful Unix/Linux One-Liners for Bioinformatics

Much of the work that bioinformaticians do is munging and wrangling around massive amounts of text. While there are some "standardized" file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times where knowing a little bit of Unix/Linux is extremely helpful, namely awk, sed, cut, grep, GNU parallel, and others.

This is by no means an exhaustive catalog, but I've put together a short list of examples using various Unix/Linux utilities for text manipulation, from the very basic (e.g., sum a column) to the very advanced (munge a FASTQ file and print the total number of reads, total number unique reads, percentage of unique reads, most abundant sequence, and its frequency). Most of these examples (with the exception of the SeqTK examples) use built-in utilities installed on nearly every Linux system. These examples are a combination of tactics I used everyday and examples culled from other sources listed at the top of the page.

The list is available as a README in this GitHub repo. This list is a start - I would love suggestions for other things to include. To make a suggestion, leave a comment here, or better - open an issue, or even better still - send me a pull request.

Useful one-liners for bioinformatics: https://github.com/stephenturner/oneliners

Alternatively, download a PDF here.

6 comments:

UnknownOctober 22, 2013 at 8:00 AM
Hi Steven.
Here are a couple I like:
Keep only top bit scores in blast hits
--let it be the best bit only
awk '{ if(!x[$1]++) {print $0; bitscore=($14-1)} else { if($14>bitscore) print $0} }'
--let it be 5 less than the top
awk '{ if(!x[$1]++) {print $0; bitscore=($14-6)} else { if($14>bitscore) print $0} }'

Fasta header lines to gff format
-assuming the length is in the header as an appended "_length" as in Velvet assembled transcripts
grep '>' file.fasta | awk -F "_" 'BEGIN{i=1; print "##gff-version 3"}{ print $0"\t BLAT\tEXON\t1\t"$10"\t95\t+\t.\tgene_id="$0";transcript_id=Transcript_"i;i++ }' >file.gff

And an alias for those of us with a dominant left hand:
alias mroe='more'
I also like:
alias grepc='grep --color-always'
alias refresh='source ~/.bash_profile'

Also, some mention of MTPutty and screen. These two items have changed my life (along with Trello). Not quite as much as kids...but seriously close. ;)
ReplyDelete
Replies
tommyOctober 22, 2013 at 11:54 AM
I found many useful ones here http://genomics-array.blogspot.com/2010/11/some-unixperl-oneliners-for.html
ReplyDelete
Replies
Stephen TurnerOctober 22, 2013 at 3:35 PM
Thanks Bob, Tommy. I'll incorporate some of these into the next commit. I'm also putting together a post on screen. Bob, you'll have to tell me about how you're using Trello sometime.
ReplyDelete
Replies
tommyOctober 25, 2013 at 11:39 AM
To number each line of the file, nl command would be easier:)
ReplyDelete
Replies
LeahOctober 28, 2013 at 12:44 PM
This post is helpful! It would be great if there were a more centralized resource of the "tricks" that make it easier to manipulate these kinds of data.

For working with the results from large genotyping arrays (or GWAS data with whole genome imputation) this is what I use daily:

To remove lines for which column 9 is equal to NA:
awk ' $9!="NA" '

Sort numerically (with logs) (g) by column (k) 9:
sort -gk9 file.txt

Returns all lines on Chr 1 between 1MB and 2MB in file.txt. (assumes) chromosome in column 1 and position in column 3 (this same concept can be used to return only variants that above specific allele frequencies):
awk '$1=="1" ' file.txt | awk '$3<=1000000' | awk '$3>=2000000'

And my favorite, annotating a results file where column 1 is chromosome (e.g. chr1) and column 3 is position in build 137. It queries the snp137 table of UCSC to turn array variant identifiers (e.g. kgp2929293) to an rs ID. You can modify this to return other annotations as well.
cat filetoannotate.txt | ~/Desktop/variation-toolkit/bin/mysqlucsc --table snp137 -C 1 -S 3 -E 3 --field name > out.txt
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.

This blog has moved!

Monday, October 21, 2013

Useful Unix/Linux One-Liners for Bioinformatics

6 comments: