Monday, October 21, 2013

Useful Unix/Linux One-Liners for Bioinformatics

Much of the work that bioinformaticians do is munging and wrangling massive amounts of text. While there are some "standardized" file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times when knowing a little bit of Unix/Linux is extremely helpful, namely awk, sed, cut, grep, GNU parallel, and others.

This is by no means an exhaustive catalog, but I've put together a short list of examples using various Unix/Linux utilities for text manipulation, from the very basic (e.g., sum a column) to the very advanced (munge a FASTQ file and print the total number of reads, the number of unique reads, the percentage of unique reads, the most abundant sequence, and its frequency). Most of these examples (with the exception of the seqtk examples) use built-in utilities installed on nearly every Linux system. These examples are a combination of tactics I use every day and examples culled from the sources listed at the top of the page.
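As a taste of the advanced end, here's a sketch of that FASTQ summary. The filename `reads.fq` and the toy sequences are made up for illustration; it assumes a well-formed 4-line-per-record FASTQ and counts a read as "unique" if its sequence occurs exactly once:

```shell
# Toy FASTQ; in a 4-line-per-record file, the sequence is every line where NR%4==2
cat > reads.fq <<'EOF'
@r1
ACGT
+
IIII
@r2
ACGT
+
IIII
@r3
TTTT
+
IIII
EOF

# Print: total reads, reads whose sequence occurs exactly once, % unique,
# most abundant sequence, and its count
awk 'NR%4==2 {total++; count[$0]++}
     END {
       for (s in count) {
         if (count[s] == 1) unique++
         if (count[s] > max) { max = count[s]; maxSeq = s }
       }
       print total, unique, unique*100/total, maxSeq, max
     }' reads.fq
# prints: 3 1 33.3333 ACGT 2
```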

The list is available as a README in this GitHub repo. This list is a start - I would love suggestions for other things to include. To make a suggestion, leave a comment here, or better - open an issue, or even better still - send me a pull request.

Useful one-liners for bioinformatics:

Alternatively, download a PDF here.


  1. Hi Steven.
    Here are a couple I like:
    Keep only top bit scores in blast hits
    --let it be the best bit only
    awk '{ if(!x[$1]++) {print $0; bitscore=($14-1)} else { if($14>bitscore) print $0} }'
    --let it be 5 less than the top
    awk '{ if(!x[$1]++) {print $0; bitscore=($14-6)} else { if($14>bitscore) print $0} }'
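These one-liners assume the hits for each query are grouped together and sorted best-bit-first (as BLAST tabular output normally is), with the bit score in column 14 of this particular format. A quick toy run of the "within 5 of the top" version — the file `hits.txt` and its dot columns are placeholders:

```shell
# Toy hit table: query ID in column 1, bit score in column 14 (dots are filler columns)
printf '%s\n' \
  'q1 . . . . . . . . . . . . 100' \
  'q1 . . . . . . . . . . . . 97' \
  'q1 . . . . . . . . . . . . 90' \
  'q2 . . . . . . . . . . . . 50' > hits.txt

# Keep each query's best hit plus any later hit within 5 bits of it
awk '{ if(!x[$1]++) {print $0; bitscore=($14-6)} else { if($14>bitscore) print $0} }' hits.txt
# keeps the q1 hits at 100 and 97 and the single q2 hit; drops the q1 hit at 90
```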

    Fasta header lines to gff format
    -assuming the length is in the header as an appended "_length" as in Velvet assembled transcripts
    grep '>' file.fasta | awk -F "_" 'BEGIN{i=1; print "##gff-version 3"}{ print $0"\t BLAT\tEXON\t1\t"$10"\t95\t+\t.\tgene_id="$0";transcript_id=Transcript_"i;i++ }' >file.gff

    And an alias for those of us with a dominant left hand:
    alias mroe='more'
    I also like:
alias grepc='grep --color=always'
    alias refresh='source ~/.bash_profile'

Also, some mention of MTPuTTY and screen. These two items have changed my life (along with Trello). Not quite as much as kids...but seriously close. ;)

  2. I found many useful ones here

  3. Thanks Bob, Tommy. I'll incorporate some of these into the next commit. I'm also putting together a post on screen. Bob, you'll have to tell me about how you're using Trello sometime.

5. To number each line of a file, the nl command would be easier :)
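For example (`demo.txt` is a throwaway file; `-ba` means "number all lines" — by default nl skips blank ones):

```shell
printf 'foo\n\nbar\n' > demo.txt   # throwaway example file with one blank line
nl -ba demo.txt                    # -ba: number every line, including blanks
```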

  5. This post is helpful! It would be great if there were a more centralized resource of the "tricks" that make it easier to manipulate these kinds of data.

    For working with the results from large genotyping arrays (or GWAS data with whole genome imputation) this is what I use daily:

    To remove lines for which column 9 is equal to NA:
    awk ' $9!="NA" '

    Sort numerically (with logs) (g) by column (k) 9:
    sort -gk9 file.txt

    Return all lines on chr 1 between 1MB and 2MB in file.txt, assuming chromosome is in column 1 and position is in column 3 (the same concept can be used to return only variants above a specific allele frequency):
    awk '$1=="1" ' file.txt | awk '$3>=1000000' | awk '$3<=2000000'
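Note the position test needs to keep values >= 1000000 and <= 2000000; it can also be done in a single awk pass. A sketch on toy data (`results.txt` and its columns are made up to match the layout described above):

```shell
# Toy results file: chromosome in column 1, position in column 3
printf '%s\n' '1 a 500000' '1 b 1500000' '2 c 1500000' '1 d 2500000' > results.txt

# chr 1 AND 1MB <= position <= 2MB, in one pass
awk '$1=="1" && $3>=1000000 && $3<=2000000' results.txt
# prints: 1 b 1500000
```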

    And my favorite: annotating a results file where column 1 is the chromosome (e.g., chr1) and column 3 is the position in build 137. It queries the snp137 table at UCSC to convert array variant identifiers (e.g., kgp2929293) into rs IDs. You can modify this to return other annotations as well.
    cat filetoannotate.txt | ~/Desktop/variation-toolkit/bin/mysqlucsc --table snp137 -C 1 -S 3 -E 3 --field name > out.txt

    1. Thanks, added these (except for the last, which requires mysqlucsc - might add a tutorial on that separately).



Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.