Wednesday, April 18, 2012

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

I was reading through a paper on comparative ChIP-Seq when I found this awk gem that lets you get some very basic stats very quickly on next generation sequencing reads. To use, simply cat the fastq file (or gunzip -c) and pipe that to this awk command:


cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'

The output would look something like this for some RNA-seq data downloaded from the Galaxy RNA-seq tutorial:

99115 60567 61.1078 ACCTCAGGA 354 0.357161

This is telling you:
  1. The total number of reads (99,115).
  2. The number of unique reads (60,567).
  3. The frequency of unique reads as a proportion of the total (61%).
  4. The most abundant sequence (useful for finding adapters, linkers, etc).
  5. The number of times that sequence is present (354).
  6. The frequency of that sequence as a proportion of the total number of reads (0.35%).
If you have a handful of fastq files in a directory and you'd like to do this for each of them, you can wrap this in a for loop in bash:

for read in `ls *.fq`; do echo -n "$read "; awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' $read; done

This does the same thing, but adds an extra field at the beginning for the file name. I haven't yet figured out how to wrap this into GNU parallel, but the for loop should do the trick for multiple files.

Check out FASTQC for more extensive quality assessment.

Friday, April 6, 2012

RNA-Seq Methods & March Twitter Roundup

There were lots of interesting developments this month that didn't work their way into a full blog post. Here is an incomplete list of what I've been tweeting about over the last few weeks. But first I want to draw your attention to the latest manuscript for a new bioconductor package for doing RNA-seq in R.

DEXSeq vs Cuffdiff. See the pre-publication manuscript from Simon Anders, Alejandro Reyes, and Wolfgang Huber: "Detecting differential usage of exons from RNA-Seq data.DEXSeq is an R package by the same guys who developed the DESeq R package and the HTSeq python scripts. (Incidentally, both DESeq and DEXSeq are rare examples of bioconductor vignettes which are well developed and are a pleasure to read). I often use cufflinks/cuffdiff in the bioinformatics core was because many other tools and methods only allow you to interrogate differential expression at the gene level. Using cufflinks for transcriptome assembly enables you to interrogate transcript/isoform expression, differential splicing, differential coding output, differential promoter usage, etc. DEXSeq uses similar methodology as DESeq, but can give you exon-level differential expression, without going through all the assembly business that cufflinks does. In one of the supplementary tables in their pre-pub manuscript, they compare several versions of cuffdiff to DEXSeq on two datasets. Both of these datasets had biological replicates for treatment and control conditions. They compared treatment to controls, and found DEXseq gave you more significant hits than cuffdiff. Then they compared controls to other controls (ideally should have zero hits) and found cufflinks had way more hits. See p13, p23, tables S1 and S2.

Proper comparison treatment vs control, # significant hits:
DEXSeq: 159
Cuffdiff 1.1: 145
Cuffdiff 1.2: 69
Cuffdiff 1.3: 50

Mock comparison controls vs controls, # significant hits:
DEXSeq: 8
Cuffdiff 1.1: 314
Cuffdiff 1.2: 650
Cuffdiff 1.3: 639

In the UVA Bioinformatics core we strive for reproducibility, scalability, and transparency using the most robust tools and methodology available. It gives me pause to see such alarmingly different results with each new version and each new protocol of a particular tool. What are your thoughts and experiences with using Cufflinks/Cuffdiff, DESeq/DEXSeq, or the many, many other tools for RNA-Seq (MISO, ExpressionPlot, EdgeR, RSEM, easyRNASeq, etc.)? Please share in the comments.

Everything else:

Webinar from @goldenhelixinc: Learning From Our GWAS Mistakes: From experimental design to scientific method https://t.co/KkxAn18p


Sequencing technology does not eliminate biological variability http://t.co/NI3acZgn


[bump] Questions on cutoff setting of FPKM value & know genes filtering in Cuffmerge result http://t.co/iKMZ7Dsd #bioinformatics



Very cool: DNAse-Seq+RNA-seq used to show DNaseI sensitivity eQTLs are a major determinant of gene expression variation http://t.co/nPo3xHVa

Beware using UCSC GTFs in HTSeq/CovergeBed for counting RNA-seq reads. "transcript_id" is repeated as "gene_id"! https://t.co/ADg1Pi6U

Google Scholar Metrics: top 20 journals in bioinformatics http://t.co/QYTf5pyT

A systematic eQTL study of cis-trans #epistasis in 210 HapMap individuals http://t.co/0YHlFQak

Identification of allele-specific alternative mRNA processing via RNA-seq http://t.co/fig9cLlH #bioinformatics @myen

prepub on arXiv + analysis tutorial/walkthrough + AWS EC2 AMI + git repo + ipython notebook = reproducible research done right http://t.co/GPNmpdJD

Altmetrics in the Wild: Using Social Media to Explore Scholarly Impact http://t.co/8xZuIHw9

NSF-NIH Interagency Initiative: Core Techniques and Technologies for Advancing Big Data Science and Engineering http://t.co/W3LUdCsG

New approach from @MarylynRitchie lab to collapsing/combining: using biological pathways rather than positional info http://t.co/ywWj0MNn

The Transcription Factor Encyclopedia http://t.co/DjJUwR10 Paper: http://t.co/b0J8PXO6

NF-kB: where did it come from and why? http://t.co/NLV1mBd0

Cloud BioLinux: pre-configured and on-demand #bioinformatics computing for the genomics community. @myen http://t.co/3kCE0ktH

SCOTUS remands AMP v Myriad (BRCA) patent case to CAFC to consider in light of prometheus decision http://t.co/7CkTa4l0

57 year experiment, Drosophila kept in dark for 1400 generations, many evolutionary changes (record longest postdoc!) http://t.co/wukq8fAf

IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly http://t.co/FN1sYM8f #bioinformatics

Complex disease genetics is complex. Imagine that. Hirschhorn, Visscher, & the usual consortium suspects: http://t.co/Bwopxlx6

MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets http://t.co/pffuHIlO #bioinformatics

Paper about SEQanswers forum published in #Bioinformatics http://t.co/SUlQ6O8c

num-utils - like awk, grep, sort, cut, etc for numbers http://t.co/bgkB5FMV

Nat Protocols: Differential gene & transcript expression analysis of RNA-seq w/ TopHat & Cufflinks http://t.co/U1ZpSE7V #bioinformatics


Wednesday, March 14, 2012

Video Tip: Convert Gene IDs with Biomart

I get asked frequently how to convert from one gene identifier to another. This can be tricky, especially when relying on gene symbols, as Will pointed out in a previous post a few years ago. There are several tools that can do this, including DAVID and the previously mentioned new Biomart ID Converter, but I still prefer using the Ensembl Biomart for this because of its added flexibility and annotation.

I've started putting together video screencasts for things like this, especially when several of the core's clients ask the same question. In this example, I'll show you how to quickly convert from the Affymetrix Mouse Gene 1.0 ST microarray probeset IDs to an Ensembl gene ID and gene symbol.



You can also do this programmatically in R using the biomaRt package in Bioconductor.

Tuesday, March 13, 2012

Redesign by Subtraction

GGD has a new look. I was inspired by Gina Trapani (Smarterware, Lifehacker) to remove any extra lines, links, and other "ink" that doesn't serve any purpose, and I hope the site appears cleaner and easier to read. I also wanted the extra horizontal space for larger images and avoid the dreaded side-scrolling in posts with lots of code like this one. The space will also be handy for short instructional screencasts I've been recording for my clients here in the core, which I'll start posting here soon. Leave a comment if anything appears broken. I hope you like the new look!

Friday, March 9, 2012

find | xargs ... Like a Boss

*Edit March 12* Be sure to look at the comments, especially the commentary on Hacker News - you can supercharge the find|xargs idea by using find|parallel instead.
---

Do you ever discover a trick to do something better, faster, or easier, and wish you could reclaim all the wasted time and effort before your discovery? I've had this happen many times - learning how to use vim, learning about querying relational databases, switching from EndNote to Mendeley, etc.

Now I wish I could reclaim all the time I spent writing perl code to do something that find|xargs can do much more easily. I'll lead you through an example using a problem I ran into and solved easily with find|xargs.

The problem:

I'm doing an RNA-seq experiment where I have three replicates {1,2,3} for two conditions {C,M}. I've run Tophat to align the raw reads to a reference genome, and saved the output of each run to a separate directory. Within those directories I have an alignment (accepted_hits.bam) and some other stuff.



Now, I want to assemble transcripts separately for each alignment. All the alignments are 1 directory deep in the tophat output directory. Furthermore, I want to submit a separate job for each assembly onto our cluster using qsub. In a former life I would have wrapped a perl script around this to write shell scripts that the scheduler would run, and then write a second perl script that would submit all the jobs to the scheduler.

Here's a command that will do it all in one go:

PROCS=8; find `pwd` -name "accepted_hits.bam" | xargs -i echo qsub -l ncpus=$PROCS -- `which cufflinks` -p $PROCS -o {}-cufflinks -g genes.gtf {} | sh

So let's break this down one piece at a time to see what's going on here.

First the find command. If I want to find all the files in the current directory recursively through any subdirectory, I can use the find command with the -name argument. You can see that find is searching "." (the current directory), and is returning the path (relative to ".") for each file it finds.



But if I'm submitting jobs to a scheduler, I want to use the full absolute path. You can modify find's output by telling it to search in "." but give it the full path. Even better, tell find to search in `pwd` (those are backticks, usually above the tab key. The shell will run the pwd command, and insert that output into the find command. See below.



Now, here's where xargs comes in. xargs lets you build new commands based on the standard input. In other words, you can build a command to run on each line that gets piped to xargs. Use the -i option to create a command for each individual line on the pipe, and use {} as a placeholder. The example below should help.



So here's whats going on. In the first step, I'm piping the output of find into xargs, and xargs is creating a new command that will echo "somecommand {}", where {} is a placeholder for what gets piped into xargs. So you can replace somecommand with anycommand -any -options -you -want, and echo all that back out to the STDOUT.

The second part simply pipes whatever is echo'd to the STDIN to the bash shell ( | sh). So bash will run each command it receives on the pipe. Since somecommand doesn't exist, I'm getting an error.

Below, I'm building the command to run cufflinks on each alignment, and dump that output back out to a new directory based on the pattern of the alignment (bam) file name. But since in the next step I want to parallelize this on the cluster, and the cluster won't know that cufflinks is in my path, I need to tell it where to find cufflinks. I could give it the path, but I would rather use the backtick trick I showed you above to let the shell tell the shell where cufflinks resides by using `which cufflinks`.



In the final piece, I'm adding a few more components.



First, I'm setting a shell variable called PROCS. I can access this variable later by using $PROCS. This is how many CPUs I want to allocate to each assembly job. The ";" separates two commands. Instead of using xargs to build a cufflinks command to run at the shell directly, I'm using xargs to build a qsub command that will qsub these jobs to the cluster. To qsub something from the command line, the syntax is

qsub -- myprog -o -p -t arg1 arg2

Where are the PBS directives to qsub (like the number of processors I want, the amount of RAM I require, etc). myprog is the program you're running. -o, -p, -t are options to myprog, and arg1 and arg2 are the arguments to myprog. The two hyphens are required between the qsub options and the commands you actually want to run.

So in the command above, I'm using xargs to build up a qsub command, substituting $PROCS (8 here) for the number of CPUs I require, calling cufflinks from the full path using the backtick trick, telling cufflinks that I want $PROCS (8) processors, use genes.gtf for reference annotation based transcript (RABT) assembly, and run all that cufflinks stuff on the supplied alignment.

In summary, this one-liner will submit six 8-processor jobs. It sure beats writing perl scripts. There are a zillion other uses for piping together find with xargs. If you have a useful one-liner, please share it in the comments. Happy piping!

Tuesday, March 6, 2012

Pathway Analysis for High-Throughput Genomics Studies

I get a lot of requests in the core about running a "pathway analysis." Someone ran a handful of gene expression arrays, or better yet, ran an RNA-seq experiment (with replicates!). These, and many other kinds of high-throughput assays (GWAS, ChIP-seq, etc.) result in a list of genes and some associated p-value, fold change, or other statistic.

Here's some R code to download public data from a study on susceptibility to colorectal cancer. I realize that this is an oversimplified design - that's not the focus here. You can read more about proper design and contrast matrices in the limma vignette. You can read more about the study and the samples in the paper or on the GEO page.



A thoughtful and thorough analysis doesn't end with a list of genes and P-values. You spent way too much money on an experiment just to end up with a list of genes and p-values at some arbitrary cutoff. I often see investigators going down the list and cherry-picking genes that they think are important or might play a role, and trying to create a story involving those cherry-picked genes. I believe it's better to be more agnostic, taking a data-driven approach to framing results in a functional context.

How do you do this? This review recently published in PLoS Comp Bio is an excellent place to start:

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLoS Computational Biology, 8(2), e1002375. doi:10.1371/journal.pcbi.1002375

I wrote an evaluation of this paper at Faculty of 1000 - here's an expanded version. The paper gives an overview of the available methods for pathway analysis, and categorizes these methods into three functional classes: over-representation analysis, functional class scoring, and topology-based methods. The paper discusses the advantages and limitations of each approach, as well as future development directions.

Over-representation analysis is what you would typically do with something like gene ontology annotations. The Gene Ontology (GO) project is an effort to describe gene products from all organisms using a consistent language (ontologies are formal representations of knowledge domains; GO does this with cell biology). A biological process is typically made up of a group of genes, as opposed to an individual gene alone. The idea of over-representation analysis is that if a biological process is abnormal in a given study, the co-functioning genes are more likely to be selected as a relevant group by the high-throughput screening technologies. For example, if 10% of the your genes selected by a microarray experiment are kinases, as opposed to 1% of the genes in the whole human genome (this is the gene population background) that are kinases, then you have significant over-representation in your genes for kinases. There are many, many tools for doing over-representation analysis using gene ontology terms, but they all rely on the same idea, typically using a hypergeometric test. You can create directed acyclic graphs like the one below, color coding nodes based on the statistical significance of the term overrepresentation in your gene list. See the paper above for a discussion of limitations of over-representation analysis, and also check out this paper on the use and misuse of gene ontology annotations.


Functional Class Scoring. One of the biggest limitations of over-representation analysis is that you have to arbitrarily decide what you call significant and what you don't. If you use an FDR threshold of 0.05 and fold change cutoff of 2, you'll lose all those genes with a fold change of 1.95 or FDR 0.051, which are arguably just as important as the genes just under the arbitrary cutoff. Further, over-representation analyses only use the number of genes and ignore how strongly those genes are associated with whatever you're studying. Pathway analysis methods that the above review classifies as "Functional Class Scoring" methods use all the genes you have as well as their association statistics (e.g. fold change), and compute a running enrichment score for gene groupings (based on some functional knowledge like gene ontology or KEGG pathways). Gene Set Enrichment Analysis is one of the more popular approaches in this category. The plot below shows the kind of results you get from GSEA.  From the GSEA documentation: "The primary result of the gene set enrichment analysis is the enrichment score (ES), which reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. GSEA calculates the ES by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not. The magnitude of the increment depends on the correlation of the gene with the phenotype. The top portion of the plot shows the running ES for the gene set as the analysis walks down the ranked list. The score at the peak of the plot (the score furthest from 0.0) is the ES for the gene set. Gene sets with a distinct peak at the beginning (such as the one shown here) or end of the ranked list are generally the most interesting." So this shows that you have significant enrichment in the cancer phenotype (labeled "LL") for genes involved in the KEGG Spliceosome pathway.

Topology based methods. The above methods don't take into account how genes interact with each other (e.g. activation, inhibition, phosphorylation, direct/indirect interaction, ubiquitination, etc). Pathway topology methods this extra information in computing pathway-level statistics. I've recently started using the bioconductor SPIA package (Signaling Pathway Impact Analysis), which integrates lists of differentially expressed genes, their fold changes, and pathway topology, to identify pathways associated with condition you're studying. The code below runs SPIA on the analysis we did above. Make sure you successfully loaded all the R packages in the code above.

You can read more about the results SPIA is giving you in their paper and vignette. Briefly, each pathway is represented by a point, and the x and y axes are showing increasing evidence for the involvement of this pathway in your disease (the total "perturbation" in the pathway based on your gene expression data).

It's worth noting that all of the above mentioned methods have limitations. We don't fully understand biology, and our understanding of molecular networks and signaling pathways is still very low-resolution. We also don't have information about how different isoforms have different effects - which is something we'll get from RNA-seq experiments. Annotations are often incorrect and inaccurate, and we don't have very much cell-type specific or dynamic information about these pathways. Finally, the analysis of large gene lists with the current enrichment tools is still more of an exploratory data-mining procedure rather than a pure statistical solution and analytical endpoint.

Despite the limitations, the kinds of methods discussed here and reviewed elsewhere (see below) are still very useful for extracting biological meaning and framing results from high-throughput experiments in a functional context. The bioinformatics core here at UVA can do any of the kinds of analyses discussed above. Please fill out a consultation request form and I'll be happy to talk to you about what kinds of insight you may be able to glean from your existing data or in an experiment you're planning.

Other useful literature on the subject:


Holmans, P., Green, E. K., Pahwa, J. S., Ferreira, M. a R., Purcell, S. M., Sklar, P., Owen, M. J., et al. (2009). Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. American journal of human genetics, 85(1), 13-24. doi:10.1016/j.ajhg.2009.05.011

Huang, D. W., Sherman, B. T., & Lempicki, R. a. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research, 37(1), 1-13. doi:10.1093/nar/gkn923

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. (C. A. Ouzounis, Ed.)PLoS Computational Biology, 8(2), e1002375. doi:10.1371/journal.pcbi.1002375

Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature reviews. Genetics, 9(7), 509-15. doi:10.1038/nrg2363

Wang, K., Li, M., & Bucan, M. (2007). Pathway-Based Approaches for Analysis of Genomewide Association Studies. American journal of human genetics, 81(6), 1278-1283. doi:10.1086/522374

Yaspan, B. L., & Veatch, O. J. (2011). Strategies for Pathway Analysis from GWAS Data. Current protocols in human genetics / editorial board, Jonathan L. Haines ... [et al.], Chapter 1(October), Unit1.20. doi:10.1002/0471142905.hg0120s71

Friday, February 24, 2012

I'm Hiring!

I direct the Bioinformatics Core at the University of Virginia, and I'm hiring. Visit this link on the UVA Jobs website for more information. Here's the description:
The University of Virginia Bioinformatics Core is seeking a full-time position as a bioinformatics analyst. The analyst will work with other core staff on grant-funded and chargeback-based projects to manage and analyze large-scale datasets produced by next-generation sequencing. The analyst will identify opportunities and implement solutions for managing, visualizing, analyzing, and interpreting genomic data, including studies of gene expression (RNA-seq and microarrays), pathway analysis, protein-DNA binding (e.g. ChIP-seq), DNA methylation, and DNA variation, using Affymetrix, Illumina, Nimblegen, Agilent, Roche 454, Ion Torrent, and other high-throughput platforms in both human and model organisms. The analyst will work closely with the core director to assist in experimental design and provide expert consultation, technical, and scientific support for UVA investigators, and assist in outreach and training activities. The analyst will organize large-scale sequence data sets, manipulate and format data with perl, python, or other scripting languages, use established software to assess quality and analyze data, schedule and run jobs on a high-performance computing cluster, use Unix or a scripting language to extract meaningful results from output, use software or genome browsers for visualization, and use established databases and techniques for annotating genetic variants and results from expression/DNA-binding experiments. The successful candidate will have a demonstrated ability to translate biological questions into technical designs, and to identify, prioritize, and execute bioinformatics tasks to meet project goals and deadlines. An M.S. in Bioinformatics, Genomics, Biostatistics, or a related field is required for this position.  
I'm Hiring - Bioinformatics Analyst in the UVA Bioinformatics Core

Friday, February 17, 2012

Your Publications (with PMCID) as a PubMed Query

I'm updating my CV and biosketch for a few grant applications, and for some time now, NIH has required you to include the PubMed Central ID for each article you publish that arose from NIH support. I only have a dozen or so papers indexed in PubMed, but I still wanted a way to do this automatically. If you have scores of publications, looking up all the PMCIDs could easily become a hassle.

First, create an account at My NCBI. Under your bibliography, click "Manage My Bibliography." Then click "Add citation," then in the new window that comes up, select "Citation from PubMed" and hit the "Go To PubMed" button.

Now the trick here is constructing a PubMed query that will get your publications only. There are lots of Stephen D. Turner's out there, so I had to get creative. This query construction tip comes to me by way of my colleague here at UVA, Aaron Mackey:

For many people, simple PubMed author searches suffice, e.g. "Pearson WR[Author]". For some, such name-based searches get it mostly right, but may include a few spurious false hits. For these cases, it's easy enough to exclude those false hits explicitly (e.g. "Mackey AJ"[Author] NOT 9850730[PMID] NOT 10730495[PMID] gets rid of the two AJ Mackey publications that are not, in fact, mine). For others, simple author searches do not suffice at all, but usually adding an institution and/or departmental affiliation does narrow the results sufficiently (e.g. for Jeff Smith, Biochemistry: "Smith JS"[au] AND "University of Virginia"[Affiliation] AND "Biochemistry"[Affiliation] identifies the 16 articles for which Jeff Smith is the senior author; Jeff could also add a few collaborative publications by adding those pubmed IDs to the search, i.e. adding "OR 17482543[PMID]" to the end of his query.

When I did this for myself, I searched by author, AND (any of my institutional affiliations separated by OR's), but NOT (any of the PMIDs that were not mine, separated by OR's). Apparently there was once another Stephen D. Turner at UVA in the department of Urology. Here are the results returned by my unique query:

"Turner SD"[Au] AND ("James Madison"[Affiliation] OR Vanderbilt[Affiliation] OR Hawaii[Affiliation] OR "University of Virginia"[Affiliation]) NOT (11514333[PMID] OR 11058553[PMID])

The final step is clicking the "Send to" link at the top right, and sending the results of your query to My Bibliography.


Now, when you are back at My NCBI, you should see a list of all your publications, complete with both the PMID and PMCID, ready to go in your biosketch.


You can then export this bibliography as text, or simply copy/paste. Finally, you have the option of making your bibliography public (example).



Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.