Getting Genetics Done

Friday, May 20, 2011

Using NCBI E-Utilities

NCBI has put a lot of effort into unifying their data access and retrieval system -- whether you are searching for a gene, protein, or publication, the results are returned in a similar fashion.

What most people don't realize is that this Entrez system is easily adapted for programmatic access (there are lots of details here). For example, recently I was interested in building a co-authorship network for a few investigators in our center, and rather than searching for and exporting this information using the pubmed website, I used the Entrez E-utilities inside a perl script. Python, Ruby and other scripting languages work great too, but I have gotten used to perl for tasks like this. If you don't have access to a linux distribution with perl installed, you can use strawberry perl in Windows.

To start, we need a web retrieval library called LWP::Simple. If for some reason you don't have this installed by default, you should be able to find it in a CPAN search.

use LWP::Simple;

Then, I set up the base url for the entrez utilities.

my $esearch = "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" . "db=pubmed&retmax=10000&usehistory=y&term=";

In the above line, you can change the db= to any of the databases listed here. The retmax= value is the maximum number of results to return. The term= value is the collection of search terms you wish to use. In my case, I used an authors last name, initials, and our home institution, Vanderbilt. We then execute the query.

my $q = "Bush WS Vanderbilt";
my $esearch_result = get($esearch . $q);

So here, we use a two-step process --

1. First, the program submits a search to the system. When this happens, their web-servers accept the search request and tag it with WebEnv ID (which the web-dev geeks would call a session variable) and a query key, then conducts the search to find identifiers that match the search request. Since we searched the pubmed database, the identifiers are all pubmed ids. This list of ids is stored on the NCBI servers for a brief time until it expires.

To do anything useful with our list of identifiers sitting on the NCBI servers out there, we need to pull the WebEnv ID and the QueryKey from the esearch result. The following code will yank these out of the XML stuff the web server sends back, and it also gives us a count of the records our query found.

$esearch_result =~
m|(\d+).*(\d+).*(\S+)|s;

my $Count = $1;
my $QueryKey = $2;
my $WebEnv = $3;

To see these, you can print them if you like:

print "Count = $Count; QueryKey = $QueryKey; WebEnv $WebEnv\n";

2. Next, our program must submit a fetch request to fish out the details for each of these identifiers. We do this using their eSummary engine, which works like so:

my $efetch = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgidb=gds&query_key=$QueryKey&WebEnv=$WebEnv";

my $efetch_result = get($efetch);

Now within perl, you can parse through this result to pull out any relevant information you might want. In case you don't know, perl is great for parsing text -- all the slashes and squigglies are for doing regular expression pattern matching. For my example, I was curious to see how many people I've been co-author with and on how many publications. I used the following to pull each author/pubmed id combination for a given search term.

@lines = split(/\n/,$efetch_result);
%citarray = ();
$opendoc = 0;
$id = 0;

foreach $line (@lines)
{
if($line =~ //)
{
$opendoc = 1;
}

if($line =~ /<\/DocSum>/)
{
$opendoc = 0;
}

if($opendoc == 1 && $line =~ /(\d+)<\/Id>/)
{
$id = $1;
}

if($opendoc == 1 && $line =~ /(.*)<\/Item>/)
{
print "$id\t$1\n";
}

}

For the sake of brevity, I'll skip a protracted discussion of the parsing logic I used, but if there is interest, I can elaborate.

In case you are wondering, I loaded this into a database table, joined that table to itself matching on pubmed id, and imported this into Gephi to build our co-authorship network. This was a big hit at the faculty meeting!

Thursday, May 19, 2011

More Command-Line Text Munging Utilities

In a previous post I linked to gcol as a quick and intuitive alternative to awk. I just stumbled across yet another set of handy text file manipulation utilities from the creators of the BEAGLE software for GWAS data imputation and analysis. In addition to several command line utilities for converting and formatting BEAGLE files, there are several tools for doing basic text processing tasks on the command line:

changecolumn.jar - replace values in a column of an input file.
changeline.jar - replace values in a line of an input file.
cut.jar - extract columns from a file.
filtercolumns.jar - filters columns of input data according to the values in a line.
filterlines.jar - filters lines of input data according to the values in a column.
paste.jar - pastes together files that have shared initial columns followed by data columns.
transpose.jar - transposes rows and columns of an input file.

Much of what these tools do can probably be emulated with some creativity with Unix commands and pipes. But since these are all Java archives they should work on any platform, not just Unix/Linux. Hit the link below to see the full list and documentation.

BEAGLE Utilities for text manipulation

Tuesday, May 17, 2011

How to Run a Bioinformatics Core

After evaluating an unnamed bioinformatics core facility, a group of bioinformaticians in Europe wrote up a short list of basic guidelines for organizing a bioinformatics core facility in large research institutes.

The full editorial recently appeared in Bioinformatics (PubMed):

A few of the key points are:

Bioinformatics departments should separate service roles from research laboratories (maintains transparency and clarifies allocation of funds).
Separation of the core into "units" that offer clearly defined services, and installation of "users committees" for each unit that will assist in prioritizing projects needing bioinformatics support.
Bioinformatics cores should provide training to "bench" biologists, and should nominate a point person for outreach and education in publicly available bioinformatics tools and databases to encourage biologists to incorporate bioinformatics into their research workflow.

The authors tell us that at their respective institutions, they have approximately 1 full-time bioinformatician to support 100 scientists. This seems woefully imbalanced to me, and I wonder how this will change in the near future as more basic science research labs start incorporating large-scale -omics data into their research programs.

Kallioniemi, O., Wessels, L., & Valencia, A. (2011). On the organization of bioinformatics core services in biology-based research institutes. Bioinformatics (Oxford, England), 27(10), 2011-2011. doi: 10.1093/bioinformatics/btr125 (PubMed).

Monday, May 16, 2011

gcol == awk++

A while back Will showed you how to ditch Excel for awk, a handy Unix command line tool for extracting certain rows and columns from a text file. While I was browsing the documentation on the previously mentioned PLINK/SEQ library, I came across gcol, another utility for extracting columns from a tab-delimited text file. It can't do anything that awk can't, but it's easier and more intuitive to use for simple text munging tasks. Take a quick look at the gcol examples to see what I mean. And remember, if you need to convert a CSV to a tab-delimited file, just use sed with a Perl regexp: sed -r 's/,/\t/g' myfile.csv

For a demonstration of several other "data science hand tools", check out this post at O'Reilly that covers other handy Unix utilities such as grep, colrm, awk, find, xargs, sort, uniq, and others.

gcol - get columns text utility

Monday, May 9, 2011

Accessing Databases From R

Jeffrey Breen put together a useful slideshow on accessing databases from R. I use RODBC every single day to access my own local MySQL server from R. I've had trouble with RMySQL, so I've always used RODBC instead after setting up my localhost MySQL server as a Windows data source. Once you get accustomed to accessing your data directly with SQL queries rather than dumping files you'll wonder why you waited so long. After you get a handle on using SQL you can take it one step further and use SQL queries on data.frames with the sqldf package for quick and easy data management and group summaries (which happens to be way faster than plyr, aggregate, doBy, and data.table for simple grouping tasks).

Also, if you're new to databases, check out Will's previous post on how to store and organize results from a genetic study using MySQL, or take a look at the w3schools SQL tutorial.

Greater Boston UseR Group Files via (@RevoDavid)

Wednesday, May 4, 2011

PLINK/SEQ for Analyzing Large-Scale Genome Sequencing Data

PLINK/SEQ is an open source C/C++ library for analyzing large-scale genome sequencing data. The library can be accessed via the pseq command line tool, or through an R interface. The project is developed independently of PLINK but it's syntax will be familiar to PLINK users.

PLINK/SEQ boasts an impressive feature set for a project still in the beta testing phase. It supports several data types (multiallelic, phased, imputation probabilities, indels, and structural variants), and can handle datasets much larger than what can fit into memory. PLINK/SEQ also comes bundled with several reference databases of gene transcripts and sequence & variation projects, including dbSNP and 1000 Genomes Project data.

As with PLINK, the documentation is good, and there's a tutorial using 1000 Genomes Project data.

PLINK/SEQ - A library for the analysis of genetic variation data

Monday, May 2, 2011

Golden Helix: A Hitchhiker's Guide to Next Generation Sequencing

This is a few months old but I just got around to reading this series of blog posts on next-generation sequencing (NGS) by Gabe Rudy, Golden Helix's VP of product development. This series gives a seriously useful overview of NGS technology, then delves into the analysis of NGS data at each step, right down to a description of the most commonly used file formats and tools for the job. Check it out now if you haven't already.

Part One gives an overview of NGS trends and technologies. Part Two describes the steps and programs used in the bioinfomatics of NGS data, broken down into three distinct analysis phases. Part Three gives more details on the needs and workflows of the final stage of the analysis of NGS data, or the "sense-making" phase. Finally, a fourth part is a primer on the common formats (FASTQ, SAM/BAM, VCF) and tools (BWA, Bowtie, VCFtools, SAMtools, etc) used in NGS bioinformatics and analysis.

Monday, April 25, 2011

Annotated Manhattan plots and QQ plots for GWAS using R, Revisited

**** UPDATE, May 15 2014 *****
The functions described here have now been wrapped into an R package. View the updated blog post or see the online package vignette for how to install and use. If you'd still like to use the old code described here, you can access this at version 0.0.0 on GitHub. The code below likely won't work.
*****************************

Last year I showed you how to create manhattan plots, and later how to highlight regions of interest, using ggplot2 in R. The code was slow, required a lot of memory, and was difficult to maintain and modify.

I finally found time to rewrite the code using base graphics rather than ggplot2. The code is now much faster, and if you're familiar with base R's plot options and graphical parameters, most of these can now be passed to the functions to tweak the plots' appearance. The code also behaves differently depending on whether you have results for one or more than one chromosome.

Here's a quick demo.

First, either copy and paste the code from GitHub, or run the following commands in R to download and source the code from GitHub (you can't directly read from https in R, so you have to download the file first, the source it). Note the command is different on Mac vs Windows.

Download the function code on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/qqman.r", destfile="./qqman.r", method="curl")

Download the function code on Windows, leave out the method="curl" argument:

download.file("https://raw.github.com/stephenturner/qqman/master/qqman.r", destfile="./qqman.r")

Now, source the script containing the functions.

source("./qqman.r")

Next, load some GWAS results, and take a look at the relevant columns (same as above, download first then read locally from disk). This is standard output from PLINK's --assoc option.

Download example data on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/plink.assoc.txt.gz", destfile="./plink.assoc.txt.gz", method="curl")

Download example data on Windows:

download.file("https://raw.github.com/stephenturner/qqman/master/plink.assoc.txt.gz", destfile="./plink.assoc.txt.gz")

Read in the sample results, and take a look at the first few lines:

results = read.table(gzfile("./plink.assoc.txt.gz"), header=T)

head(subset(results, select=c(SNP, CHR, BP, P)))

The manhattan function assumes you have columns named SNP, CHR, BP, and P, corresponding to the SNP name (rs number), chromosome number, genomic coordinate, and p-value. Missing values (where regression failed to converge, for instance) should be recoded NA before reading into R. Do this with a quick sed command. Here's what the data looks like:

         SNP CHR        BP       P
1 rs10495434   1 235800006 0.62220
2  rs6689417   1  46100028 0.06195
3  rs3897197   1 143700035 0.10700
4  rs2282450   1 202300047 0.47280
5   rs567279   1  66400050      NA
6 rs11208515   1  64900051 0.53430

Now, create a basic manhattan plot (click the image to enlarge):

manhattan(results)

If you type args(manhattan) you can see the options you can set. Here are a few:

colors: this is a character vector specifying the colors to cycle through for coloring each point. Here's a PDF chart of R's color names.

ymax: this is the y-axis limit. If ymax="max" (default), the y-axis will always be a little bit higher than the most significant -log10(p-value). Otherwise you can set this value yourself.

cex.x.axis: this can be used to shrink the x-axis labels by setting this value less than 1. This is handy if some of the tick labels aren't showing up because the plot region is too small.

*** Update March 9 2012*** cex.x.axis is deprecated. To change the x-axis size, use the default base graphics argument cex.axis.

limitchromosomes: you can limit which chromosomes you want to display. By default this restricts the plot to chromosomes 1-23(x).

suggestiveline and genomewideline: set these to FALSE if you don't want threshold lines, or change the thresholds yourself.

annotate: by default this is undefined. If you supply a character vector of SNP names (e.g. rs numbers), any SNPs in the results data frame that also show up here will be highlighted in green by default. example below.

... : The dot-dot-dot means you can pass most other plot or graphical parameters to these functions (e.g. main, cex, pch, etc).

Make a better looking manhattan plot. Change the plot colors, point shape, and remove the threshold lines:

manhattan(results, colors=c("black","#666666","#CC6600"), pch=20, genomewideline=F, suggestiveline=F)

Now, read in a text file with SNP names that you want to highlight, then make a manhattan plot highlighting those SNPs, and give the plot a title:

Download a SNP list on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/snps.txt", destfile="./snps.txt", method="curl")

Download a SNP list on Windows:

download.file("https://raw.github.com/stephenturner/qqman/master/snps.txt", destfile="./snps.txt")

Read in the SNP list, and create an annotated plot:

snps_to_highlight = scan("./snps.txt", character())

manhattan(results, annotate=snps_to_highlight, pch=20, main="Manhattan Plot")

Finally, zoom in and plot only the results for chromosome 11, still highlighting those results. Notice that the x-axis changes from chromosome to genomic coordinate.

manhattan(subset(results, CHR==11), pch=20, annotate=snps_to_highlight, main="Chr11")

Finally, make a quantile-quantile plot of the p-values. To make a basic qq-plot of the p-values, pass the qq() function a vector of p-values:

qq(results$P)

Perhaps we should have made the qq-plot first, as it looks like we might have some unaccounted-for population stratification or other bias.

The code should run much faster and use less memory than before. All the old functions that use ggplot2 are still available, now prefixed with "gg." Please feel free to use, modify, and redistribute, but kindly link back to this post. The function source code, data, SNPs of interest, and example code used in this post are all available in the qqman GitHub repository.

This blog has moved!