Friday, May 20, 2011

Using NCBI E-Utilities

NCBI has put a lot of effort into unifying their data access and retrieval system -- whether you are searching for a gene, protein, or publication, the results are returned in a similar fashion.

What most people don't realize is that this Entrez system is easily adapted for programmatic access (there are lots of details here). For example, recently I was interested in building a co-authorship network for a few investigators in our center, and rather than searching for and exporting this information using the pubmed website, I used the Entrez E-utilities inside a perl script. Python, Ruby, and other scripting languages work great too, but I have gotten used to perl for tasks like this. If you don't have access to a Linux distribution with perl installed, you can use Strawberry Perl on Windows.

To start, we need a web retrieval library called LWP::Simple. If for some reason you don't have this installed by default, you should be able to find it in a CPAN search.


use LWP::Simple;


Then, I set up the base URL for the Entrez E-utilities.


my $esearch = "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" . "db=pubmed&retmax=10000&usehistory=y&term=";


In the above line, you can change the db= to any of the databases listed here. The retmax= value is the maximum number of results to return. The term= value is the collection of search terms you wish to use. In my case, I used an author's last name, initials, and our home institution, Vanderbilt. We then execute the query.


my $q = "Bush WS Vanderbilt";
my $esearch_result = get($esearch . $q);


So here, we use a two-step process --

1. First, the program submits a search to the system. When this happens, their web servers accept the search request, tag it with a WebEnv ID (which the web-dev geeks would call a session variable) and a query key, then conduct the search to find identifiers that match the request. Since we searched the pubmed database, the identifiers are all pubmed ids. This list of ids is stored on the NCBI servers for a brief time until it expires.

To do anything useful with our list of identifiers sitting on the NCBI servers out there, we need to pull the WebEnv ID and the QueryKey from the esearch result. The following code will yank these out of the XML stuff the web server sends back, and it also gives us a count of the records our query found.


$esearch_result =~
m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s;

my $Count = $1;
my $QueryKey = $2;
my $WebEnv = $3;


To see these, you can print them if you like:


print "Count = $Count; QueryKey = $QueryKey; WebEnv $WebEnv\n";



2. Next, our program must submit a second request to fish out the details for each of these identifiers. We do this using the eSummary utility, which returns a document summary for each id stored in our session, like so:


my $efetch = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?" . "db=pubmed&query_key=$QueryKey&WebEnv=$WebEnv";

my $efetch_result = get($efetch);


Now within perl, you can parse through this result to pull out any relevant information you might want. In case you don't know, perl is great for parsing text -- all the slashes and squigglies are for doing regular expression pattern matching. For my example, I was curious to see how many people I've been co-author with and on how many publications. I used the following to pull each author/pubmed id combination for a given search term.


@lines = split(/\n/, $efetch_result);
%citarray = ();
$opendoc = 0;
$id = 0;

foreach $line (@lines)
{
    # each record in the eSummary result is wrapped in <DocSum> ... </DocSum>
    if($line =~ /<DocSum>/)
    {
        $opendoc = 1;
    }

    if($line =~ /<\/DocSum>/)
    {
        $opendoc = 0;
    }

    # grab the pubmed id for the record we're currently inside
    if($opendoc == 1 && $line =~ /<Id>(\d+)<\/Id>/)
    {
        $id = $1;
    }

    # print a "pubmed id <tab> author" line for each author item
    if($opendoc == 1 && $line =~ /<Item Name="Author" Type="String">(.*)<\/Item>/)
    {
        print "$id\t$1\n";
    }
}



For the sake of brevity, I'll skip a protracted discussion of the parsing logic I used, but if there is interest, I can elaborate.

In case you are wondering, I loaded this into a database table, joined that table to itself matching on pubmed id, and imported this into Gephi to build our co-authorship network. This was a big hit at the faculty meeting!
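If you're curious what that pairing step looks like, here is a rough sketch in Perl rather than SQL. It assumes the pubmed id / author lines printed above were saved to a file (called author_pmid.txt here purely for illustration), and it writes one co-author pair per line -- an edge list Gephi can import directly.

use strict;
use warnings;

# read "pubmed_id <tab> author" lines and group authors by pubmed id
my %authors_by_pmid;
open(my $in, '<', 'author_pmid.txt') or die "Cannot open author_pmid.txt: $!";
while (<$in>) {
    chomp;
    my ($pmid, $author) = split /\t/;
    push @{ $authors_by_pmid{$pmid} }, $author;
}
close($in);

# every pair of authors on the same paper becomes an edge
my %seen;
for my $pmid (keys %authors_by_pmid) {
    my @au = @{ $authors_by_pmid{$pmid} };
    for my $i (0 .. $#au - 1) {
        for my $j ($i + 1 .. $#au) {
            my $edge = join("\t", sort $au[$i], $au[$j]);
            print "$edge\n" unless $seen{$edge}++;
        }
    }
}

Sorting each pair before printing keeps "A--B" and "B--A" from being counted as two different edges.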

Thursday, May 19, 2011

More Command-Line Text Munging Utilities

In a previous post I linked to gcol as a quick and intuitive alternative to awk. I just stumbled across yet another set of handy text file manipulation utilities from the creators of the BEAGLE software for GWAS data imputation and analysis. In addition to several command line utilities for converting and formatting BEAGLE files, there are several tools for doing basic text processing tasks on the command line:

  • changecolumn.jar - replace values in a column of an input file.
  • changeline.jar - replace values in a line of an input file.
  • cut.jar - extract columns from a file.
  • filtercolumns.jar - filters columns of input data according to the values in a line.
  • filterlines.jar - filters lines of input data according to the values in a column.
  • paste.jar - pastes together files that have shared initial columns followed by data columns.
  • transpose.jar - transposes rows and columns of an input file. 
Much of what these tools do can probably be emulated with some creative use of Unix commands and pipes -- for example, the short Perl sketch after this paragraph mimics transpose.jar. But since these are all Java archives, they should work on any platform, not just Unix/Linux. Hit the link at the end of this post to see the full list and documentation.
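As a quick illustration, here is a minimal, unofficial Perl stand-in for transpose.jar: it reads a whitespace-delimited table from standard input (or a file named on the command line) and writes the transposed, tab-delimited table to standard output. The tab-delimited output and the handling of ragged rows are my own choices; the real tool may differ in the details.

use strict;
use warnings;

# slurp the table, one arrayref of fields per row
my @rows;
while (<>) {
    chomp;
    push @rows, [ split ];
}

# find the widest row so ragged input doesn't drop cells
my $ncol = 0;
for my $r (@rows) {
    $ncol = @$r if @$r > $ncol;
}

# print column j of the input as row j of the output
for my $j (0 .. $ncol - 1) {
    print join("\t", map { defined $_->[$j] ? $_->[$j] : "" } @rows), "\n";
}

Save it as transpose.pl and run it with: perl transpose.pl input.txt > transposed.txt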

BEAGLE Utilities for text manipulation

Tuesday, May 17, 2011

How to Run a Bioinformatics Core

After evaluating an unnamed bioinformatics core facility, a group of bioinformaticians in Europe wrote up a short list of basic guidelines for organizing a bioinformatics core facility in large research institutes.

The full editorial recently appeared in Bioinformatics (PubMed); the full citation is given at the end of this post.

A few of the key points are:

  • Bioinformatics departments should separate service roles from research laboratories (maintains transparency and clarifies allocation of funds).
  • Separation of the core into "units" that offer clearly defined services, and installation of "users committees" for each unit that will assist in prioritizing projects needing bioinformatics support.
  • Bioinformatics cores should provide training to "bench" biologists, and should nominate a point person for outreach and education in publicly available bioinformatics tools and databases to encourage biologists to incorporate bioinformatics into their research workflow.


The authors tell us that at their respective institutions, they have approximately 1 full-time bioinformatician to support 100 scientists. This seems woefully imbalanced to me, and I wonder how this will change in the near future as more basic science research labs start incorporating large-scale -omics data into their research programs.

Kallioniemi, O., Wessels, L., & Valencia, A. (2011). On the organization of bioinformatics core services in biology-based research institutes. Bioinformatics, 27(10). doi: 10.1093/bioinformatics/btr125 (PubMed).

Monday, May 16, 2011

gcol == awk++

A while back Will showed you how to ditch Excel for awk, a handy Unix command line tool for extracting certain rows and columns from a text file. While I was browsing the documentation on the previously mentioned PLINK/SEQ library, I came across gcol, another utility for extracting columns from a tab-delimited text file. It can't do anything that awk can't, but it's easier and more intuitive to use for simple text munging tasks. Take a quick look at the gcol examples to see what I mean. And remember, if you need to convert a CSV to a tab-delimited file, just use sed: sed 's/,/\t/g' myfile.csv (that's GNU sed; other versions may need a literal tab in place of \t).
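And if neither gcol nor GNU sed is handy, Perl one-liners can do the same sorts of jobs; myfile.csv, myfile.txt, and the column numbers below are just placeholders, and note that the naive comma split won't handle quoted fields containing commas.

# convert a CSV to a tab-delimited file
perl -F, -lane 'print join "\t", @F' myfile.csv

# keep only the first and third columns of a tab-delimited file
perl -F'\t' -lane 'print join "\t", @F[0,2]' myfile.txt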

For a demonstration of several other "data science hand tools", check out this post at O'Reilly that covers other handy Unix utilities such as grep, colrm, awk, find, xargs, sort, uniq, and others.

gcol - get columns text utility

Monday, May 9, 2011

Accessing Databases From R

Jeffrey Breen put together a useful slideshow on accessing databases from R. I use RODBC every single day to access my own local MySQL server from R. I've had trouble with RMySQL, so I've always used RODBC instead after setting up my localhost MySQL server as a Windows data source. Once you get accustomed to accessing your data directly with SQL queries rather than dumping files you'll wonder why you waited so long. After you get a handle on using SQL you can take it one step further and use SQL queries on data.frames with the sqldf package for quick and easy data management and group summaries (which happens to be way faster than plyr, aggregate, doBy, and data.table for simple grouping tasks).

Also, if you're new to databases, check out Will's previous post on how to store and organize results from a genetic study using MySQL, or take a look at the w3schools SQL tutorial.

Greater Boston UseR Group Files (via @RevoDavid)

Wednesday, May 4, 2011

PLINK/SEQ for Analyzing Large-Scale Genome Sequencing Data

PLINK/SEQ is an open-source C/C++ library for analyzing large-scale genome sequencing data. The library can be accessed via the pseq command line tool, or through an R interface. The project is developed independently of PLINK, but its syntax will be familiar to PLINK users.

PLINK/SEQ boasts an impressive feature set for a project still in the beta testing phase. It supports several data types (multiallelic, phased, imputation probabilities, indels, and structural variants), and can handle datasets much larger than what can fit into memory. PLINK/SEQ also comes bundled with several reference databases of gene transcripts and sequence & variation projects, including dbSNP and 1000 Genomes Project data.

As with PLINK, the documentation is good, and there's a tutorial using 1000 Genomes Project data.

PLINK/SEQ - A library for the analysis of genetic variation data

Monday, May 2, 2011

Golden Helix: A Hitchhiker's Guide to Next Generation Sequencing

This is a few months old but I just got around to reading this series of blog posts on next-generation sequencing (NGS) by Gabe Rudy, Golden Helix's VP of product development. This series gives a seriously useful overview of NGS technology, then delves into the analysis of NGS data at each step, right down to a description of the most commonly used file formats and tools for the job. Check it out now if you haven't already.

Part One gives an overview of NGS trends and technologies. Part Two describes the steps and programs used in the bioinformatics of NGS data, broken down into three distinct analysis phases. Part Three gives more details on the needs and workflows of the final stage of the analysis of NGS data, or the "sense-making" phase. Finally, Part Four is a primer on the common formats (FASTQ, SAM/BAM, VCF) and tools (BWA, Bowtie, VCFtools, SAMtools, etc.) used in NGS bioinformatics and analysis.
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.