Getting Genetics Done: Using NCBI E-Utilities

NCBI has put a lot of effort into unifying their data access and retrieval system -- whether you are searching for a gene, protein, or publication, the results are returned in a similar fashion.

What most people don't realize is that this Entrez system is easily adapted for programmatic access through the Entrez E-utilities (NCBI's E-utilities documentation covers the details). For example, I recently wanted to build a co-authorship network for a few investigators in our center, and rather than searching for and exporting this information through the PubMed website, I used the E-utilities inside a Perl script. Python, Ruby, and other scripting languages work great too, but I have gotten used to Perl for tasks like this. If you don't have access to a Linux distribution with Perl installed, you can use Strawberry Perl on Windows.

To start, we need a web retrieval library called LWP::Simple. If for some reason you don't have this installed by default, you should be able to find it in a CPAN search.

use LWP::Simple;

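Then, I set up the base URL for the Entrez utilities. NCBI now serves the E-utilities over HTTPS, so LWP also needs the LWP::Protocol::https module installed.

my $esearch = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" . "db=pubmed&retmax=10000&usehistory=y&term=";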

In the above line, you can change the db= value to any of the Entrez databases (the EInfo utility lists them). The retmax= value is the maximum number of results to return, and usehistory=y asks NCBI to keep the result list on its history server. The term= value is the collection of search terms you wish to use. In my case, I used an author's last name, initials, and our home institution, Vanderbilt. We then execute the query.

my $q = "Bush WS Vanderbilt";
my $esearch_result = get($esearch . $q);
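Search terms often contain spaces, ampersands, or other characters that mean something inside a URL, so it can be safer to percent-encode the term before appending it. Here is a minimal sketch of that idea. The encode_term helper is a name I made up for illustration (it is not part of LWP), and it assumes the term is plain ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): percent-encode every character
# except the ones RFC 3986 calls "unreserved". Assumes ASCII input.
sub encode_term {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

my $esearch = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
            . "db=pubmed&retmax=10000&usehistory=y&term=";

my $q   = "Bush WS Vanderbilt";
my $url = $esearch . encode_term($q);

# Print the finished URL so it can be checked before sending it with get().
print "$url\n";
```

Printing the URL first is a cheap way to paste it into a browser and confirm the query does what you expect before scripting around it.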

So here, we use a two-step process --

1. First, the program submits a search to the system. When this happens, NCBI's web servers accept the search request, tag it with a WebEnv ID (which the web-dev geeks would call a session variable) and a query key, and then conduct the search to find identifiers that match the request. Since we searched the pubmed database, the identifiers are all PubMed IDs. This list of IDs is stored on the NCBI servers for a brief time until it expires.

To do anything useful with our list of identifiers sitting on the NCBI servers out there, we need to pull the WebEnv ID and the QueryKey from the esearch result. The following code will yank these out of the XML stuff the web server sends back, and it also gives us a count of the records our query found.

$esearch_result =~
m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s;

my $Count = $1;
my $QueryKey = $2;
my $WebEnv = $3;

To see these, you can print them if you like:

print "Count = $Count; QueryKey = $QueryKey; WebEnv = $WebEnv\n";
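One thing to watch: if the pattern does not match (say the request failed and get() returned nothing useful), $1, $2 and $3 will not hold what you expect. A minimal sketch of checking the match before using the captures follows. The XML here is a made-up stand-in for an ESearch reply, and the WebEnv string is a placeholder, not a real session ID.

```perl
use strict;
use warnings;

# Made-up stand-in for an ESearch reply; all values are placeholders.
my $esearch_result = <<'XML';
<eSearchResult>
<Count>42</Count>
<RetMax>42</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>EXAMPLE_WEBENV_STRING</WebEnv>
</eSearchResult>
XML

my ($Count, $QueryKey, $WebEnv);

# Only trust the captures when the whole pattern matched.
if ($esearch_result =~
    m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s)
{
    ($Count, $QueryKey, $WebEnv) = ($1, $2, $3);
    print "Count = $Count; QueryKey = $QueryKey; WebEnv = $WebEnv\n";
}
else
{
    die "Could not find Count, QueryKey and WebEnv in the ESearch reply\n";
}
```

In a real script, $esearch_result would come from get(), which returns undef when the request fails, so it is worth testing for that as well before matching.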

2. Next, our program must submit a second request to fish out the details for each of these identifiers. We do this using NCBI's ESummary engine, passing the same db= value we searched along with the query key and WebEnv ID:

my $efetch = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&query_key=$QueryKey&WebEnv=$WebEnv";

my $efetch_result = get($efetch);

Now within Perl, you can parse through this result to pull out any relevant information you might want. In case you don't know, Perl is great for parsing text -- all the slashes and squigglies are for doing regular expression pattern matching. For my example, I was curious to see how many people I've co-authored with, and on how many publications. I used the following to pull each author/PubMed ID combination for a given search term.

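my @lines = split(/\n/, $efetch_result);
my $opendoc = 0;   # 1 while we are inside a <DocSum> record
my $id = 0;        # PubMed ID of the current record

foreach my $line (@lines)
{
    # Each <DocSum> element describes one publication.
    if ($line =~ /<DocSum>/)
    {
        $opendoc = 1;
    }

    if ($line =~ /<\/DocSum>/)
    {
        $opendoc = 0;
    }

    # The record's PubMed ID sits in its <Id> element.
    if ($opendoc == 1 && $line =~ /<Id>(\d+)<\/Id>/)
    {
        $id = $1;
    }

    # Each author is an <Item Name="Author" ...> element; print one id/author pair per line.
    if ($opendoc == 1 && $line =~ /<Item Name="Author"[^>]*>(.*)<\/Item>/)
    {
        print "$id\t$1\n";
    }
}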
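If you would rather keep the authors grouped by publication than print them line by line, the same patterns can be applied one record at a time. This is only a sketch: the XML is a made-up stand-in for an ESummary reply with placeholder IDs and names, and %authors_by_id is a name I picked for illustration. A real XML parser would be more robust than regular expressions if the reply layout changes.

```perl
use strict;
use warnings;

# Made-up stand-in for an ESummary reply; IDs and names are placeholders.
my $efetch_result = <<'XML';
<eSummaryResult>
<DocSum>
    <Id>111</Id>
    <Item Name="AuthorList" Type="List">
        <Item Name="Author" Type="String">Doe J</Item>
        <Item Name="Author" Type="String">Roe R</Item>
    </Item>
    <Item Name="LastAuthor" Type="String">Roe R</Item>
</DocSum>
<DocSum>
    <Id>222</Id>
    <Item Name="AuthorList" Type="List">
        <Item Name="Author" Type="String">Doe J</Item>
    </Item>
    <Item Name="LastAuthor" Type="String">Doe J</Item>
</DocSum>
</eSummaryResult>
XML

my %authors_by_id;

# Walk the reply one <DocSum> record at a time.
foreach my $docsum ($efetch_result =~ m|<DocSum>(.*?)</DocSum>|gs)
{
    # Skip any record that has no numeric <Id>.
    my ($id) = $docsum =~ m|<Id>(\d+)</Id>| or next;

    # Name="Author" does not match Name="LastAuthor", so each name is collected once.
    my @authors = $docsum =~ m|<Item Name="Author"[^>]*>([^<]*)</Item>|g;
    $authors_by_id{$id} = \@authors;
}

foreach my $id (sort keys %authors_by_id)
{
    print "$id\t", join("; ", @{ $authors_by_id{$id} }), "\n";
}
```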

For the sake of brevity, I'll skip a protracted discussion of the parsing logic I used, but if there is interest, I can elaborate.

In case you are wondering, I loaded this into a database table, joined that table to itself matching on pubmed id, and imported this into Gephi to build our co-authorship network. This was a big hit at the faculty meeting!
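For anyone without a database handy, the self-join can be approximated in Perl itself. To be clear, this is not what I ran; it is a small sketch of the same idea, with placeholder rows standing in for the id/author pairs. It counts how many publications each pair of authors shares and prints a tab-separated edge list. The Source, Target and Weight headers are the column names I understand Gephi's spreadsheet import to look for, so check them against your Gephi version.

```perl
use strict;
use warnings;

# Placeholder id/author rows, in the same shape the parsing loop prints.
my @rows = (
    [111, "Doe J"], [111, "Roe R"], [111, "Poe E"],
    [222, "Doe J"], [222, "Roe R"],
);

# Group authors by PubMed ID (the "matching on pubmed id" part of the join).
my %authors_by_id;
push @{ $authors_by_id{ $_->[0] } }, $_->[1] for @rows;

# Count the publications shared by each unordered pair of authors.
my %shared;
foreach my $id (keys %authors_by_id)
{
    # Sorting the names gives every pair one canonical order.
    my @names = sort @{ $authors_by_id{$id} };
    for my $i (0 .. $#names - 1)
    {
        for my $j ($i + 1 .. $#names)
        {
            $shared{"$names[$i]\t$names[$j]"}++;
        }
    }
}

# Print a tab-separated edge list with a header row.
print "Source\tTarget\tWeight\n";
print "$_\t$shared{$_}\n" for sort keys %shared;
```

Sorting the names before pairing them means "Doe J / Roe R" and "Roe R / Doe J" land on the same key, which is the step a naive self-join tends to get wrong by counting every pair twice.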

3 comments:

Rajesh, September 25, 2011 at 10:12 AM
I like you blogs will,
Thanks for sharing information.

one correction in URL though fcgidb should be fcgi?db

Anonymous, December 16, 2011 at 10:28 PM
Wouldn't BioPerl be easier by taking some of the busy work out of this? (Haven't used it, but I like BioPython)

Jeff Barrett, March 22, 2012 at 1:58 PM
Thanks for the tip -- very nicely described, and a handy tool.



Friday, May 20, 2011
