
Monday, January 13, 2014

How To Install BioPerl Without Root Privileges

I've seen this question asked and partially answered all around the web. As with anything related to Perl, I'm sure there is more than one way to do it. Here's how I do it with Perl 5.10.1 on CentOS 6.4.

First, install local::lib with bootstrapping method as described here.
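The commands look something like this (a sketch of the bootstrap method; the version and author path in the wget URL are illustrative, so grab the current release from CPAN):

# Download, unpack, and bootstrap local::lib into ~/perl5
# (version/author path shown here is illustrative -- check CPAN for the current one)
wget https://cpan.metacpan.org/authors/id/H/HA/HAARG/local-lib-2.000024.tar.gz
tar xzf local-lib-2.000024.tar.gz
cd local-lib-2.000024
perl Makefile.PL --bootstrap
make test && make install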

Next, put this in your .bashrc so that it's executed every time you log in:
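The line in question is essentially the one the local::lib documentation suggests; it sets PERL5LIB, PATH, and the MakeMaker/Module::Build options so everything installs to and loads from ~/perl5:

# Pick up modules installed under ~/perl5 (from the local::lib docs)
eval "$(perl -I$HOME/perl5/lib/perl5 -Mlocal::lib)"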

Log out then log back in, then download and install BioPerl, answering "yes" to any question asking you to download and install dependencies when necessary:
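One way to do this from the command line looks like the following (the distribution name was current at the time; check CPAN for the latest release):

# Install BioPerl and its dependencies into ~/perl5 via local::lib
perl -MCPAN -e 'install "CJFIELDS/BioPerl-1.6.923.tar.gz"'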

Friday, May 20, 2011

Using NCBI E-Utilities

NCBI has put a lot of effort into unifying their data access and retrieval system -- whether you are searching for a gene, protein, or publication, the results are returned in a similar fashion.

What most people don't realize is that this Entrez system is easily adapted for programmatic access (there are lots of details here). For example, recently I was interested in building a co-authorship network for a few investigators in our center, and rather than searching for and exporting this information using the PubMed website, I used the Entrez E-utilities inside a Perl script. Python, Ruby, and other scripting languages work great too, but I have gotten used to Perl for tasks like this. If you don't have access to a Linux distribution with Perl installed, you can use Strawberry Perl on Windows.

To start, we need a web retrieval library called LWP::Simple. If for some reason you don't have this installed by default, you should be able to find it in a CPAN search.


use LWP::Simple;


Then, I set up the base URL for the Entrez utilities.


my $esearch = "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" . "db=pubmed&retmax=10000&usehistory=y&term=";


In the above line, you can change db= to any of the databases listed here. The retmax= value is the maximum number of results to return, and the term= value is the collection of search terms you wish to use. In my case, I used an author's last name, initials, and our home institution, Vanderbilt. We then execute the query.


my $q = "Bush WS Vanderbilt";
my $esearch_result = get($esearch . $q);


So here, we use a two-step process --

1. First, the program submits a search to the system. When this happens, their web servers accept the search request, tag it with a WebEnv ID (which the web-dev geeks would call a session variable) and a query key, then conduct the search to find identifiers that match the search request. Since we searched the PubMed database, the identifiers are all PubMed IDs. This list of IDs is stored on the NCBI servers for a brief time until it expires.

To do anything useful with our list of identifiers sitting on the NCBI servers out there, we need to pull the WebEnv ID and the QueryKey from the esearch result. The following code will yank these out of the XML stuff the web server sends back, and it also gives us a count of the records our query found.


$esearch_result =~
m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s;

my $Count = $1;
my $QueryKey = $2;
my $WebEnv = $3;


To see these, you can print them if you like:


print "Count = $Count; QueryKey = $QueryKey; WebEnv $WebEnv\n";



2. Next, our program must submit a fetch request to fish out the details for each of these identifiers. We do this using their eSummary engine, which works like so:


my $efetch = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&query_key=$QueryKey&WebEnv=$WebEnv";

my $efetch_result = get($efetch);


Now within perl, you can parse through this result to pull out any relevant information you might want. In case you don't know, perl is great for parsing text -- all the slashes and squigglies are for doing regular expression pattern matching. For my example, I was curious to see how many people I've been co-author with and on how many publications. I used the following to pull each author/pubmed id combination for a given search term.


@lines = split(/\n/,$efetch_result);
%citarray = ();
$opendoc = 0;
$id = 0;

foreach $line (@lines)
{
if($line =~ /<DocSum>/)
{
$opendoc = 1;
}

if($line =~ /<\/DocSum>/)
{
$opendoc = 0;
}

if($opendoc == 1 && $line =~ /<Id>(\d+)<\/Id>/)
{
$id = $1;
}

if($opendoc == 1 && $line =~ /<Item Name="Author" Type="String">(.*)<\/Item>/)
{
print "$id\t$1\n";
}

}



For the sake of brevity, I'll skip a protracted discussion of the parsing logic I used, but if there is interest, I can elaborate.

In case you are wondering, I loaded this into a database table, joined that table to itself matching on pubmed id, and imported this into Gephi to build our co-authorship network. This was a big hit at the faculty meeting!
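If you'd rather skip the database step, the same self-join can be done in a few lines of Perl. Here's a sketch, assuming the id/author lines printed above were saved to a file called author_pmid.txt (a made-up filename):

#!/usr/bin/perl
# Sketch: turn "pmid<TAB>author" lines into co-author pairs (network edges).
# author_pmid.txt is a hypothetical name for the output captured above.
use strict;
use warnings;

my %authors_by_pmid;
open(my $in, '<', 'author_pmid.txt') or die "Can't open input: $!";
while (<$in>) {
    chomp;
    my ($pmid, $author) = split /\t/;
    push @{ $authors_by_pmid{$pmid} }, $author;
}
close($in);

# Every pair of authors on the same paper is one co-authorship edge
for my $pmid (keys %authors_by_pmid) {
    my @a = @{ $authors_by_pmid{$pmid} };
    for my $i (0 .. $#a) {
        for my $j ($i + 1 .. $#a) {
            print "$a[$i]\t$a[$j]\t$pmid\n";
        }
    }
}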

Tuesday, November 16, 2010

Parallelize IBD estimation with PLINK

Obtaining the probability that zero, one, or two alleles are shared identical by descent (IBD) is useful for many reasons in a GWAS analysis. A while back I showed you how to visualize sample relatedness using R and ggplot2, which requires IBD estimates. PLINK's --genome option uses IBS and allele frequencies to infer IBD. While a recent article in Nature Reviews Genetics on IBD and IBS analysis demonstrates potentially superior approaches, PLINK's approach is definitely the easiest because of PLINK's superior data management capabilities. The problem with IBD inference is that while computation time is linear with respect to the number of SNPs, it's quadratic (read: SLOW) with respect to the number of samples. With GWAS data on 10,000 samples, (10,000 choose 2) = 49,995,000 pairwise IBD estimates are needed. This can take quite some time to calculate on a single processor.

A developer in Will's lab, Justin Giles, wrote a Perl script which utilizes one of PLINK's advanced features, --genome-lists, which takes two files as arguments. You can read about this feature under the advanced hints section of the PLINK documentation on IBS clustering. Each of these files contains a list of family IDs and individual IDs of samples for whom you'd like to calculate IBD. In other words, you can break up the IBD calculations by groups of samples, instead of requiring a single process to do it all. The Perl script also takes the base filename of your binary pedfile and parses the .fam file to split up the list of individuals into small chunks. The size of these chunks is specified by the user. Assuming you have access to a high-performance computing cluster using Torque/PBS for scheduling and job submission, the script also writes out PBS files that can be used to submit each of the segments to a node on the cluster (although this can easily be modified to fit other parallelization frameworks, so modify the script as necessary). The script also needs all the minor allele frequencies (which can easily be obtained with the --freq option in PLINK).

One of the first things the script does is parse and split up your .fam file into chunks of N individuals (where N is set by the user - I used 100, and estimation only took ~15 minutes). This can be accomplished with a simple gawk command:

gawk '{print $1,$2}' data.fam | split -d -a 3 -l 100 - tmp.list

Then the script sets up some PBS scripts (like shell scripts) to run PLINK commands:
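Each generated PBS file pairs two of the chunks via --genome-lists. One might look roughly like this (the scheduler directives, walltime, and filenames are illustrative):

#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=2:00:00
cd /scratch/username/ibd
plink --bfile data --read-freq data.frq --genome \
  --genome-lists tmp.list000 tmp.list001 --noweb --out data.sub.1.2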

At which point you'll have the output files...

data.sub.1.1.genome
data.sub.1.2.genome
data.sub.1.3.genome
data.sub.1.4.genome
data.sub.2.2.genome
data.sub.2.3.genome
data.sub.2.4.genome
data.sub.3.3.genome
data.sub.3.4.genome
data.sub.4.4.genome


...that you can easily concatenate.

Here's the perl script below. To run it, give it the full path to your binary pedfile, the number of individuals in each "chunk" to infer IBD between, and the fully qualified path to your .frq file that you get from running plink --freq. If you're not using PBS to submit jobs, you'll have to modify the code a little bit in the main print statement in the middle. If you're not running this in your /scratch/username/ibd directory, you'll want to change that on line 57. You'll also want to change your email address on line 38 if you want to receive emails from your scheduler if you use PBS.
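The core of the approach is small enough to sketch. Note this is a minimal reimplementation, not Justin's original script (so the line numbers mentioned above won't line up); it assumes the tmp.list* chunks produced by the gawk command above:

#!/usr/bin/perl
# Sketch: write and submit one PBS job per pair of sample chunks
use strict;
use warnings;

my ($bfile, $frq) = @ARGV;            # binary pedfile prefix and .frq file
my @chunks = sort glob("tmp.list*");  # chunks from the gawk/split step above

for my $i (0 .. $#chunks) {
    for my $j ($i .. $#chunks) {      # upper triangle, including self-pairs
        my $out = sprintf("data.sub.%d.%d", $i + 1, $j + 1);
        open(my $pbs, '>', "$out.pbs") or die "Can't write PBS file: $!";
        print $pbs <<"EOF";
#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=2:00:00
cd \$PBS_O_WORKDIR
plink --bfile $bfile --read-freq $frq --genome \\
  --genome-lists $chunks[$i] $chunks[$j] --noweb --out $out
EOF
        close($pbs);
        system("qsub", "$out.pbs");
    }
}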

After you submit all these jobs, you can very easily run these commands to concatenate the results and clean up the temporary files:

cat data.sub.*genome > results.genome
rm tmp.list*
rm data.sub.*

Tuesday, August 3, 2010

Convert PLINK output to tab or comma delimited CSV using Perl

Last week Will showed you a bash script version of a sed command covered here a while back that would convert PLINK output from the default variable space-delimited format to a more database-loading-friendly tab or comma delimited file. A commenter asked how to do this on Windows, so I'll share the way I do it using a Perl script, which you can use on Windows after installing ActivePerl. First copy the code below and save the file as cleanplink.pl somewhere in your path.

#!/usr/bin/perl

# cleanplink.pl
# (c) Stephen D. Turner 2010 http://www.stephenturner.us/
# This is free open-source software.
# See http://gettinggeneticsdone.blogspot.com/p/copyright.html

my $help = "\nUsage: $0 <input whitespace file> <tab or comma>\n\n";
die $help if @ARGV<2;

$delimiter=pop(@ARGV);
die $help unless ($delimiter=~/tab/i || $delimiter=~/comma/i);
@inputfiles=@ARGV;

if ($delimiter =~ /comma/i) {
    foreach (@inputfiles) {

        open (IN,"<$_");
        open (OUT,">$_.csv");
        while (<IN>) {
            chomp;
            $_ =~ s/^\s+//;  #Trim whitespace at beginning
            $_ =~ s/\s+$//;  #Trim whitespace at end
            $_ =~ s/\s+/,/g; #Remaining whitespace into commas
            #$_ =~ s/NA/-9/g;#If you want to recode NA as -9
            print OUT "$_\n";
        }
    }
} elsif ($delimiter =~ /tab/i) {
    foreach (@inputfiles) {
        open (IN,"<$_");
        open (OUT,">$_.tab");
        while (<IN>) {
            chomp;
            $_ =~ s/^\s+//;  #Trim whitespace at beginning
            $_ =~ s/\s+$//;  #Trim whitespace at end
            $_ =~ s/\s+/\t/g;#Remaining whitespace into tabs
            #$_ =~ s/NA/-9/g;#If you want to recode NA as -9
            print OUT "$_\n";
        }
    }
} else {
    die $help;
}
Run the program with the PLINK output file(s) you want to convert as the first argument(s), and either "comma" or "tab" (without the quotes) as the last argument. It'll create another file in the current directory ending with either .csv or .tab. Look below to see cleanplink.pl in action.

turnersd@provolone:~/tmp$ ls
plink.qassoc
turnersd@provolone:~/tmp$ cat plink.qassoc
 CHR         SNP         BP    NMISS       BETA         SE         R2        T            P
   1   rs3094315     742429     3643    -0.2461     0.2703  0.0002275  -0.9102       0.3628
   1  rs12562034     758311     3644    -0.1806     0.3315  8.149e-05  -0.5448       0.5859
   1   rs3934834     995669     3641    0.04591     0.2822  7.271e-06   0.1627       0.8708
   1   rs9442372    1008567     3645     0.1032     0.2063  6.868e-05   0.5002       0.6169
   1   rs3737728    1011278     3644     0.1496     0.2268  0.0001195   0.6598       0.5094
   1   rs6687776    1020428     3645    -0.5378     0.2818   0.000999   -1.909      0.05639
   1   rs9651273    1021403     3643     0.2002     0.2264  0.0002149   0.8847       0.3764
   1   rs4970405    1038818     3645    -0.4994     0.3404  0.0005903   -1.467       0.1425
   1  rs12726255    1039813     3645    -0.4515     0.2956  0.0006398   -1.527       0.1268
turnersd@provolone:~/tmp$ cleanplink.pl plink.qassoc comma
turnersd@provolone:~/tmp$ ls
plink.qassoc  plink.qassoc.csv
turnersd@provolone:~/tmp$ cat plink.qassoc.csv
CHR,SNP,BP,NMISS,BETA,SE,R2,T,P
1,rs3094315,742429,3643,-0.2461,0.2703,0.0002275,-0.9102,0.3628
1,rs12562034,758311,3644,-0.1806,0.3315,8.149e-05,-0.5448,0.5859
1,rs3934834,995669,3641,0.04591,0.2822,7.271e-06,0.1627,0.8708
1,rs9442372,1008567,3645,0.1032,0.2063,6.868e-05,0.5002,0.6169
1,rs3737728,1011278,3644,0.1496,0.2268,0.0001195,0.6598,0.5094
1,rs6687776,1020428,3645,-0.5378,0.2818,0.000999,-1.909,0.05639
1,rs9651273,1021403,3643,0.2002,0.2264,0.0002149,0.8847,0.3764
1,rs4970405,1038818,3645,-0.4994,0.3404,0.0005903,-1.467,0.1425
1,rs12726255,1039813,3645,-0.4515,0.2956,0.0006398,-1.527,0.1268
turnersd@provolone:~/tmp$ cleanplink.pl plink.qassoc tab
turnersd@provolone:~/tmp$ ls
plink.qassoc  plink.qassoc.csv  plink.qassoc.tab
turnersd@provolone:~/tmp$ cat plink.qassoc.tab
CHR     SNP     BP      NMISS   BETA    SE      R2      T       P
1       rs3094315       742429  3643    -0.2461 0.2703  0.0002275       -0.9102 0.3628
1       rs12562034      758311  3644    -0.1806 0.3315  8.149e-05       -0.5448 0.5859
1       rs3934834       995669  3641    0.04591 0.2822  7.271e-06       0.1627  0.8708
1       rs9442372       1008567 3645    0.1032  0.2063  6.868e-05       0.5002  0.6169
1       rs3737728       1011278 3644    0.1496  0.2268  0.0001195       0.6598  0.5094
1       rs6687776       1020428 3645    -0.5378 0.2818  0.000999        -1.909  0.05639
1       rs9651273       1021403 3643    0.2002  0.2264  0.0002149       0.8847  0.3764
1       rs4970405       1038818 3645    -0.4994 0.3404  0.0005903       -1.467  0.1425
1       rs12726255      1039813 3645    -0.4515 0.2956  0.0006398       -1.527  0.1268

Wednesday, April 21, 2010

Unix and Perl for Biologists

This looks like a must-read for anyone starting out in computational biology without extensive experience at the command line.  The 135-page document linked at the bottom of the Google Group page looks like an excellent primer with lots of examples that could probably be completed in a day or two, and provides a great start for working in a Linux/Unix environment and programming with Perl. This started out as a graduate student course at UC Davis, and is now freely available for anyone who wants to learn Unix and Perl. Also, don't forget about the printable linux command line cheat sheet I posted here long ago.

Google Groups: Unix and Perl for Biologists

Tuesday, September 8, 2009

Get the full path to a file in Linux / Unix

In the last post I showed you how to point to a file in Windows and get the full path copied to your clipboard.  I wanted to come up with something similar for a Linux environment.  This is helpful on Vampire/ACCRE because you have to fully qualify the path to every file you use when you submit jobs with PBS scripts. So I wrote a little Perl script:

#!/usr/bin/perl
chomp($pwd=`pwd`);
print "$pwd\n" if @ARGV==0;
foreach (@ARGV) {print "$pwd/$_\n";}

You can copy this from me; just put it in your bin directory, like this:

cp /home/turnersd/bin/path ~/bin

Make it executable, like this:

chmod +x ~/bin/path

Here it is in action. Let's say I wanted to print out the full path to all the .txt files in the current directory.  Call the program with the files you want the full paths to as arguments:


[turnersd@vmps21 replicated]$ ls
parseitnow.pbs
parsing_program.pl
replic.txt
tophits.txt
 
[turnersd@vmps21 replicated]$ path *.txt
/projects/HDL/epistasis/replicated/replic.txt
/projects/HDL/epistasis/replicated/tophits.txt


Sure, it's only a little bit quicker than typing pwd, copying that, then spelling out the filenames.  But if you have long filenames or lots of filenames you want to copy, this should get things done faster.  Enjoy.

Tuesday, March 24, 2009

Write Your First Perl Script

And it will do way more than display "Hello, world!" to the screen. An anonymous commenter on one of our Linux posts recently asked how to write scripts that will automate the same analysis on multiple files. While there are potentially several ways to do this, perl will almost always get the job done. Here I'll pose a situation you may run into where perl may help, and we'll break down what this simple perl script does, line by line.

Let's say you want to run something like plink or MDR, or any other program that you have to call at the command line with a configuration or batch file that contains the details of the analysis. For this simple case, let's pretend we're running MDR on a dataset we've stratified 3 ways by race. We have three MDR config files: afr-am.cfg, caucasian.cfg, and hispanic.cfg that will each run MDR on the respective dataset. Now, we could manually run MDR three separate times, and in fact, in this scenario that may be easier. But when you need to run dozens, or maybe hundreds of analyses, you'll need a way to automate things. Check out this perl script, and I'll explain what it's doing below. Fire up something like nano or pico, copy/paste this, and save the file as "runMDR.pl".

foreach $current_cfg (@ARGV)
{
# This will run sMDR
`./sMDR $current_cfg`;
}
# Hooray, we're done!

Now, if you call this script from the command line like this, giving the config files you want to run as arguments to the script, it will run sMDR on all three datasets, one after the other:

> perl runMDR.pl afr-am.cfg caucasian.cfg hispanic.cfg

You could also use the asterisk to pass everything that ends with ".cfg" as an argument to the script:

> perl runMDR.pl *.cfg

Okay, let's break this down, step by step.
  1. First, some syntax. Perl ignores everything on a line after the # sign, so this way you can comment your code, so you can remember what it does later. The little ` things on the 4th line are backticks. Those are usually above your tab key on your keyboard. And that semicolon is important.
  2. @ARGV is an array that contains the arguments that you pass to the program (the MDR config files here), and $current_cfg is a variable that takes on each element in @ARGV, one at a time.
  3. Each time $current_cfg assumes a new identity, perl will execute the code between the curly braces. So the first time, perl will execute `./sMDR afr-am.cfg`. The stuff between the backticks is executed exactly as if you were typing it into the shell yourself. Here, I'm assuming you have sMDR and afr-am.cfg in the current directory.
  4. Once perl executes the block of code between the braces for each element of @ARGV, it quits, and now you'll have results for all three analyses.
A few final thoughts... If the stuff you're automating is going to take a while to complete, you may consider checking out Greg's previous tutorial on screen. Next, if whatever program you're running over and over again displays output to the screen, you'll have to add an extra line to see that output yourself, or write that output to a file. Also, don't forget your comments! Perl can be quick to write but difficult to understand later on, so comment your scripts well. Finally, if you need more help and you can't find it here or here, many of the folks on this hall have used perl for some time, so ask around!
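On that last point about output: a small change to the loop captures whatever sMDR prints so you can both see it and keep it. This sticks with the same style as the script above, and runMDR.log is a made-up filename:

foreach $current_cfg (@ARGV)
{
# Run sMDR, but capture its screen output instead of discarding it
$output = `./sMDR $current_cfg`;
# Show the output and also append it to a log file
print $output;
open(LOG, ">>runMDR.log") or die "Can't open log: $!";
print LOG $output;
close(LOG);
}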
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.