A developer in Will's lab, Justin Giles, wrote a Perl script which utilizes one of PLINK's advanced features, --genome-lists, which takes two files as arguments. You can read about this feature under the advanced hint section of the PLINK documentation of IBS clustering. Each of these files contain lists of family IDs and individual IDs of samples for whom you'd like to calculate IBD. In other words, you can break up the IBD calculations by groups of samples, instead of requiring a single process to do it all. The Perl script also takes the base filename of your binary pedfile and parses the .fam file to split up the list of individuals into small chunks. The size of these chunks are specified by the user. Assuming you have access to a high-performance computing cluster using Torque/PBS for scheduling and job submission, the script also writes out PBS files that can be used to submit each of the segments to a node on a supercomputing cluster (although this can easily be modified to fit other parallelization frameworks, so modify the script as necessary). The script also needs all the minor allele frequencies (which can easily be attained with the --freq option in PLINK).
One of the first things the script does is parses and splits up your fam file into chunks of N individuals (where N is set by the user - I used 100 and estimation only took ~15 minutes). This can be accomplished by a simple gawk command:
gawk '{print $1,$2}' data.fam | split -d -a 3 -l 100 - tmp.list
Then the script sets up some PBS scripts (like shell scripts) to run PLINK commands:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
plink --bfile basename --read-freq mymafs.frq --genome --min 0.05 --genome-lists tmp.list001 tmp.list001 --out data.sub.1.1 | |
plink --bfile basename --read-freq mymafs.frq --genome --min 0.05 --genome-lists tmp.list001 tmp.list002 --out data.sub.1.2 | |
plink --bfile basename --read-freq mymafs.frq --genome --min 0.05 --genome-lists tmp.list001 tmp.list003 --out data.sub.1.3 | |
plink --bfile basename --read-freq mymafs.frq --genome --min 0.05 --genome-lists tmp.list001 tmp.list004 --out data.sub.1.4 | |
plink --bfile basename --read-freq mymafs.frq --genome --min 0.05 --genome-lists tmp.list002 tmp.list002 --out data.sub.2.2 | |
plink --bfile basename --read-freq mymafs.frq --genome --min 0.05 --genome-lists tmp.list002 tmp.list003 --out data.sub.2.3 | |
plink --bfile basename --read-freq mymafs.frq --genome --min 0.05 --genome-lists tmp.list002 tmp.list004 --out data.sub.2.4 | |
plink --bfile basename --read-freq mymafs.frq --genome --min 0.05 --genome-lists tmp.list003 tmp.list003 --out data.sub.3.3 | |
plink --bfile basename --read-freq mymafs.frq --genome --min 0.05 --genome-lists tmp.list003 tmp.list004 --out data.sub.3.4 | |
plink --bfile basename --read-freq mymafs.frq --genome --min 0.05 --genome-lists tmp.list004 tmp.list004 --out data.sub.4.4 |
At which point you'll have the output files...
data.sub.1.1.genome
data.sub.1.2.genome
data.sub.1.3.genome
data.sub.1.4.genome
data.sub.2.2.genome
data.sub.2.3.genome
data.sub.2.4.genome
data.sub.3.3.genome
data.sub.3.4.genome
data.sub.4.4.genome
...that you can easily concatenate.
Here's the perl script below. To run it, give it the full path to your binary pedfile, the number of individuals in each "chunk" to infer IBD between, and the fully qualified path to your .frq file that you get from running plink --freq. If you're not using PBS to submit jobs, you'll have to modify the code a little bit in the main print statement in the middle. If you're not running this in your /scratch/username/ibd directory, you'll want to change that on line 57. You'll also want to change your email address on line 38 if you want to receive emails from your scheduler if you use PBS.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/perl | |
# | |
#Parses a .fam file and generates N number of sample list files depending on 2nd argument | |
#Generates PBS cluster submission files | |
#Justin Giles, Vanderbilt University, 2010 | |
# | |
use strict; | |
if(scalar(@ARGV) != 3){ | |
print "\nPLEASE NOTE: EDIT THIS SCRIPT IN THE PBS AREA TO MODIFY IT TO YOUR NEEDS\n"; | |
print "parallelize_plink_ibd.pl <bedfile_base> <chunksize> <PLINK_frq_file>\n"; | |
print "\nExample:\n"; | |
print "parallelize_plink_ibd.pl /scratch/turnersd/dataprefix 100 /scratch/turnersd/freq.frq\n"; | |
exit; | |
} | |
my $bfile_base=$ARGV[0]; | |
my $NUM_IN_FILE=$ARGV[1]; | |
my $freq_file = $ARGV[2]; | |
my $num_fam=`wc $bfile_base.fam | awk '{ print $1 }'`; | |
`gawk '{print \$1,\$2}' $bfile_base.fam | split -d -a 3 -l $NUM_IN_FILE - tmp.list`; | |
my $path = `pwd`; | |
chomp($path); | |
my $i=0; | |
my $a=`ls tmp* | wc | awk '{ print $1 }'`-1; | |
my $j=0; | |
while( $i <= $a ){ | |
while($j <= $a ){ | |
my $file = "submitme_" . $i . "_" . $j . ".pbs"; | |
open(OUT, ">$file") || die "Cannot open $file\n"; | |
print OUT "#!/bin/ksh\n | |
##EDIT THIS TO YOUR EMAIL## | |
#PBS -M username\@domain.com | |
#PBS -m bae | |
##CHANGE THIS IF YOU NEED MORE NODES FOR THIS PROCESS## | |
#PBS -l nodes=1 | |
##CHANGE WALLTIME AND CPUT TO MATCH THE LENGTH OF THIS PROCESS## | |
#PBS -l walltime=10:00:00\n | |
#PBS -l cput=10:00:00\n | |
##CHANGE THE PMEM AND MEM TO MATCH THE MEMORY REQUIREMENTS OF THIS PROCESS## | |
#PBS -l pmem=7000mb\n | |
#PBS -l mem=7000mb\n | |
#PBS -j oe\n | |
## CHANGE THE GROUP TO MATCH YOUR GROUP## | |
#PBS -W group_list=YOUR_GROUP_NAME | |
dir=/scratch/\$USER/ibd/\n | |
[ -d \$dir ] || mkdir \$dir | |
cd \$dir/\n\n"; | |
my $i_file = `printf "%03i\n" $i`; | |
my $j_file = `printf "%03i\n" $j`; | |
chomp($i_file); | |
chomp($j_file); | |
print OUT "plink --bfile $bfile_base \\ | |
--read-freq $freq_file \\ | |
--genome --min 0.05 \\ | |
--genome-lists $path/tmp.list$i_file \\ | |
$path/tmp.list$j_file \\ | |
--out data.sub.$i.$j\n"; | |
close(OUT); | |
$j=$j+1; | |
} | |
$i=$i+1; | |
$j=$i; | |
} |
After you submit all these jobs, you can very easily run these commands to concatenate the results and clean up the temporary files:
cat data.sub.*genome > results.genome
rm tmp.list*
rm data.sub.*
Always nice to see some parallelization! Just a remark: I think the script gets it right, but in your example of output files you're missing data.sub.1.1.genome, 2.2, 3.3 and so on.
ReplyDeleteThere is actually some documentation on --genome-lists, hidden away in the IBS clustering section. Scroll down about one page from http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#cluster to the "advanced hint".
Robert - thanks for the heads up! I just updated a few parts above. I missed the --genome-lists documentation because I assumed it would be in the command line reference table, but I suppose they hide some of the "advanced" features from this list.
ReplyDelete