A developer in Will's lab, Justin Giles, wrote a Perl script which utilizes one of PLINK's advanced features, --genome-lists, which takes two files as arguments. You can read about this feature under the advanced hint section of the PLINK documentation of IBS clustering. Each of these files contain lists of family IDs and individual IDs of samples for whom you'd like to calculate IBD. In other words, you can break up the IBD calculations by groups of samples, instead of requiring a single process to do it all. The Perl script also takes the base filename of your binary pedfile and parses the .fam file to split up the list of individuals into small chunks. The size of these chunks are specified by the user. Assuming you have access to a high-performance computing cluster using Torque/PBS for scheduling and job submission, the script also writes out PBS files that can be used to submit each of the segments to a node on a supercomputing cluster (although this can easily be modified to fit other parallelization frameworks, so modify the script as necessary). The script also needs all the minor allele frequencies (which can easily be attained with the --freq option in PLINK).
One of the first things the script does is parses and splits up your fam file into chunks of N individuals (where N is set by the user - I used 100 and estimation only took ~15 minutes). This can be accomplished by a simple gawk command:
gawk '{print $1,$2}' data.fam | split -d -a 3 -l 100 - tmp.list
Then the script sets up some PBS scripts (like shell scripts) to run PLINK commands:
At which point you'll have the output files...
data.sub.1.1.genome
data.sub.1.2.genome
data.sub.1.3.genome
data.sub.1.4.genome
data.sub.2.2.genome
data.sub.2.3.genome
data.sub.2.4.genome
data.sub.3.3.genome
data.sub.3.4.genome
data.sub.4.4.genome
...that you can easily concatenate.
Here's the perl script below. To run it, give it the full path to your binary pedfile, the number of individuals in each "chunk" to infer IBD between, and the fully qualified path to your .frq file that you get from running plink --freq. If you're not using PBS to submit jobs, you'll have to modify the code a little bit in the main print statement in the middle. If you're not running this in your /scratch/username/ibd directory, you'll want to change that on line 57. You'll also want to change your email address on line 38 if you want to receive emails from your scheduler if you use PBS.
After you submit all these jobs, you can very easily run these commands to concatenate the results and clean up the temporary files:
cat data.sub.*genome > results.genome
rm tmp.list*
rm data.sub.*
Always nice to see some parallelization! Just a remark: I think the script gets it right, but in your example of output files you're missing data.sub.1.1.genome, 2.2, 3.3 and so on.
ReplyDeleteThere is actually some documentation on --genome-lists, hidden away in the IBS clustering section. Scroll down about one page from http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#cluster to the "advanced hint".
Robert - thanks for the heads up! I just updated a few parts above. I missed the --genome-lists documentation because I assumed it would be in the command line reference table, but I suppose they hide some of the "advanced" features from this list.
ReplyDelete