Getting Genetics Done: Split up a GWAS dataset (PED/BED) by Chromosome

Thursday, November 11, 2010

Split up a GWAS dataset (PED/BED) by Chromosome

As I mentioned in my recap of the ASHG 1000 genomes tutorial, I'm doing to be imputing some of my own data to 1000 genomes, and I'll try to post lessons learned along the way here under the 1000 genomes and imputation tags.

I'm starting from a binary pedigree format file (plink's bed/bim/fam format) and the first thing in the 1000 genomes imputation cookbook is to store your data in Merlin format, one per chromosome. Surprisingly there is no option in PLINK to split up a dataset into separate files by chromosome, so I wrote a Perl script to do it myself. The script takes two arguments: 1. the base filename of the binary pedfile (if your files are data.bed, data.bim, data.fam, the base filename will be "data" without the quotes); 2. a base filename for the output files to be split up by chromosome. You'll need PLINK installed for this to work, and I've only tested this on a Unix machine. You can copy the source code below:

14 comments:

Larry_ParnellNovember 12, 2010 at 8:59 AM
Thanks, Will and Stephen! I have shared this with my colleagues who are in the midst of our GWAS analysis.
ReplyDelete
Replies
UnknownNovember 12, 2010 at 9:26 AM
This is very helpful Stephen. I am getting ready to work through the tutorial myself and my first question. now that I have used your script, is how do I go from plink bed files to merlin format? Are the plink ped format and merlin format compatible and if so, wouldn't it be possible to use the --recode option in your script as opposed to the make-bed option? Thanks for this and all of your great posts.
ReplyDelete
Replies
Stephen TurnerNovember 12, 2010 at 12:45 PM
Brad, good question. It looks like ped and merlin formats are nearly the same. See the Merlin website. To get ped/map file from a binary ped file, use

plink --bfile basename --recode --out --outputname

The mapfile needs to be only three columns, so use

gawk '{print $1,$2,$4}' data.map > data.map3

There's one other file you'll need, called the data file, described on the Merlin page under "Describing the pedigree file". All you need here is one row per data item in the pedigree file, indicating the data type (encoded as M - marker, A - affection status, T - Quantitative Trait and C - Covariate) and providing a one-word label for each item. So if your pedigree data has a quantitative trait as the phenotype (or no phenotype), and the only thing else in the pedfile are SNPs, you can use this command to make a .dat file that you'll need for mach:

gawk 'BEGIN {print "T","pheno";}{print "M",$2}' data.map

That will make a new .dat file that looks like this:
T pheno
M rs3094315
M rs12562034
M rs3934834
M rs9442372
M rs3737728
....... and so on.
ReplyDelete
Replies
UnknownNovember 15, 2010 at 4:31 AM
When I split, I also split at the centromeres. More friendly to the memory-hungry imputation jobs. Given that modern computers have so many cores, this means I can run more jobs concurrently.
ReplyDelete
Replies
joshuluxJanuary 5, 2011 at 10:52 PM
Just wanted to note that if you are on a multiple-processor machine, you can easily parallelized this task by using the GNU Parallel command (which you can install from http://www.gnu.org/software/parallel/).

The command would be something like:
seq 1 22 | parallel -j +0 plink --nonfounders --allow-no-sex --noweb --bfile inbase --chr {} --make-bed --out outbase

Note also that you can replicate the behavior of your perl script using a one-line bash command:
for chr in `seq 1 22`; do plink --nonfounders --allow-no-sex --noweb --bfile inbase --chr ${chr} --make-bed --out outbase; done

Just thought it might be helpful to point out other ways to get such things done -- it's always good to have more options available!
ReplyDelete
Replies
AnonymousMarch 21, 2011 at 2:34 AM
Do you need to remove the 6th column from the plink ped file? MACH's ped files only has 5 columns.
ReplyDelete
Replies
Statistical GeneticsMarch 28, 2011 at 2:51 AM
Dear Will and Stephen, I am a Masters student from Canada. I am statistician looking to come up with new methods in the field of Genetics. I am taking a genetic course and for my final project I need to do some data analysis and am looking for a GWAS data. I kindly request you to suggest me some website from which I can download them or in some case if you have some genotype data I kindly request to mail me to pichiksc@gmail.com.
ReplyDelete
Replies
AnonymousMay 2, 2011 at 5:07 PM
Hi Friends,

I have a data i following format an looking for a script to convert it in to merlin format. Please help

SNP Chromosome Position ind1 ind2 ind3 ind4 ind5 ind6 ind7
rs10458597 1 564621 CC CC CC CC CC CC CC
rs12565286 1 721290 GG GG GG GG GG GG GG
rs12082473 1 740857 GG GG GG GG GG GG GG
rs3094315 1 752566 AA AA GA AA AA GA AA
rs2286139 1 761732 TT TT CT TT TT CT TT
rs11240776 1 765269 AA AA AA AA AA AA AA
rs2980319 1 777122 TT TT AT TT TT AT TT
rs2980300 1 785989 CC CC TC CC CC TC CC
rs2905036 1 792480 TT TT TT TT TT TT TT
rs11240777 1 798959 GG GG AG GG GG AG GG
rs4245756 1 799463 CC CC CC CC CC CC CC
rs3748597 1 888659 CT CC CC CC CT CC CT
rs2341354 1 918573 AG GG AG GG AG AG AG
rs4970403 1 926431 TT TT TT TT TT TT TT
rs2465126 1 947034 AA AA AA AA AA AA AA

thanks and best regards,

Bio
ReplyDelete
Replies
AdityaJune 1, 2011 at 6:25 AM
Hi,
I am graduate student analyzing Affymatrix array 6.0 data. I have done phasing using both fastphase and PHASE programme. Now, I want to use this phase output data to prepare sweep input file. Please, can you show me the way to do that.
Thanks
ReplyDelete
Replies
AnonymousJuly 17, 2011 at 8:18 PM
I am trying to meta-analysis of GWAS data. But I have a big troble in managing the raw GWAS data.
I can not convert the raw GWAS data which I downloaded from the site, http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap, to ped and map file for PLINK analysis.
The file name is like this, phs000202.pha002867.txt.

Could you tell me how to convert "phs000202.pha002867.txt" file to ped and map file (such as phs000202.pha002867.ped and phs000202.pha002867.map) for GWAS meta-analysis using PLINK.
ReplyDelete
Replies
Maki ViNovember 13, 2012 at 2:12 PM
merci Stephan! superthanks..
ReplyDelete
Replies
D174November 4, 2013 at 4:13 AM
Does this script work if your genome data set isn't human and your chromosome names are, for example "2R", "2L", etc.?
ReplyDelete
Replies
UnknownJuly 30, 2014 at 10:27 AM
Hi,

there is an undocumented option in PLINK that splits the data per chr and creates input files for imputation in BEAGLE.

Cheers,

Mitja
ReplyDelete
Replies
UnknownJuly 30, 2014 at 10:29 AM
Hi,

there is an undocumented option in PLINK v1.8 --recode-beagle, that splits the data per chr and creates input files for imputation in BEAGLE.

Cheers,

Mitja
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.

This blog has moved!

Thursday, November 11, 2010

Split up a GWAS dataset (PED/BED) by Chromosome

14 comments: