As I mentioned in my recap of the ASHG 1000 genomes tutorial, I'm doing to be imputing some of my own data to 1000 genomes, and I'll try to post lessons learned along the way here under the 1000 genomes and imputation tags.
I'm starting from a binary pedigree format file (plink's bed/bim/fam format) and the first thing in the 1000 genomes imputation cookbook is to store your data in Merlin format, one per chromosome. Surprisingly there is no option in PLINK to split up a dataset into separate files by chromosome, so I wrote a Perl script to do it myself. The script takes two arguments: 1. the base filename of the binary pedfile (if your files are data.bed, data.bim, data.fam, the base filename will be "data" without the quotes); 2. a base filename for the output files to be split up by chromosome. You'll need PLINK installed for this to work, and I've only tested this on a Unix machine. You can copy the source code below:
Thanks, Will and Stephen! I have shared this with my colleagues who are in the midst of our GWAS analysis.
ReplyDeleteThis is very helpful Stephen. I am getting ready to work through the tutorial myself and my first question. now that I have used your script, is how do I go from plink bed files to merlin format? Are the plink ped format and merlin format compatible and if so, wouldn't it be possible to use the --recode option in your script as opposed to the make-bed option? Thanks for this and all of your great posts.
ReplyDeleteBrad, good question. It looks like ped and merlin formats are nearly the same. See the Merlin website. To get ped/map file from a binary ped file, use
ReplyDeleteplink --bfile basename --recode --out --outputname
The mapfile needs to be only three columns, so use
gawk '{print $1,$2,$4}' data.map > data.map3
There's one other file you'll need, called the data file, described on the Merlin page under "Describing the pedigree file". All you need here is one row per data item in the pedigree file, indicating the data type (encoded as M - marker, A - affection status, T - Quantitative Trait and C - Covariate) and providing a one-word label for each item. So if your pedigree data has a quantitative trait as the phenotype (or no phenotype), and the only thing else in the pedfile are SNPs, you can use this command to make a .dat file that you'll need for mach:
gawk 'BEGIN {print "T","pheno";}{print "M",$2}' data.map
That will make a new .dat file that looks like this:
T pheno
M rs3094315
M rs12562034
M rs3934834
M rs9442372
M rs3737728
....... and so on.
When I split, I also split at the centromeres. More friendly to the memory-hungry imputation jobs. Given that modern computers have so many cores, this means I can run more jobs concurrently.
ReplyDeleteJust wanted to note that if you are on a multiple-processor machine, you can easily parallelized this task by using the GNU Parallel command (which you can install from http://www.gnu.org/software/parallel/).
ReplyDeleteThe command would be something like:
seq 1 22 | parallel -j +0 plink --nonfounders --allow-no-sex --noweb --bfile inbase --chr {} --make-bed --out outbase
Note also that you can replicate the behavior of your perl script using a one-line bash command:
for chr in `seq 1 22`; do plink --nonfounders --allow-no-sex --noweb --bfile inbase --chr ${chr} --make-bed --out outbase; done
Just thought it might be helpful to point out other ways to get such things done -- it's always good to have more options available!
Do you need to remove the 6th column from the plink ped file? MACH's ped files only has 5 columns.
ReplyDeleteDear Will and Stephen, I am a Masters student from Canada. I am statistician looking to come up with new methods in the field of Genetics. I am taking a genetic course and for my final project I need to do some data analysis and am looking for a GWAS data. I kindly request you to suggest me some website from which I can download them or in some case if you have some genotype data I kindly request to mail me to pichiksc@gmail.com.
ReplyDeleteHi Friends,
ReplyDeleteI have a data i following format an looking for a script to convert it in to merlin format. Please help
SNP Chromosome Position ind1 ind2 ind3 ind4 ind5 ind6 ind7
rs10458597 1 564621 CC CC CC CC CC CC CC
rs12565286 1 721290 GG GG GG GG GG GG GG
rs12082473 1 740857 GG GG GG GG GG GG GG
rs3094315 1 752566 AA AA GA AA AA GA AA
rs2286139 1 761732 TT TT CT TT TT CT TT
rs11240776 1 765269 AA AA AA AA AA AA AA
rs2980319 1 777122 TT TT AT TT TT AT TT
rs2980300 1 785989 CC CC TC CC CC TC CC
rs2905036 1 792480 TT TT TT TT TT TT TT
rs11240777 1 798959 GG GG AG GG GG AG GG
rs4245756 1 799463 CC CC CC CC CC CC CC
rs3748597 1 888659 CT CC CC CC CT CC CT
rs2341354 1 918573 AG GG AG GG AG AG AG
rs4970403 1 926431 TT TT TT TT TT TT TT
rs2465126 1 947034 AA AA AA AA AA AA AA
thanks and best regards,
Bio
Hi,
ReplyDeleteI am graduate student analyzing Affymatrix array 6.0 data. I have done phasing using both fastphase and PHASE programme. Now, I want to use this phase output data to prepare sweep input file. Please, can you show me the way to do that.
Thanks
I am trying to meta-analysis of GWAS data. But I have a big troble in managing the raw GWAS data.
ReplyDeleteI can not convert the raw GWAS data which I downloaded from the site, http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap, to ped and map file for PLINK analysis.
The file name is like this, phs000202.pha002867.txt.
Could you tell me how to convert "phs000202.pha002867.txt" file to ped and map file (such as phs000202.pha002867.ped and phs000202.pha002867.map) for GWAS meta-analysis using PLINK.
merci Stephan! superthanks..
ReplyDeleteDoes this script work if your genome data set isn't human and your chromosome names are, for example "2R", "2L", etc.?
ReplyDeleteHi,
ReplyDeletethere is an undocumented option in PLINK that splits the data per chr and creates input files for imputation in BEAGLE.
Cheers,
Mitja
Hi,
ReplyDeletethere is an undocumented option in PLINK v1.8 --recode-beagle, that splits the data per chr and creates input files for imputation in BEAGLE.
Cheers,
Mitja