I'm working on imputing GWAS data to the 1000 Genomes Project data using MaCH. The model estimation phase only needs ~200 individuals. Here's a one-line Unix command that pulls 200 samples at random from a
binary pedigree .fam file called myfamfile.fam:
for i in `cut -d ' ' -f 1-2 myfamfile.fam | sed s/\ /,/g`; do echo "$RANDOM $i"; done | sort | cut -d' ' -f 2| sed s/,/\ /g | head -n 200
Redirect this output to a file, and then run PLINK using the
--keep option with this new file.
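If GNU coreutils is available, the same sampling can be sketched with shuf, which avoids the comma round-trip and works whether the .fam file is space- or tab-delimited. This is only a sketch: sample_ids is a hypothetical helper name, and keep200.txt and myfamfile_200 are made-up output names. The PLINK call in the comments uses the standard --bfile, --keep, --make-bed and --out options.

```shell
# sample_ids FAMFILE N
# Print N randomly chosen "FID IID" pairs (columns 1 and 2 of a .fam file).
# Assumes GNU shuf is installed; the function name is hypothetical.
sample_ids() {
    awk '{ print $1, $2 }' "$1" | shuf -n "$2"
}

# Usage (output file names are made up for illustration):
#   sample_ids myfamfile.fam 200 > keep200.txt
#   plink --bfile myfamfile --keep keep200.txt --make-bed --out myfamfile_200
```

Because shuf samples without replacement, no individual is listed twice as long as the .fam file itself has no duplicate lines.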
Stephen,
I've imputed with the 1000 Genomes data for our PGRN project as well.
One thing you should DEFINITELY keep in mind is to NOT use the 'compact' option in MaCH. I tried it a couple of times out of curiosity, and after giving chromosome 22 step 1 more than 2 days it was apparent that it just wasn't going to finish.
I'm also a bit superstitious about running FROMDOS on everything as a precaution. A lot of time has definitely been wasted around here just due to Windows leaving its dirty fingerprints on stuff...
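For anyone without fromdos installed, stripping carriage returns with tr does a similar job on plain text files. A minimal sketch follows. It is a stand-in, not fromdos itself, and has_cr and strip_cr are hypothetical helper names.

```shell
# has_cr FILE: succeed if FILE contains at least one carriage return,
# which usually means Windows (CRLF) line endings.
has_cr() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# strip_cr FILE: write FILE to stdout with every carriage return removed.
# Text files only (.fam, .bim, .map, .ped); never run this on a binary .bed.
strip_cr() {
    tr -d '\r' < "$1"
}

# Usage (file names are made up for illustration):
#   has_cr mydata.fam && strip_cr mydata.fam > mydata.unix.fam
```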
IIRC, RAM consumption varied from 2-6 GB, so make sure you've got plenty of headroom there.
I didn't have good luck running these as jobs on our cluster, maybe a memory management thing on the nodes, so I just resorted to running everything locally on my workstation, leaving a few cores for normal daily activities.
Stranding was also a bit of a pain. It's nice that the Illumina chip only has about 3500 A/T, C/G SNPs. I decided to remove everything that had a MAF >40% in either 1000 Genomes or my data set, and spent the time doing a few PLINK --merge-mode 7 tests before I was confident that I'd gotten the stranding worked out.
You've saved me more than a couple of times with your posts here, so I definitely owe ya...
Thanks,
Mike Baldwin
Clarification
The MAF >40% SNPs that I removed were limited to just the A/T and C/G SNPs on the 610quad.
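The filter described in this thread (drop strand-ambiguous A/T and C/G SNPs whose MAF is above 40%) could be sketched roughly as below for a single data set. It assumes a PLINK 1.x --freq report, whose columns are CHR SNP A1 A2 MAF NCHROBS. ambiguous_high_maf is a hypothetical helper name, and mydata and drop.txt are made-up file names. The same list would have to be built for the 1000 Genomes reference to match the "in either" criterion.

```shell
# ambiguous_high_maf FRQFILE [CUTOFF]
# Print IDs of A/T or C/G SNPs whose MAF exceeds CUTOFF (default 0.4).
# Expects a PLINK .frq file: header line, then CHR SNP A1 A2 MAF NCHROBS.
ambiguous_high_maf() {
    awk -v cut="${2:-0.4}" '
        NR > 1 {
            pair = toupper($3 $4)
            ambiguous = (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC")
            # A MAF of "NA" evaluates to 0 here, so such SNPs are never printed.
            if (ambiguous && $5 + 0 > cut) print $2
        }' "$1"
}

# Usage (file names are made up for illustration):
#   plink --bfile mydata --freq --out mydata
#   ambiguous_high_maf mydata.frq > drop.txt
#   plink --bfile mydata --exclude drop.txt --make-bed --out mydata_clean
```

Ambiguous SNPs with a lower MAF are kept, since their strand can still be inferred by comparing allele frequencies against the reference.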