Tuesday, November 23, 2010

Randomly Select Subsets of Individuals from a Binary Pedigree .fam File

I'm working on imputing GWAS data to the 1000 Genomes Project data using MaCH. The model estimation phase only needs about 200 individuals. Here's a one-line Unix command that pulls out 200 samples at random from a binary pedigree .fam file called myfamfile.fam:

for i in $(cut -d ' ' -f 1-2 myfamfile.fam | sed 's/ /,/g'); do echo "$RANDOM $i"; done | sort | cut -d ' ' -f 2 | sed 's/,/ /g' | head -n 200

Redirect this output to a file, and then run PLINK using the --keep option with this new file.
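
The one-liner above assumes a space-delimited .fam file and IDs with no commas in them. If that doesn't hold for your data, here's a rough sketch of the same shuffle-then-head idea using awk, which splits on any whitespace. The function name sample_fam, the toy file toy.fam, and the output file keep200.txt are all made up for illustration; none of them come from PLINK or MaCH.

```shell
#!/usr/bin/env bash
# Sketch only: sample_fam is a hypothetical helper, not part of PLINK or MaCH.
# Usage: sample_fam <famfile> <n>  -> prints n random "FID IID" pairs
sample_fam() {
    local fam="$1" n="$2"
    # awk splits on spaces or tabs. Tag each FID/IID pair with a random key,
    # sort numerically on that key, drop the key, and keep the first n lines.
    awk 'BEGIN { srand() } { printf "%.10f %s %s\n", rand(), $1, $2 }' "$fam" \
        | LC_ALL=C sort -k1,1n \
        | cut -d ' ' -f 2-3 \
        | head -n "$n"
}

# Toy data (assumption): a fake 1000-person .fam file with six columns per line.
tmpdir=$(mktemp -d)
for i in $(seq 1 1000); do
    echo "FAM$i IND$i 0 0 1 -9"
done > "$tmpdir/toy.fam"

# Pull 200 people at random and save them in the two-column format --keep expects.
sample_fam "$tmpdir/toy.fam" 200 > "$tmpdir/keep200.txt"
wc -l < "$tmpdir/keep200.txt"
```

On real data you'd run something like sample_fam myfamfile.fam 200 > keep200.txt, then plink --bfile mydata --keep keep200.txt --make-bed --out mydata_200, where mydata is just a placeholder for your own fileset prefix. Note that awk's srand() seeds from the clock, so two runs within the same second can give the same sample.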

2 comments:

  1. Stephen,
    I've imputed with the 1000 Genomes data for our PGRN project as well.
    One thing you should DEFINITELY keep in mind is NOT to use the 'compact' option in MaCH. I tried it a couple of times out of curiosity, and after giving chromosome 22 step 1 more than 2 days it was apparent that it just wasn't going to happen.
    I'm also a bit superstitious about running FROMDOS on everything as a precaution. A lot of time has definitely been wasted around here just due to Windows leaving its dirty fingerprints on stuff...

    IIRC, RAM consumption varied from 2 to 6 GB, so make sure you've got plenty of overhead there.
    I didn't have good luck running these as jobs on our cluster, maybe a memory management thing on the nodes, so I just resorted to running everything locally on my workstation leaving a few cores for normal daily activities.

    Stranding was also a bit of a pain. It's nice that the Illumina chip only has about 3500 A/T, C/G SNPs. I decided to remove everything that had a MAF >40% in either 1000 Genomes or my data set, and spent the time doing a few PLINK --merge-mode 7 tests before I was confident that I'd gotten the stranding worked out.

    You've saved me more than a couple of times with your posts here, so I definitely owe ya. . .

    Thanks,
    Mike Baldwin

  2. Clarification
    The MAF >40% SNPs that I removed were limited to just the A/T, C/G SNPs on the 610quad.


Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.