Getting Genetics Done: How To Get Allele Frequencies and Create a PED file from 1000 Genomes Data

Thursday, April 21, 2011

How To Get Allele Frequencies and Create a PED file from 1000 Genomes Data

I recently analyzed some next-generation sequencing data, and I first wanted to compare the frequencies in my samples to those in the 1000 Genomes Project. It turns out this is much easier that I thought, as long as you're a little comfortable with the Linux command line.

First, you'll need a Linux system, and two utilities: tabix and vcftools.

I'm virtualizing an Ubuntu Linux system in Virtualbox on my Windows 7 machine. I had a little trouble compiling vcftools on my Ubuntu system out of the box. Before trying to compile tabix and vcftools I'd recommend installing the GNU C++ compiler and another development version of a compression library, zlib1g-dev. This is easy in Ubuntu. Just enter these commands at the terminal:

sudo apt-get install g++

sudo apt-get install zlib1g-dev

First, download tabix. I'm giving you the direct link to the most current version as of this writing, but you might go to the respective sourceforge pages to get the most recent version yourself. Use tar to unpack the download, go into the unzipped directory, and type "make" to compile the executable.

wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.3.tar.bz2

tar -jxvf tabix-0.2.3.tar.bz2

cd tabix-0.2.3/ 

make

Now do the same thing for vcf tools:

wget http://sourceforge.net/projects/vcftools/files/vcftools_v0.1.4a.tar.gz

tar -zxvf vcftools_v0.1.4a.tar.gz 

cd vcftools_0.1.4a/

make

The vcftools binary will be in the cpp directory. Copy both the tabix and vcftools executables to wherever you want to run your analysis.

Let's say that you wanted to pull all the 1000 genomes data from the CETP gene on chromosome 16, compute allele frequencies, and drop a linkage format PED file so you can look at linkage disequilibrium using Haploview.

First, use tabix to hit the 1000 genomes FTP site, pulling data from the 20080804 release for the CETP region (chr16:56,995,835-57,017,756), and save that output to a file called genotypes.vcf. Because tabix doesn't download the entire 1000 Genomes data and pulls only the sections you need, this is extremely fast. This should take around a minute, depending on your web connection and CPU speeds.

./tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf

Not too difficult, right? Now use vcftools (which works a lot like plink) to compute allele frequencies. This should take less than one second.

./vcftools --vcf genotypes.vcf --freq --out allelefrequencies

Finally, use vcftools to create a linkage format PED and MAP file that you can use in PLINK or Haploview. This took about 30 seconds for me.

./vcftools --vcf genotypes.vcf --plink --out plinkformat

That's it. It looks like you can also dig around in the supporting directory on the FTP site and pull out genotypes for specific ethnic groups as well (EUR, AFR, and ASN HapMap populations).

34 comments:

scut monkeyApril 22, 2011 at 7:42 AM
Thank you very much, Stephen! This is extremely helpful.
ReplyDelete
Replies
AnonymousApril 23, 2011 at 3:03 PM
Extremely helpful! Thank you!
ReplyDelete
Replies
AnonymousMay 15, 2011 at 3:34 PM
Thanks! This was incredibly easy to follow and useful!
ReplyDelete
Replies
WillMay 19, 2011 at 9:08 AM
One interesting point -- you are pulling allele frequencies across all populations in the 1000 genomes data. You should be able to filter individuals in the vcf tools...
ReplyDelete
Replies
AnonymousJuly 15, 2011 at 10:30 AM
great! thanks!
ReplyDelete
Replies
AnonymousJuly 17, 2011 at 8:25 PM
I am trying to meta-analysis of GWAS data. But I have a big troble in managing the raw GWAS data.
I can not convert the raw GWAS data which I downloaded from the site, http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap, to ped and map file for PLINK analysis.
The file name is like this, phs000202.pha002867.txt.

Could you tell me how to convert "phs000202.pha002867.txt" file to ped and map file (such as phs000202.pha002867.ped and phs000202.pha002867.map) for GWAS meta-analysis using PLINK.
ReplyDelete
Replies
JohnSeptember 27, 2011 at 2:14 PM
Hi Stephen,
You have no idea how helpful was this post to me as a person who have no Linux experience. I am running Ubuntu in windows 7 just like you. I have run the commands as mentioned above, but I am not able to proceed after this:
tar -jxvf tabix-0.2.3.tar.bz2

The above command is giving me errors. Should I change something here? -jxvf should be something else?

Thanks.
John.
ReplyDelete
Replies
AnonymousSeptember 27, 2011 at 2:41 PM
Its working now...ignore my above comment.
ReplyDelete
Replies
JohnSeptember 27, 2011 at 4:41 PM
The vcftools commands are not working as mentioned above...any mistake or should I make some changes before using it?
ReplyDelete
Replies
Stephen TurnerSeptember 29, 2011 at 5:28 AM
John - try using a su account when compiling vcftools, i.e. "sudo make"
ReplyDelete
Replies
JohnSeptember 29, 2011 at 11:52 PM
Stephen,
I am able to create the genotypes file, but not able to do anything with vcf tools even after running it from vcf tools folder by changing the directory to vcf tools. Also, what files do I pull to the folder where I do the analysis...I am talking about binary files for both vcf and tabix.
Thanks.
ReplyDelete
Replies
AnonymousNovember 6, 2011 at 9:20 AM
Thank-you so much!
ReplyDelete
Replies
Chloe and RuthDecember 6, 2011 at 11:21 AM
Hi Stephen
Thank you for this very useful tutorial. We've managed to follow your steps to extract out allele frequency. We would also be interested in extracting the corresponding RS number. Is there a command for this? Also, we want to download data for the whole genome. Is this possible and how would we go about doing this?
Thanks
ReplyDelete
Replies
Paul DennyFebruary 1, 2012 at 4:09 PM
Dear Stephen
fantastic tutorial - I've tried to pull selected data from a local .vcf file that I downloaded for a specific genomic interval, but the new .vcf file written contains no data - only all the header rows, plus the column labels in a standard VCF4.0 file. My starting file uses the #CHR names chr1, chr2, etc rather than 1, 2, 3, etc., which I've allowed for, so I'm stuck. any ideas welcomed.
ReplyDelete
Replies
Stephen TurnerFebruary 2, 2012 at 8:12 AM
Paul - I'm really not sure what's going on without some example data. I would suggest posting your question over at Seqanswers.com. They're a great community for helping you troubleshoot issues like this.
ReplyDelete
Replies
BramFebruary 20, 2012 at 11:41 AM
Hi Stephen,

Thanks for this wonderful post ! The approach to get partial vcfs and convert them to plink format works great ! Your last comment was regarding genotypes for specific ethnic groups. I have downloaded several subsection for europeans. Converting them to plink format was no problem, however I do encounter problems when calculating LD between a given SNP in a subsection, and the rest of of the SNPs in a subsection. This did not occur when I tested it with data from the combined populations. Specifically, I get only output of r2 values for 3 SNPs (from a 2 MB region), as opposed to several 1000's for combined population data (same region). Moreover, PLINK indicates that sex can not be identified, which might be the problem, however the --allow-no-sex option doesn't solve this. Any idea how to get LD calculations working for specific ethnic groups working ? I'm really stuck, I do appreciate your tutorial here very much !
ReplyDelete
Replies
Stephen TurnerFebruary 20, 2012 at 12:03 PM
Bram - there are two approaches if I understand you correctly. You can download the combined population and use Plink's --keep or --remove to keep or remove certain samples, or you could download the ethnic-specific data and include everyone. These should do the same thing. Have you tried both approaches?
ReplyDelete
Replies
BramMarch 30, 2012 at 6:36 AM
Hi Stephen,

To come back the discussion, our aim was to calculate LD between a given set of SNPs and SNPs in 1000 Genomes data (European dataset in supporting directory, PHASE I data). Turns out that VCFtools can't properly convert data from this subset to PLINK data, making usage of PLINK / Haploview impossible. Instead, we've performed LD calculations using VCFtools itself, though computationally intensive, which works fine. Interestingly, converting to PED files works fine for the dataset of all populations combined (not subsets).
ReplyDelete
Replies
AnonymousApril 26, 2012 at 12:18 PM
Hi Dr. Turner,

I found your post to be extremely helpful. I was mainly interested in obtaining the frequency data.

Is there a way to specify multiple regions in the ./tabix command that you used? For example, I would like to provide only exons from ~10 genes on different chromosomes. Is there a command that can tell tabix to read a bed file that has all of the coordinates?

Thanks in advance,
Mike
ReplyDelete
Replies
Stephen TurnerApril 26, 2012 at 3:16 PM
Not that I know of but writing a script to do this would be fairly trivial.
ReplyDelete
Replies
AnonymousJuly 23, 2012 at 9:59 AM
Hello,
I would just like to say thank you for this incredibly helpful post! I was getting stuck on just what to install... Very helpful tutorial.
ReplyDelete
Replies
charlesJuly 31, 2012 at 11:13 AM
hello i ran the ccommand --remove-filtered-geno-all and this is what i got. please can any one help me by explaining what this means.

Currently scanning CHROM: chr1
Currently scanning CHROM: chr1_random
Currently scanning CHROM: chr2
Currently scanning CHROM: chr3
Currently scanning CHROM: chr3_random
Currently scanning CHROM: chr4
Currently scanning CHROM: chr4_random
Currently scanning CHROM: chr5
Currently scanning CHROM: chr5_random
Currently scanning CHROM: chr6
Currently scanning CHROM: chr7
Currently scanning CHROM: chr7_random
Currently scanning CHROM: chr8
Currently scanning CHROM: chr9
Currently scanning CHROM: chr9_random
Currently scanning CHROM: chr10
Currently scanning CHROM: chr10_random
Currently scanning CHROM: chr11
Currently scanning CHROM: chr11_random
Currently scanning CHROM: chr12
Currently scanning CHROM: chr12_random
Currently scanning CHROM: chr13
Currently scanning CHROM: chr13_random
Currently scanning CHROM: chr14
Currently scanning CHROM: chr15
Currently scanning CHROM: chr16
Currently scanning CHROM: chr16_random
Currently scanning CHROM: chr17
Currently scanning CHROM: chr17_random
Currently scanning CHROM: chr18
Currently scanning CHROM: chr18_random
Currently scanning CHROM: chr19
Currently scanning CHROM: chrUn
Keeping 579457 entries (out of 579457 read)
Done
Filtering out all genotypes with FILTER flag.
Kept 6 out of 6 Individuals
Kept 579457 out of 579457 Sites
Run Time = 11.00 seconds
ReplyDelete
Replies
Juan CalderonSeptember 5, 2012 at 8:33 PM
Stephen,

First of all thanks for such a helpful resource in your blog! I am analyzing some WG data from the Sanger Wellcome Institute. I am able to retrieve the VCF files for a specific chromosome but vcftools wouldnt analyze my vcf file because it comes in vcf 3.3 format as opposed to 4.0 or 4.1. Is there any tool i could use to transform my data set into this newer format? Thanks a lot,

Juan
ReplyDelete
Replies
UnknownSeptember 24, 2012 at 5:38 PM
Hi Juan,
I haven't tried this myself, but there is a vcftools perl module to convert between different VCF formats:
http://vcftools.sourceforge.net/perl_module.html#vcf-convert

~Sarah
ReplyDelete
Replies
AnonymousNovember 7, 2012 at 12:23 PM
This might be helpful for someone who are trying to work with the 1000 genome data,
http://vat.gersteinlab.org/workflow.php

-Sri
ReplyDelete
Replies
UnknownMay 28, 2013 at 1:03 AM
Hi Stephen,

Thanks for the great tutorial.. Extremely useful.
ReplyDelete
Replies
UnknownNovember 12, 2013 at 3:20 PM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownNovember 12, 2013 at 3:38 PM
Hi Stephen,

Is there any way not to get indels while fetching the data using tabix ? Or later filtering of the vcf file is the only option ?
ReplyDelete
Replies
AnonymousMarch 16, 2014 at 2:19 PM
Hi Stephen and thanks for this useful post.
Instead selecting a particular region (like the CETP region), is there a way to download a specific set of SNPs around the genome (let's say 100) using tabix and vcf tools?
Thank you really much.
ReplyDelete
Replies
UnknownMarch 16, 2015 at 12:02 PM
Hi Stephen,

thank you for the useful post, i would like to know the allele frequencies out put file description, which is like
CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}
scaffold1|size923514 3945 2 1 A:0 C:1
scaffold1|size923514 19952 2 1 C:0 T:1
scaffold1|size923514 19961 2 1 A:0 G:1
scaffold1|size923514 19971 2 1 ACTCG:0 GCTCA:1
scaffold1|size923514 19995 2 1 G:0 A:1
scaffold1|size923514 20043 2 1 C:0 G:1
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.

This blog has moved!

Thursday, April 21, 2011

How To Get Allele Frequencies and Create a PED file from 1000 Genomes Data

34 comments: