First, you'll need a Linux system, and two utilities: tabix and vcftools.
I'm virtualizing an Ubuntu Linux system in Virtualbox on my Windows 7 machine. I had a little trouble compiling vcftools on my Ubuntu system out of the box. Before trying to compile tabix and vcftools I'd recommend installing the GNU C++ compiler and another development version of a compression library, zlib1g-dev. This is easy in Ubuntu. Just enter these commands at the terminal:
sudo apt-get install g++
sudo apt-get install zlib1g-dev
First, download tabix. I'm giving you the direct link to the most current version as of this writing, but you might go to the respective sourceforge pages to get the most recent version yourself. Use tar to unpack the download, go into the unzipped directory, and type "make" to compile the executable.
wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.3.tar.bz2
tar -jxvf tabix-0.2.3.tar.bz2
cd tabix-0.2.3/
make
Now do the same thing for vcf tools:
wget http://sourceforge.net/projects/vcftools/files/vcftools_v0.1.4a.tar.gz
tar -zxvf vcftools_v0.1.4a.tar.gz
tar -zxvf vcftools_v0.1.4a.tar.gz
cd vcftools_0.1.4a/
make
make
The vcftools binary will be in the cpp directory. Copy both the tabix and vcftools executables to wherever you want to run your analysis.
Let's say that you wanted to pull all the 1000 genomes data from the CETP gene on chromosome 16, compute allele frequencies, and drop a linkage format PED file so you can look at linkage disequilibrium using Haploview.
First, use tabix to hit the 1000 genomes FTP site, pulling data from the 20080804 release for the CETP region (chr16:56,995,835-57,017,756), and save that output to a file called genotypes.vcf. Because tabix doesn't download the entire 1000 Genomes data and pulls only the sections you need, this is extremely fast. This should take around a minute, depending on your web connection and CPU speeds.
./tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf
Not too difficult, right? Now use vcftools (which works a lot like plink) to compute allele frequencies. This should take less than one second.
./vcftools --vcf genotypes.vcf --freq --out allelefrequencies
Finally, use vcftools to create a linkage format PED and MAP file that you can use in PLINK or Haploview. This took about 30 seconds for me.
./vcftools --vcf genotypes.vcf --plink --out plinkformat
That's it. It looks like you can also dig around in the supporting directory on the FTP site and pull out genotypes for specific ethnic groups as well (EUR, AFR, and ASN HapMap populations).

Thank you very much, Stephen! This is extremely helpful.
ReplyDeleteExtremely helpful! Thank you!
ReplyDeleteThanks! This was incredibly easy to follow and useful!
ReplyDeleteOne interesting point -- you are pulling allele frequencies across all populations in the 1000 genomes data. You should be able to filter individuals in the vcf tools...
ReplyDeletegreat! thanks!
ReplyDeleteI am trying to meta-analysis of GWAS data. But I have a big troble in managing the raw GWAS data.
ReplyDeleteI can not convert the raw GWAS data which I downloaded from the site, http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap, to ped and map file for PLINK analysis.
The file name is like this, phs000202.pha002867.txt.
Could you tell me how to convert "phs000202.pha002867.txt" file to ped and map file (such as phs000202.pha002867.ped and phs000202.pha002867.map) for GWAS meta-analysis using PLINK.
Hi Stephen,
ReplyDeleteYou have no idea how helpful was this post to me as a person who have no Linux experience. I am running Ubuntu in windows 7 just like you. I have run the commands as mentioned above, but I am not able to proceed after this:
tar -jxvf tabix-0.2.3.tar.bz2
The above command is giving me errors. Should I change something here? -jxvf should be something else?
Thanks.
John.
Its working now...ignore my above comment.
ReplyDeleteThe vcftools commands are not working as mentioned above...any mistake or should I make some changes before using it?
ReplyDeleteJohn - try using a su account when compiling vcftools, i.e. "sudo make"
ReplyDeleteStephen,
ReplyDeleteI am able to create the genotypes file, but not able to do anything with vcf tools even after running it from vcf tools folder by changing the directory to vcf tools. Also, what files do I pull to the folder where I do the analysis...I am talking about binary files for both vcf and tabix.
Thanks.
Thank-you so much!
ReplyDeleteHi Stephen
ReplyDeleteThank you for this very useful tutorial. We've managed to follow your steps to extract out allele frequency. We would also be interested in extracting the corresponding RS number. Is there a command for this? Also, we want to download data for the whole genome. Is this possible and how would we go about doing this?
Thanks
Dear Stephen
ReplyDeletefantastic tutorial - I've tried to pull selected data from a local .vcf file that I downloaded for a specific genomic interval, but the new .vcf file written contains no data - only all the header rows, plus the column labels in a standard VCF4.0 file. My starting file uses the #CHR names chr1, chr2, etc rather than 1, 2, 3, etc., which I've allowed for, so I'm stuck. any ideas welcomed.
Paul - I'm really not sure what's going on without some example data. I would suggest posting your question over at Seqanswers.com. They're a great community for helping you troubleshoot issues like this.
ReplyDeleteHi Stephen,
ReplyDeleteThanks for this wonderful post ! The approach to get partial vcfs and convert them to plink format works great ! Your last comment was regarding genotypes for specific ethnic groups. I have downloaded several subsection for europeans. Converting them to plink format was no problem, however I do encounter problems when calculating LD between a given SNP in a subsection, and the rest of of the SNPs in a subsection. This did not occur when I tested it with data from the combined populations. Specifically, I get only output of r2 values for 3 SNPs (from a 2 MB region), as opposed to several 1000's for combined population data (same region). Moreover, PLINK indicates that sex can not be identified, which might be the problem, however the --allow-no-sex option doesn't solve this. Any idea how to get LD calculations working for specific ethnic groups working ? I'm really stuck, I do appreciate your tutorial here very much !
Bram - there are two approaches if I understand you correctly. You can download the combined population and use Plink's --keep or --remove to keep or remove certain samples, or you could download the ethnic-specific data and include everyone. These should do the same thing. Have you tried both approaches?
ReplyDeleteHi Stephen !
DeleteWow, I didn't expect such as fast reply. I tried the last option, including everyone but from ethnic specific data (EUR). I think perhaps the problem could be that there's 283 individuals for the european population, limiting the ability to calculate LD over large distances as compared to larger samplesizes. I will try your other suggestion as well.
As I know, VCFtools can basically perform the same, so I will try this as well, and get back to this in this thread if successful.
Anyway, without this post I wouldn't be able so fast to figure all this out, so hereby thanks again, also for your fast reply !
Hi Stephen,
ReplyDeleteTo come back the discussion, our aim was to calculate LD between a given set of SNPs and SNPs in 1000 Genomes data (European dataset in supporting directory, PHASE I data). Turns out that VCFtools can't properly convert data from this subset to PLINK data, making usage of PLINK / Haploview impossible. Instead, we've performed LD calculations using VCFtools itself, though computationally intensive, which works fine. Interestingly, converting to PED files works fine for the dataset of all populations combined (not subsets).
Hi Dr. Turner,
ReplyDeleteI found your post to be extremely helpful. I was mainly interested in obtaining the frequency data.
Is there a way to specify multiple regions in the ./tabix command that you used? For example, I would like to provide only exons from ~10 genes on different chromosomes. Is there a command that can tell tabix to read a bed file that has all of the coordinates?
Thanks in advance,
Mike
In case you have not found this already, you can specify the -B option to tabix followed by a bed file, e.g. tabix fileName.gz -B test.bed
DeleteNot that I know of but writing a script to do this would be fairly trivial.
ReplyDelete