Thursday, April 21, 2011

How To Get Allele Frequencies and Create a PED file from 1000 Genomes Data

I recently analyzed some next-generation sequencing data, and I first wanted to compare the frequencies in my samples to those in the 1000 Genomes Project. It turns out this is much easier that I thought, as long as you're a little comfortable with the Linux command line.

First, you'll need a Linux system, and two utilities: tabix and vcftools.

I'm virtualizing an Ubuntu Linux system in Virtualbox on my Windows 7 machine. I had a little trouble compiling vcftools on my Ubuntu system out of the box. Before trying to compile tabix and vcftools I'd recommend installing the GNU C++ compiler and another development version of a compression library, zlib1g-dev. This is easy in Ubuntu. Just enter these commands at the terminal:

sudo apt-get install g++
sudo apt-get install zlib1g-dev

First, download tabix. I'm giving you the direct link to the most current version as of this writing, but you might go to the respective sourceforge pages to get the most recent version yourself. Use tar to unpack the download, go into the unzipped directory, and type "make" to compile the executable.

wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.3.tar.bz2
tar -jxvf tabix-0.2.3.tar.bz2
cd tabix-0.2.3/
make

Now do the same thing for vcf tools:

wget http://sourceforge.net/projects/vcftools/files/vcftools_v0.1.4a.tar.gz
tar -zxvf vcftools_v0.1.4a.tar.gz 
cd vcftools_0.1.4a/
make

The vcftools binary will be in the cpp directory. Copy both the tabix and vcftools executables to wherever you want to run your analysis.

Let's say that you wanted to pull all the 1000 genomes data from the CETP gene on chromosome 16, compute allele frequencies, and drop a linkage format PED file so you can look at linkage disequilibrium using Haploview.

First, use tabix to hit the 1000 genomes FTP site, pulling data from the 20080804 release for the CETP region (chr16:56,995,835-57,017,756), and save that output to a file called genotypes.vcf. Because tabix doesn't download the entire 1000 Genomes data and pulls only the sections you need, this is extremely fast. This should take around a minute, depending on your web connection and CPU speeds.

./tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf

Not too difficult, right? Now use vcftools (which works a lot like plink) to compute allele frequencies. This should take less than one second.

./vcftools --vcf genotypes.vcf --freq --out allelefrequencies

Finally, use vcftools to create a linkage format PED and MAP file that you can use in PLINK or Haploview. This took about 30 seconds for me.

./vcftools --vcf genotypes.vcf --plink --out plinkformat

That's it. It looks like you can also dig around in the supporting directory on the FTP site and pull out genotypes for specific ethnic groups as well (EUR, AFR, and ASN HapMap populations).

34 comments:

  1. Thank you very much, Stephen! This is extremely helpful.

    ReplyDelete
  2. Extremely helpful! Thank you!

    ReplyDelete
  3. Thanks! This was incredibly easy to follow and useful!

    ReplyDelete
  4. One interesting point -- you are pulling allele frequencies across all populations in the 1000 genomes data. You should be able to filter individuals in the vcf tools...

    ReplyDelete
  5. I am trying to meta-analysis of GWAS data. But I have a big troble in managing the raw GWAS data.
    I can not convert the raw GWAS data which I downloaded from the site, http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap, to ped and map file for PLINK analysis.
    The file name is like this, phs000202.pha002867.txt.

    Could you tell me how to convert "phs000202.pha002867.txt" file to ped and map file (such as phs000202.pha002867.ped and phs000202.pha002867.map) for GWAS meta-analysis using PLINK.

    ReplyDelete
  6. Hi Stephen,
    You have no idea how helpful was this post to me as a person who have no Linux experience. I am running Ubuntu in windows 7 just like you. I have run the commands as mentioned above, but I am not able to proceed after this:
    tar -jxvf tabix-0.2.3.tar.bz2

    The above command is giving me errors. Should I change something here? -jxvf should be something else?

    Thanks.
    John.

    ReplyDelete
  7. Its working now...ignore my above comment.

    ReplyDelete
  8. The vcftools commands are not working as mentioned above...any mistake or should I make some changes before using it?

    ReplyDelete
  9. John - try using a su account when compiling vcftools, i.e. "sudo make"

    ReplyDelete
    Replies
    1. Hi Stephen please i need help
      i got the following error message
      and i do not understand it
      bilbo@ubuntu:~/tabix-0.2.3$ make
      make[1]: Entering directory `/home/bilbo/tabix-0.2.3'
      gcc -c -g -Wall -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE bgzf.c -o bgzf.o
      bgzf.c: In function ‘bgzf_close’:
      bgzf.c:618:8: warning: variable ‘count’ set but not used [-Wunused-but-set-variable]
      gcc -c -g -Wall -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE kstring.c -o kstring.o
      gcc -c -g -Wall -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE knetfile.c -o knetfile.o
      knetfile.c: In function ‘khttp_connect_file’:
      knetfile.c:418:2: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
      knetfile.c: In function ‘kftp_send_cmd’:
      knetfile.c:239:2: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
      gcc -c -g -Wall -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE index.c -o index.o
      ar -cru libtabix.a bgzf.o kstring.o knetfile.o index.o
      gcc -c -g -Wall -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE main.c -o main.o
      gcc -g -Wall -O2 -fPIC -o tabix main.o -lm -lz -L. -ltabix
      ./libtabix.a(bgzf.o): In function `deflate_block':
      /home/bilbo/tabix-0.2.3/bgzf.c:300: undefined reference to `deflate'
      /home/bilbo/tabix-0.2.3/bgzf.c:302: undefined reference to `deflateEnd'
      /home/bilbo/tabix-0.2.3/bgzf.c:294: undefined reference to `deflateInit2_'
      /home/bilbo/tabix-0.2.3/bgzf.c:318: undefined reference to `deflateEnd'
      /home/bilbo/tabix-0.2.3/bgzf.c:334: undefined reference to `crc32'
      /home/bilbo/tabix-0.2.3/bgzf.c:335: undefined reference to `crc32'
      ./libtabix.a(bgzf.o): In function `inflate_block':
      /home/bilbo/tabix-0.2.3/bgzf.c:368: undefined reference to `inflateInit2_'
      /home/bilbo/tabix-0.2.3/bgzf.c:373: undefined reference to `inflate'
      /home/bilbo/tabix-0.2.3/bgzf.c:379: undefined reference to `inflateEnd'
      /home/bilbo/tabix-0.2.3/bgzf.c:375: undefined reference to `inflateEnd'
      collect2: ld returned 1 exit status
      make[1]: *** [tabix] Error 1
      make[1]: Leaving directory `/home/bilbo/tabix-0.2.3'
      make: *** [all-recur] Error 1
      bilbo@ubuntu:~/tabix-0.2.3$ "sudo make"
      sudo make: command not found
      bilbo@ubuntu:~/tabix-0.2.3$

      Delete
  10. Stephen,
    I am able to create the genotypes file, but not able to do anything with vcf tools even after running it from vcf tools folder by changing the directory to vcf tools. Also, what files do I pull to the folder where I do the analysis...I am talking about binary files for both vcf and tabix.
    Thanks.

    ReplyDelete
  11. Thank-you so much!

    ReplyDelete
  12. Hi Stephen
    Thank you for this very useful tutorial. We've managed to follow your steps to extract out allele frequency. We would also be interested in extracting the corresponding RS number. Is there a command for this? Also, we want to download data for the whole genome. Is this possible and how would we go about doing this?
    Thanks

    ReplyDelete
  13. Dear Stephen
    fantastic tutorial - I've tried to pull selected data from a local .vcf file that I downloaded for a specific genomic interval, but the new .vcf file written contains no data - only all the header rows, plus the column labels in a standard VCF4.0 file. My starting file uses the #CHR names chr1, chr2, etc rather than 1, 2, 3, etc., which I've allowed for, so I'm stuck. any ideas welcomed.

    ReplyDelete
  14. Paul - I'm really not sure what's going on without some example data. I would suggest posting your question over at Seqanswers.com. They're a great community for helping you troubleshoot issues like this.

    ReplyDelete
  15. Hi Stephen,

    Thanks for this wonderful post ! The approach to get partial vcfs and convert them to plink format works great ! Your last comment was regarding genotypes for specific ethnic groups. I have downloaded several subsection for europeans. Converting them to plink format was no problem, however I do encounter problems when calculating LD between a given SNP in a subsection, and the rest of of the SNPs in a subsection. This did not occur when I tested it with data from the combined populations. Specifically, I get only output of r2 values for 3 SNPs (from a 2 MB region), as opposed to several 1000's for combined population data (same region). Moreover, PLINK indicates that sex can not be identified, which might be the problem, however the --allow-no-sex option doesn't solve this. Any idea how to get LD calculations working for specific ethnic groups working ? I'm really stuck, I do appreciate your tutorial here very much !

    ReplyDelete
  16. Bram - there are two approaches if I understand you correctly. You can download the combined population and use Plink's --keep or --remove to keep or remove certain samples, or you could download the ethnic-specific data and include everyone. These should do the same thing. Have you tried both approaches?

    ReplyDelete
    Replies
    1. Hi Stephen !

      Wow, I didn't expect such as fast reply. I tried the last option, including everyone but from ethnic specific data (EUR). I think perhaps the problem could be that there's 283 individuals for the european population, limiting the ability to calculate LD over large distances as compared to larger samplesizes. I will try your other suggestion as well.

      As I know, VCFtools can basically perform the same, so I will try this as well, and get back to this in this thread if successful.

      Anyway, without this post I wouldn't be able so fast to figure all this out, so hereby thanks again, also for your fast reply !

      Delete
  17. Hi Stephen,

    To come back the discussion, our aim was to calculate LD between a given set of SNPs and SNPs in 1000 Genomes data (European dataset in supporting directory, PHASE I data). Turns out that VCFtools can't properly convert data from this subset to PLINK data, making usage of PLINK / Haploview impossible. Instead, we've performed LD calculations using VCFtools itself, though computationally intensive, which works fine. Interestingly, converting to PED files works fine for the dataset of all populations combined (not subsets).

    ReplyDelete
  18. Hi Dr. Turner,

    I found your post to be extremely helpful. I was mainly interested in obtaining the frequency data.

    Is there a way to specify multiple regions in the ./tabix command that you used? For example, I would like to provide only exons from ~10 genes on different chromosomes. Is there a command that can tell tabix to read a bed file that has all of the coordinates?

    Thanks in advance,
    Mike

    ReplyDelete
    Replies
    1. In case you have not found this already, you can specify the -B option to tabix followed by a bed file, e.g. tabix fileName.gz -B test.bed

      Delete
  19. Not that I know of but writing a script to do this would be fairly trivial.

    ReplyDelete
  20. Hello,
    I would just like to say thank you for this incredibly helpful post! I was getting stuck on just what to install... Very helpful tutorial.

    ReplyDelete
  21. hello i ran the ccommand --remove-filtered-geno-all and this is what i got. please can any one help me by explaining what this means.

    Currently scanning CHROM: chr1
    Currently scanning CHROM: chr1_random
    Currently scanning CHROM: chr2
    Currently scanning CHROM: chr3
    Currently scanning CHROM: chr3_random
    Currently scanning CHROM: chr4
    Currently scanning CHROM: chr4_random
    Currently scanning CHROM: chr5
    Currently scanning CHROM: chr5_random
    Currently scanning CHROM: chr6
    Currently scanning CHROM: chr7
    Currently scanning CHROM: chr7_random
    Currently scanning CHROM: chr8
    Currently scanning CHROM: chr9
    Currently scanning CHROM: chr9_random
    Currently scanning CHROM: chr10
    Currently scanning CHROM: chr10_random
    Currently scanning CHROM: chr11
    Currently scanning CHROM: chr11_random
    Currently scanning CHROM: chr12
    Currently scanning CHROM: chr12_random
    Currently scanning CHROM: chr13
    Currently scanning CHROM: chr13_random
    Currently scanning CHROM: chr14
    Currently scanning CHROM: chr15
    Currently scanning CHROM: chr16
    Currently scanning CHROM: chr16_random
    Currently scanning CHROM: chr17
    Currently scanning CHROM: chr17_random
    Currently scanning CHROM: chr18
    Currently scanning CHROM: chr18_random
    Currently scanning CHROM: chr19
    Currently scanning CHROM: chrUn
    Keeping 579457 entries (out of 579457 read)
    Done
    Filtering out all genotypes with FILTER flag.
    Kept 6 out of 6 Individuals
    Kept 579457 out of 579457 Sites
    Run Time = 11.00 seconds

    ReplyDelete
  22. Stephen,

    First of all thanks for such a helpful resource in your blog! I am analyzing some WG data from the Sanger Wellcome Institute. I am able to retrieve the VCF files for a specific chromosome but vcftools wouldnt analyze my vcf file because it comes in vcf 3.3 format as opposed to 4.0 or 4.1. Is there any tool i could use to transform my data set into this newer format? Thanks a lot,

    Juan

    ReplyDelete
  23. Hi Juan,
    I haven't tried this myself, but there is a vcftools perl module to convert between different VCF formats:
    http://vcftools.sourceforge.net/perl_module.html#vcf-convert

    ~Sarah

    ReplyDelete
  24. This might be helpful for someone who are trying to work with the 1000 genome data,
    http://vat.gersteinlab.org/workflow.php

    -Sri

    ReplyDelete
  25. Hi Stephen,

    Thanks for the great tutorial.. Extremely useful.

    ReplyDelete
  26. This comment has been removed by the author.

    ReplyDelete
  27. Hi Stephen,

    Is there any way not to get indels while fetching the data using tabix ? Or later filtering of the vcf file is the only option ?

    ReplyDelete
  28. Hi Stephen and thanks for this useful post.
    Instead selecting a particular region (like the CETP region), is there a way to download a specific set of SNPs around the genome (let's say 100) using tabix and vcf tools?
    Thank you really much.

    ReplyDelete
  29. Hi Stephen,

    thank you for the useful post, i would like to know the allele frequencies out put file description, which is like
    CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}
    scaffold1|size923514 3945 2 1 A:0 C:1
    scaffold1|size923514 19952 2 1 C:0 T:1
    scaffold1|size923514 19961 2 1 A:0 G:1
    scaffold1|size923514 19971 2 1 ACTCG:0 GCTCA:1
    scaffold1|size923514 19995 2 1 G:0 A:1
    scaffold1|size923514 20043 2 1 C:0 G:1

    ReplyDelete

Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.