Thursday, June 30, 2011

Mapping SNPs to Genes for GWAS Enrichment Analysis

There are several tools available for conducting a post-hoc analysis of GWAS data looking for enrichment of significant SNPs using literature or pathway based resources. Examples include GRAIL, ALLIGATOR, and WebGestalt among others (see SNPath R Package). Since gene enrichment and pathway analysis essentially evolved from methods for analyzing gene expression data, many of these tools require specific gene identifiers as input.

To prep GWAS data for analyses such as these, you'll need to ensure that your SNPs have current map position information, and assign each SNP to genes. To update your map information, the easiest resource to use is the UCSC Tables browser.

To use UCSC, go to the TABLES link on the browser page. Select the current assembly, "Variation and Repeats" for group, "Common SNPs(132)" for the track, and "snp132Commmon" for the table. Now you'll need to upload your SNP list. You can manually edit the map file using a tool like Excel or Linux commands like awk to extract the rs_ids for your SNPs into a text file. Upload this file by clicking the "Upload List" button in the "identifiers" section. Once this file has uploaded, select "BED - browser extensible data" for the output format and click "Get Output". The interface will ask you if you want to add upstream or downstream regions -- for SNPs this makes little sense, so leave these at zero. Also, ensure that "Send to Galaxy" is not checked.

This will export a BED file containing the positions of all your SNPs in the build you selected, preferably the most current one -- let's call it SNPs.bed. BED files were developed by the folks behind the UCSC browser to document genomic annotations. The basic format is:

Chromosome|StartPosition|EndPosition|FeatureLabel


Because BED files were designed to list genomic ranges, SNPs are listed with a range of 1 BP, meaning the StartPosition is the location of the SNP and the EndPosition is the location of the SNP + 1.

The next step is to map these over to genes. To save a lot of effort resolving identifiers (which is no fun, I promise), I have generated a list of ~17,000 genes that are consistently annotated across Ensembl and Entrez-gene databases, and which have HUGO identifiers. These files are available here:

Hugo Genes
Entrez Genes
Ensembl Genes

The genes listed in these files were generated by comparing the cross-references between the Ensembl and Entrez-gene databases. To arrive at a set of "consensus" genes, I selected only genes where Ensembl refers to an Entrez-gene with the same coordinates, and that Entrez-gene entry refers back to the same Ensembl gene. Nearly all cases of inconsistent cross-referencing are genes annotated based on electronic predictions, or genes with multiple or inconsistent mappings. For the purpose of doing enrichment analysis, these would be ignored anyway, as they aren't likely to show up in pathway or functional databases. For these genes, we then obtained the HUGO approved gene identifier. The coordinates for all genes are mapped using hg19/GRChB37.

BED files are plain text and easy to modify. If you wanted to add 10KB to the gene bounds, for example, you could alter these files using excel formulas, or with a simple awk command:

awk '{printf("%d\t%d\t%d\t%s\n", $1, $2-10000, $3+10000, $4); }' ensemblgenes.bed


You'll need to download a collection of utilities called http://www.blogger.com/img/blank.gif BEDTools. If you have access to a Linux system, this should be fairly straightforward. These utilities are already installed on Vanderbilt's Linux application servers, but to compile the program yourself, download the .tar.gz file above, and use the following commands to compile:

tar -zxvf BEDTools..tar.gz
cd BEDTools
make clean
make all
ls bin

# copy the binaries to a directory in your PATH. e.g.,
sudo cp bin/* /usr/local/bin

Once you get BEDTools installed, there is a utility called intersectBed. This command allows you to examine the overlap of two BED files. Since we have our collection of SNPs and our collections of Genes in BED file format already, all we need to do is run the command.

intersectBed -a SNPs.bed -b entrezgenes.bed -wa -wb


This will print to the screen any SNP that falls within an EntrezGene boundary, along with the coordinates from each entry in each file. To strip this down to only RS numbers and gene identifiers, we can pipe this through awk.

intersectBed -a SNPs.bed -b entrezgenes.bed -wa -wb | awk '{printf("%s\t%s\n", $4, $8);}'

This will produce a tab-delimited file with SNPs that fall within genes identified by EntrezGene ID number. For HUGO names or Ensembl gene IDs, use the corresponding BED file.

19 comments:

  1. Hi,
    We have a perl utility which can makes these files using the ensembl perl API as part of our tool FORGE. This link gives the very short tutorial:

    https://github.com/inti/FORGE/wiki/Tutorials-1%3A-Make-snp-to-gene-mapping.

    It can be downloaded from the other link below along with the rest of package to conduct gene-based analysis of GWAS and subsequent pathway analysis.

    https://github.com/inti/FORGE

    Regards

    Gerome Breen gerome.breen AT kcl.ac.uk

    ReplyDelete
  2. Thanks Gerome! I haven't used FORGE yet, but I'll add it to the list -- I really like the idea of a gene-based statistic for enrichment analyses. It seems like a better way to handle the biases due to gene size, etc.

    ReplyDelete
  3. Hi.

    We also get a tool PfSNP http://pfs.nus.edu.sg which can easily map the SNP to gene. It requires rs No. And gives extra information like whether the SNP is potentially functional or not.

    ReplyDelete
  4. What great timing! I was just sitting down to figure out how to do this, and now I see you've done it already!

    ReplyDelete
  5. Thanks for the helpful blog, it's such a great resource!

    One quick question. I can see that some Hugo genes don't have a corresponding Entrez ID (22,975 HUGO genes in the genelist vs 17,685 Entrez ID's). Forgive the ignorant question, but I am very keen on using these genelists so just wondering why this is?

    ReplyDelete
  6. That's not an ignorant question at all! First, I believe the HUGO file has 20975 (not 22975) -- still there are ~3000 genes that don't overlap. When building these lists, we examined the agreement between what Ensembl calls a "gene" and what Entrez calls a "gene". The 17685 from Entrez represents the set of genes where the two resources agree. We actually pulled the hugo symbols from Ensembl and didn't really restrict that set at all. Most of the genes that aren't common between the sets are computational predictions or genes that aren't characterized very well and thus aren't likely to be useful for enrichment analyses.

    ReplyDelete
  7. [Full disclosure: coauthor of following rec]

    You might also want to look at DAPPLE, which defines enriched networks from protein interaction data rather than testing previously defined pathways.

    http://www.broadinstitute.org/mpg/dapple/dapple.php

    Cheers
    Chris

    ReplyDelete
  8. Hello, This post is very helpful, however when I running the command, it is not printing anything to screen or creating a new file. So I'm not sure if this is running properly. I followed all the instructions carefully. I even tried intersecting with all three gene lists, with no luck.

    Where would this new tab-limited file be created?

    I am not even getting any errors when running the command, would this just mean that my list of snps are not mapping to any gene?

    Thank you so much for your help

    ReplyDelete
    Replies
    1. Hi Rish, I have the same problem now. Had you figured out a solution?

      Delete
    2. I've figured out the solution: the intersection is empty because the first columns with the chromosome numbers don't match (when in one file it's chr1 in the other it's just 1). We can fix the problem by adding chr to the first column of entrezgenes.bed. This command creates the file entregeneschr.bed where the prefix chr is added:
      sed -e 's/^/chr/' entrezgenes.bed>entrezgeneschr.bed
      Great thanks for the blog!

      Delete
  9. Hi, Thank you for providing such user friendly tools! I tried to use your codes to map single SNP to multiple Genes within a certain distance range, but I have following problems:

    1)The following command will produce negative starting positions for some SNPs and it seems your tool "intersectBed" doesn't allow negative positions.
    awk '{printf("%d\t%d\t%d\t%s\n", $1, $2-10000, $3+10000, $4); }' ensemblgenes.bed

    2) After I corrected the negative positions by replacing them with 0s in the bed file, the following command can be run but produced empty result.
    intersectBed -a SNPs.bed -b entrezgenes.bed -wa -wb | awk '{printf("%s\t%s\n", $4, $8);}'

    ReplyDelete
  10. I am perhaps too much of a dilettante to be meandering this board... but all people messing with genetics data should at least try to do the best analysis, right? So, I would like to do a gene based analysis form GWAS data. I.e. get a p-value per gene (preferably corrected for multiple testing) not per SNP. I am at the point where I have just my GWAS ouput: a list of rs numbers from the affy 6.0, their chr, position and the P-value for association with a continuous trait. Any ideas how I might be able to go from this to a gene based P-value set? I have no a priori set of genes that I am interested in, so anything that involves looking things up by hand is going to be somewhat unfeasible.

    Thanks for any pointers anyone can give,

    Lekki

    ReplyDelete
  11. Lekki - please see some of the links in the first paragraph of this post. You have to be careful of pathway or enrichment analysis with GWAS data. You have a problem here that you don't have with microarrays - namely, that you have many SNPs per gene, different gene sizes, and different SNP densities in genes. Think of the problem like this. Imagine the null is true - you have random numbers for your phenotype. A large, SNP-dense gene, say a gene with 100 SNPs, will have about 5 SNPs with a nominal p-value of <0.05. Compare this to a gene with only one SNP - it probably won't be <0.05 by chance. So you have this problem that large SNP-dense genes will by chance have a significantly associated SNP than smaller, SNP-sparse genes. If you condense entire genes down to a single-best p-value, you're biasing any kind of downstream pathway or enrichment analysis towards larger genes with more SNPs. And this doesn't even get into the LD problem. Methods like ALLIGATOR have permutation procedures built into the method to account for this problem. So do some of the SNPath functions. Take a look at some of those to get you started.

    ReplyDelete
  12. Oh goodness, I had not even thought of it that way. Thank you. I will dig more and see where I go. I appreciate your help...

    ReplyDelete
  13. thanks for providing the gene files, i am going to use them.

    ReplyDelete
  14. Hi Stephen!
    I have been using this annotation method and I wanted to share an issue I came upon.
    When I download the *.bed from UCSC, the chromosome column format is, e.g.,: "chr4", when in the list you provide the format is simply "4". If we execute bedtools this way, it will not retrieve any result becuase chromosomes do not match. Even if it is a straightforward thing to correct, maybe it could complicate things at first until you realize about it.
    Apart from this, I have to say that all this approach you suggest works fine. And the gene lists are specially appreciated... ;)

    Best,

    juan

    ReplyDelete
  15. Hi Stephen
    Thanks very much for sharing experiences.
    I am a big fan of Getting Genetics Done. It is really helpful to me.
    However, I found that GWAS Enrichment for some species such as pig are very poor while many overlapped packages just for focusing on human data. The only way we can do by trying to using Biomart to get all GOs, and KEGG to get all pathways... than to perform test by our-self.
    Do you have any suggestion in this cases?
    Best regards
    DDO

    ReplyDelete
  16. Hi Stephen
    Thanks for the great website, and all the tutotrials.
    I have a question if you could assist me - I have th output from a GWAS, and want now to do the permutations based set-based test in Plink, to have a gene-based analysis. I need for that a list with ll rsnumbers assign to each gene, in a list. After seeing all these steps I am not sure the result from here is what I want. Am I doing something wrong, or could you advise for which other tools I should use?

    Thank you in advance,
    my best regards

    Nuno

    ReplyDelete

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.