Tuesday, January 17, 2012

Annotating limma Results with Gene Names for Affy Microarrays

Lately I've been using the limma package often for analyzing microarray data. When I read in Affy CEL files using ReadAffy(), the resulting ExpressionSet won't contain any featureData annotation. Consequentially, when I run topTable to get a list of differentially expressed genes, there's no annotation information other than the Affymetrix probeset IDs or transcript cluster IDs. There are other ways of annotating these results (INNER JOIN to a MySQL database, biomaRt, etc), but I would like to have the output from topTable already annotated with gene information. Ideally, I could annotate each probeset ID with a gene symbol, gene name, Ensembl ID, and have that Ensembl ID hyperlink out to the Ensembl genome browser. With some help from Gordon Smyth on the Bioconductor Mailing list, I found that annotating the ExpressionSet object results in the output from topTable also being annotated.

The results from topTable are pretty uninformative without annotation:
ID logFC AveExpr t P.Value adj.P.Val B
7902702 -6.8 7.8 -46 1.0e-09 3.3e-05 11.0
8117594 -4.3 10.0 -33 9.6e-09 1.5e-04 10.0
8168794 -3.6 6.5 -30 1.7e-08 1.9e-04 9.7
8103736 -3.4 7.7 -28 3.1e-08 2.5e-04 9.3
7897426 -4.0 7.2 -26 4.7e-08 3.1e-04 9.0
8094278 -3.7 7.0 -24 7.4e-08 3.7e-04 8.7
view raw noanno.txt hosted with ❤ by GitHub


After annotation:
ID Symbol Name Ensembl logFC AveExpr t P.Value adj.P.Val B
7902702 CLCA2 chloride channel accessory 2 <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000137975'>ENSG00000137975</a> -6.832901 7.793344 -46.35621 1.015175e-09 3.281148e-05 11.177075
8117594 HIST1H2BM histone cluster 1, H2bm <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000196374'>ENSG00000196374</a> -4.255756 10.096873 -33.21938 9.579122e-09 1.548034e-04 10.063255
8168794 CENPI centromere protein I <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000102384'>ENSG00000102384</a> -3.612035 6.486830 -30.41465 1.733246e-08 1.867342e-04 9.698515
8103736 SCRG1 stimulator of chondrogenesis 1 <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000164106'>ENSG00000164106</a> -3.433737 7.690753 -27.94797 3.058621e-08 2.471442e-04 9.321687
7897426 CA6 carbonic anhydrase VI <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000131686'>ENSG00000131686</a> -4.048937 7.173867 -26.18729 4.732396e-08 3.059115e-04 9.014277
8094278 NCAPG non-SMC condensin I complex, subunit G <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000109805'>ENSG00000109805</a> -3.658660 7.008561 -24.48718 7.419854e-08 3.743457e-04 8.681846
view raw afteranno.txt hosted with ❤ by GitHub


You can generate an HTML file with clickable links to the Ensembl Genome Browser for each gene:


Here's the R code to do it:
# Install the necessary bioconductor packages. Use the appropriate annotation.db file
# for your organism/platform. See http://www.bioconductor.org/packages/release/data/annotation/
source("http://www.bioconductor.org/biocLite.R")
biocLite("affy")
biocLite("limma")
biocLite("annotate")
biocLite("hugene10sttranscriptcluster.db")
install.packages("R2HTML")
# Load required packages
library(affy)
library(limma)
library(annotate)
library(hugene10sttranscriptcluster.db)
library(R2HTML)
# RMA to convert Affybatch to ExpressionSet using robust multichip average
eset <- rma(myaffybatch)
class(eset)
# Which platform?
eset@annotation
# Which objects are in this annotation package?
ls("package:hugene10sttranscriptcluster.db")
# Get the transcript cluster IDs from the expressionset
ID <- featureNames(eset)
# Look up the Gene Symbol, name, and Ensembl Gene ID for each of those IDs
Symbol <- getSYMBOL(ID, "hugene10sttranscriptcluster.db")
Name <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME"))
Ensembl <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "ENSEMBL"))
# Note: looking up Ensembl gene IDs by Affy probe IDs can sometimes result in multiple
# gene IDs being returned. This is why the default output from lookUp is a list.
# I'm coercing this list into a character vector so it's the same size as my list of IDs,
# But this is discarding information where one probe ID maps to multiple gene IDs.
# If you're aware of a better solution, please leave it in the comments! Thanks.
# Wherever you have a missing Ensembl ID, make hyperlink missing also.
# If you have an Ensembl ID, create a hyperlink that goes to the
# Ensembl genome browser. Change the URL if you're not using human.
Ensembl <- ifelse(Ensembl=="NA", NA,
paste("<a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=",
Ensembl, "'>", Ensembl, "</a>", sep=""))
# Make a temporary data frame with all those identifiers
tmp <- data.frame(ID=ID, Symbol=Symbol, Name=Name, Ensembl=Ensembl, stringsAsFactors=F)
# The stringsAsFactors makes "NA" characters. This fixes that problem.
tmp[tmp=="NA"] <- NA
# set the featureData for your expressionset using the data frame you created above.
fData(eset) <- tmp
# Clean up
rm(ID, Symbol, Name, Ensembl, tmp)
# Fit the model
fit <- lmFit(eset,design)
fit <- eBayes(fit)
tt <- topTable(fit, coef="tissue")
# Write out an HTML file with clickable links to the Ensembl Genome Browser.
HTML(tt, "out.html", append=F)
view raw annotatelimma.r hosted with ❤ by GitHub

8 comments:

  1. How about package affycoretools? Gives detailed html results with common annotations you want, up, down genes, fold changes, stats etc. corresponding to venn diagrams from limma..quite straightforward

    ReplyDelete
  2. affycoretools definitely sounds useful - will give it a go! Thanks.

    ReplyDelete
  3. Thanks for the script! I modified it to work with the Drosophila genome with Flybase links instead of Ensemble and it worked perfectly

    ReplyDelete
  4. Hi It looks like the R code link is broken now.

    ReplyDelete
    Replies
    1. Last line: Here's the R code to do it:
      Where is the R code? I thought you had a gist for it, and the link for the gist is gone?

      Delete

Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.