Friday, January 20, 2012

Joint Techs Netcast: Enhancing Infrastructure Support for Data Intensive Science

The winter Joint Techs meeting is next week in Baton Rouge. I'm not going, but I plan on participating via a netcast to see what's going on. Jim Bottum, Clemson's CIO, is moderating an entire day devoted to the topic Enhancing Infrastructure Support for Data Intensive Science. Of particular interest to me are the talks from 9:30-11am Tuesday January 24 from researchers and those supporting climatology, genomics, and the XSEDE projects. The afternoon of January 24 has some talks from academic and government labs who've successfully deployed methods to enhance their infrastructure support for data intensive science. Check out the full agenda for the day here. These sessions sound particularly relevant for those researching and supporting large-scale genomics and bioinformatics projects.

Joint Techs Meeting Netcast

Tuesday, January 17, 2012

Annotating limma Results with Gene Names for Affy Microarrays

Lately I've been using the limma package often for analyzing microarray data. When I read in Affy CEL files using ReadAffy(), the resulting ExpressionSet won't contain any featureData annotation. Consequentially, when I run topTable to get a list of differentially expressed genes, there's no annotation information other than the Affymetrix probeset IDs or transcript cluster IDs. There are other ways of annotating these results (INNER JOIN to a MySQL database, biomaRt, etc), but I would like to have the output from topTable already annotated with gene information. Ideally, I could annotate each probeset ID with a gene symbol, gene name, Ensembl ID, and have that Ensembl ID hyperlink out to the Ensembl genome browser. With some help from Gordon Smyth on the Bioconductor Mailing list, I found that annotating the ExpressionSet object results in the output from topTable also being annotated.

The results from topTable are pretty uninformative without annotation:
ID logFC AveExpr t P.Value adj.P.Val B
7902702 -6.8 7.8 -46 1.0e-09 3.3e-05 11.0
8117594 -4.3 10.0 -33 9.6e-09 1.5e-04 10.0
8168794 -3.6 6.5 -30 1.7e-08 1.9e-04 9.7
8103736 -3.4 7.7 -28 3.1e-08 2.5e-04 9.3
7897426 -4.0 7.2 -26 4.7e-08 3.1e-04 9.0
8094278 -3.7 7.0 -24 7.4e-08 3.7e-04 8.7
view raw noanno.txt hosted with ❤ by GitHub


After annotation:
ID Symbol Name Ensembl logFC AveExpr t P.Value adj.P.Val B
7902702 CLCA2 chloride channel accessory 2 <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000137975'>ENSG00000137975</a> -6.832901 7.793344 -46.35621 1.015175e-09 3.281148e-05 11.177075
8117594 HIST1H2BM histone cluster 1, H2bm <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000196374'>ENSG00000196374</a> -4.255756 10.096873 -33.21938 9.579122e-09 1.548034e-04 10.063255
8168794 CENPI centromere protein I <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000102384'>ENSG00000102384</a> -3.612035 6.486830 -30.41465 1.733246e-08 1.867342e-04 9.698515
8103736 SCRG1 stimulator of chondrogenesis 1 <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000164106'>ENSG00000164106</a> -3.433737 7.690753 -27.94797 3.058621e-08 2.471442e-04 9.321687
7897426 CA6 carbonic anhydrase VI <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000131686'>ENSG00000131686</a> -4.048937 7.173867 -26.18729 4.732396e-08 3.059115e-04 9.014277
8094278 NCAPG non-SMC condensin I complex, subunit G <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000109805'>ENSG00000109805</a> -3.658660 7.008561 -24.48718 7.419854e-08 3.743457e-04 8.681846
view raw afteranno.txt hosted with ❤ by GitHub


You can generate an HTML file with clickable links to the Ensembl Genome Browser for each gene:


Here's the R code to do it:
# Install the necessary bioconductor packages. Use the appropriate annotation.db file
# for your organism/platform. See http://www.bioconductor.org/packages/release/data/annotation/
source("http://www.bioconductor.org/biocLite.R")
biocLite("affy")
biocLite("limma")
biocLite("annotate")
biocLite("hugene10sttranscriptcluster.db")
install.packages("R2HTML")
# Load required packages
library(affy)
library(limma)
library(annotate)
library(hugene10sttranscriptcluster.db)
library(R2HTML)
# RMA to convert Affybatch to ExpressionSet using robust multichip average
eset <- rma(myaffybatch)
class(eset)
# Which platform?
eset@annotation
# Which objects are in this annotation package?
ls("package:hugene10sttranscriptcluster.db")
# Get the transcript cluster IDs from the expressionset
ID <- featureNames(eset)
# Look up the Gene Symbol, name, and Ensembl Gene ID for each of those IDs
Symbol <- getSYMBOL(ID, "hugene10sttranscriptcluster.db")
Name <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME"))
Ensembl <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "ENSEMBL"))
# Note: looking up Ensembl gene IDs by Affy probe IDs can sometimes result in multiple
# gene IDs being returned. This is why the default output from lookUp is a list.
# I'm coercing this list into a character vector so it's the same size as my list of IDs,
# But this is discarding information where one probe ID maps to multiple gene IDs.
# If you're aware of a better solution, please leave it in the comments! Thanks.
# Wherever you have a missing Ensembl ID, make hyperlink missing also.
# If you have an Ensembl ID, create a hyperlink that goes to the
# Ensembl genome browser. Change the URL if you're not using human.
Ensembl <- ifelse(Ensembl=="NA", NA,
paste("<a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=",
Ensembl, "'>", Ensembl, "</a>", sep=""))
# Make a temporary data frame with all those identifiers
tmp <- data.frame(ID=ID, Symbol=Symbol, Name=Name, Ensembl=Ensembl, stringsAsFactors=F)
# The stringsAsFactors makes "NA" characters. This fixes that problem.
tmp[tmp=="NA"] <- NA
# set the featureData for your expressionset using the data frame you created above.
fData(eset) <- tmp
# Clean up
rm(ID, Symbol, Name, Ensembl, tmp)
# Fit the model
fit <- lmFit(eset,design)
fit <- eBayes(fit)
tt <- topTable(fit, coef="tissue")
# Write out an HTML file with clickable links to the Ensembl Genome Browser.
HTML(tt, "out.html", append=F)
view raw annotatelimma.r hosted with ❤ by GitHub

Thursday, January 5, 2012

New Year's Resolution: Learn How to Code

Farhad Manjoo at Slate has a good article on why you need to learn how to program. Chances are, if you're reading this post here you're already fairly adept at some form of programming. But if you're not, you should give it some serious thought.

Gina Trapani, former editor of tech blog Lifehacker, is quoted in the article:
“Learning to code demystifies tech in a way that empowers and enlightens. When you start coding you realize that every digital tool you have ever used involved lines of code just like the ones you're writing, and that if you want to make an existing app better, you can do just that with the same foreach and if-then statements every coder has ever used.”
Farhad makes the point that programming is important even in traditionally non-computational fields: if you were a travel agent in the 90's and knew how to code, not only would you have been able to see the approaching inevitable collapse of your profession, but perhaps you would have been able to get in early on the dot-com travel industry boom.

Q&A sites for biologists are littered with questions from researchers asking for non-technical, code-free ways of doing a particular analysis. Your friendly bioinformatics or computational biology neighbor can often point to a resource or design a solution that can get you 90% of the way, but usually won't grok the biological problem as truly as you do. By learning even the smallest bit of programming, you can at least be equipped with the knowledge of what is programmatically possible, and collaborations with your bioinformatician can be more fruitful. As every field of biological research becomes more computational in nature, learning how to code is becoming more important than ever.

Where to start

Getting started really isn't that difficult. Grab a good text editor like Notepad++ for windows, TextMate or Macvim for Mac, or vim for Linux/Unix. What language should you start with? This can be a subject of intense debate, but in reality, it doesn't matter - just pick something that's relevant to what you're doing. If you know Perl or Java, you can pick up the basics of Ruby or C++ in a weekend. I started with Perl (using the Llama book), but for scientific computing and basic scripting/automation, I would recommend learning Python instead. While Perl lets you get away with sloppy coding, terse shortcuts, with the motto of "there's more than one way to do it," Python forces you to keep your code tidy, and has a model that there's probably one best way to do something, and that's the way you should use. Python has a huge following in the scientific community - chances are you'll find plenty of useful functionality in the BioPython and SciPy modules. I learned Python in an afternoon through watching videos and doing exercises in Google's Python Class, and the free book Dive Into Python is a great reference. If you're on Windows, you can get Python from ActiveState; if you're on Mac or Linux, you already have Python.

The Slate article also points to Code Year - a site that will send you interactive coding projects once a week throughout 2012 starting January 9. Code Year is from the creators of Code Academy - a site with a series of fun, interactive JavaScript tutorials. Lifehacker has a 5-part "Night School" series on the basics of programming. Once you have some basic programming chops, take a look at Stanford's free machine learningartificial intelligence, and Natural Language Processing classes to hone your scientific computing skills. Need a challenge? Try the Python Challenge for fun puzzles to hone your Python skills, or check out Project Euler if you want to tackle more math-oriented programming challenges with any language. The point is - there is no lack of free resources to help you get started or get better at programming.

Slate - You Need to Learn How to Program
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.