Friday, January 20, 2012

Joint Techs Netcast: Enhancing Infrastructure Support for Data Intensive Science

The winter Joint Techs meeting is next week in Baton Rouge. I'm not going, but I plan on participating via a netcast to see what's going on. Jim Bottum, Clemson's CIO, is moderating an entire day devoted to the topic Enhancing Infrastructure Support for Data Intensive Science. Of particular interest to me are the talks from 9:30-11am Tuesday January 24 from researchers and those supporting climatology, genomics, and the XSEDE projects. The afternoon of January 24 has some talks from academic and government labs who've successfully deployed methods to enhance their infrastructure support for data intensive science. Check out the full agenda for the day here. These sessions sound particularly relevant for those researching and supporting large-scale genomics and bioinformatics projects.

Joint Techs Meeting Netcast

Tuesday, January 17, 2012

Annotating limma Results with Gene Names for Affy Microarrays

Lately I've been using the limma package often for analyzing microarray data. When I read in Affy CEL files using ReadAffy(), the resulting ExpressionSet won't contain any featureData annotation. Consequentially, when I run topTable to get a list of differentially expressed genes, there's no annotation information other than the Affymetrix probeset IDs or transcript cluster IDs. There are other ways of annotating these results (INNER JOIN to a MySQL database, biomaRt, etc), but I would like to have the output from topTable already annotated with gene information. Ideally, I could annotate each probeset ID with a gene symbol, gene name, Ensembl ID, and have that Ensembl ID hyperlink out to the Ensembl genome browser. With some help from Gordon Smyth on the Bioconductor Mailing list, I found that annotating the ExpressionSet object results in the output from topTable also being annotated.

The results from topTable are pretty uninformative without annotation:


After annotation:


You can generate an HTML file with clickable links to the Ensembl Genome Browser for each gene:


Here's the R code to do it:

Thursday, January 5, 2012

New Year's Resolution: Learn How to Code

Farhad Manjoo at Slate has a good article on why you need to learn how to program. Chances are, if you're reading this post here you're already fairly adept at some form of programming. But if you're not, you should give it some serious thought.

Gina Trapani, former editor of tech blog Lifehacker, is quoted in the article:
“Learning to code demystifies tech in a way that empowers and enlightens. When you start coding you realize that every digital tool you have ever used involved lines of code just like the ones you're writing, and that if you want to make an existing app better, you can do just that with the same foreach and if-then statements every coder has ever used.”
Farhad makes the point that programming is important even in traditionally non-computational fields: if you were a travel agent in the 90's and knew how to code, not only would you have been able to see the approaching inevitable collapse of your profession, but perhaps you would have been able to get in early on the dot-com travel industry boom.

Q&A sites for biologists are littered with questions from researchers asking for non-technical, code-free ways of doing a particular analysis. Your friendly bioinformatics or computational biology neighbor can often point to a resource or design a solution that can get you 90% of the way, but usually won't grok the biological problem as truly as you do. By learning even the smallest bit of programming, you can at least be equipped with the knowledge of what is programmatically possible, and collaborations with your bioinformatician can be more fruitful. As every field of biological research becomes more computational in nature, learning how to code is becoming more important than ever.

Where to start

Getting started really isn't that difficult. Grab a good text editor like Notepad++ for windows, TextMate or Macvim for Mac, or vim for Linux/Unix. What language should you start with? This can be a subject of intense debate, but in reality, it doesn't matter - just pick something that's relevant to what you're doing. If you know Perl or Java, you can pick up the basics of Ruby or C++ in a weekend. I started with Perl (using the Llama book), but for scientific computing and basic scripting/automation, I would recommend learning Python instead. While Perl lets you get away with sloppy coding, terse shortcuts, with the motto of "there's more than one way to do it," Python forces you to keep your code tidy, and has a model that there's probably one best way to do something, and that's the way you should use. Python has a huge following in the scientific community - chances are you'll find plenty of useful functionality in the BioPython and SciPy modules. I learned Python in an afternoon through watching videos and doing exercises in Google's Python Class, and the free book Dive Into Python is a great reference. If you're on Windows, you can get Python from ActiveState; if you're on Mac or Linux, you already have Python.

The Slate article also points to Code Year - a site that will send you interactive coding projects once a week throughout 2012 starting January 9. Code Year is from the creators of Code Academy - a site with a series of fun, interactive JavaScript tutorials. Lifehacker has a 5-part "Night School" series on the basics of programming. Once you have some basic programming chops, take a look at Stanford's free machine learningartificial intelligence, and Natural Language Processing classes to hone your scientific computing skills. Need a challenge? Try the Python Challenge for fun puzzles to hone your Python skills, or check out Project Euler if you want to tackle more math-oriented programming challenges with any language. The point is - there is no lack of free resources to help you get started or get better at programming.

Slate - You Need to Learn How to Program
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.