Friday, January 20, 2012

Joint Techs Netcast: Enhancing Infrastructure Support for Data Intensive Science

The winter Joint Techs meeting is next week in Baton Rouge. I'm not going, but I plan on participating via a netcast to see what's going on. Jim Bottum, Clemson's CIO, is moderating an entire day devoted to the topic Enhancing Infrastructure Support for Data Intensive Science. Of particular interest to me are the talks from 9:30-11am Tuesday January 24 from researchers and those supporting climatology, genomics, and the XSEDE projects. The afternoon of January 24 has some talks from academic and government labs who've successfully deployed methods to enhance their infrastructure support for data intensive science. Check out the full agenda for the day here. These sessions sound particularly relevant for those researching and supporting large-scale genomics and bioinformatics projects.

Joint Techs Meeting Netcast

Tuesday, January 17, 2012

Annotating limma Results with Gene Names for Affy Microarrays

Lately I've been using the limma package often for analyzing microarray data. When I read in Affy CEL files using ReadAffy(), the resulting ExpressionSet won't contain any featureData annotation. Consequently, when I run topTable to get a list of differentially expressed genes, there's no annotation information other than the Affymetrix probeset IDs or transcript cluster IDs. There are other ways of annotating these results (INNER JOIN to a MySQL database, biomaRt, etc.), but I would like to have the output from topTable already annotated with gene information. Ideally, I could annotate each probeset ID with a gene symbol, gene name, and Ensembl ID, and have that Ensembl ID hyperlink out to the Ensembl genome browser. With some help from Gordon Smyth on the Bioconductor mailing list, I found that annotating the ExpressionSet object results in the output from topTable also being annotated.

The results from topTable are pretty uninformative without annotation:


After annotation:


You can generate an HTML file with clickable links to the Ensembl Genome Browser for each gene:


Here's the R code to do it:
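As a minimal sketch of the idea (not the original code from this post), the example below builds a toy ExpressionSet from random numbers, attaches a made-up annotation table with fData(), and shows that the annotation columns come through in the topTable output. The probe IDs, gene symbols, and the two-group design are all hypothetical. With real CEL files you would build the ExpressionSet with rma(ReadAffy()) and look up symbols from the chip's Bioconductor annotation package, for example with getSYMBOL() from the annotate package.

```r
library(Biobase)
library(limma)

set.seed(1)

# Toy data standing in for rma(ReadAffy()): 100 probesets x 6 arrays
ids <- paste0("probe_", 1:100)
exprs_mat <- matrix(rnorm(600), nrow = 100,
                    dimnames = list(ids, paste0("array", 1:6)))
eset <- ExpressionSet(assayData = exprs_mat)

# Hypothetical annotation table; row names must match featureNames(eset)
annot <- data.frame(ID = ids,
                    Symbol = paste0("GENE", 1:100),
                    row.names = ids,
                    stringsAsFactors = FALSE)

# Attach the annotation to the ExpressionSet as featureData
fData(eset) <- annot

# Hypothetical two-group design: 3 controls, 3 treated
group <- factor(rep(c("ctrl", "trt"), each = 3))
design <- model.matrix(~ group)

# lmFit() copies the featureData into the fit, so topTable() reports it
fit <- eBayes(lmFit(eset, design))
tt <- topTable(fit, coef = 2)
print(colnames(tt))
```

Because the annotation travels with the ExpressionSet, nothing has to be merged back into the results table after the fact.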

Thursday, January 5, 2012

New Year's Resolution: Learn How to Code

Farhad Manjoo at Slate has a good article on why you need to learn how to program. Chances are, if you're reading this post here you're already fairly adept at some form of programming. But if you're not, you should give it some serious thought.

Gina Trapani, former editor of tech blog Lifehacker, is quoted in the article:
“Learning to code demystifies tech in a way that empowers and enlightens. When you start coding you realize that every digital tool you have ever used involved lines of code just like the ones you're writing, and that if you want to make an existing app better, you can do just that with the same foreach and if-then statements every coder has ever used.”
Farhad makes the point that programming is important even in traditionally non-computational fields: if you were a travel agent in the 1990s and knew how to code, not only would you have been able to see the approaching inevitable collapse of your profession, but perhaps you would have been able to get in early on the dot-com travel industry boom.

Q&A sites for biologists are littered with questions from researchers asking for non-technical, code-free ways of doing a particular analysis. Your friendly bioinformatics or computational biology neighbor can often point to a resource or design a solution that can get you 90% of the way, but usually won't grok the biological problem as truly as you do. By learning even the smallest bit of programming, you can at least be equipped with the knowledge of what is programmatically possible, and collaborations with your bioinformatician can be more fruitful. As every field of biological research becomes more computational in nature, learning how to code is becoming more important than ever.

Where to start

Getting started really isn't that difficult. Grab a good text editor like Notepad++ for Windows, TextMate or MacVim for Mac, or vim for Linux/Unix. What language should you start with? This can be a subject of intense debate, but in reality, it doesn't matter - just pick something that's relevant to what you're doing. If you know Perl or Java, you can pick up the basics of Ruby or C++ in a weekend. I started with Perl (using the Llama book), but for scientific computing and basic scripting/automation, I would recommend learning Python instead. Perl lets you get away with sloppy coding and terse shortcuts under its motto of "there's more than one way to do it." Python forces you to keep your code tidy, and follows the philosophy that there's probably one best way to do something, and that's the way you should use. Python has a huge following in the scientific community - chances are you'll find plenty of useful functionality in the Biopython and SciPy modules. I learned Python in an afternoon through watching videos and doing exercises in Google's Python Class, and the free book Dive Into Python is a great reference. If you're on Windows, you can get Python from ActiveState; if you're on Mac or Linux, you already have Python.

The Slate article also points to Code Year - a site that will send you interactive coding projects once a week throughout 2012 starting January 9. Code Year is from the creators of Codecademy - a site with a series of fun, interactive JavaScript tutorials. Lifehacker has a 5-part "Night School" series on the basics of programming. Once you have some basic programming chops, take a look at Stanford's free machine learning, artificial intelligence, and natural language processing classes to hone your scientific computing skills. Need a challenge? Try the Python Challenge for fun puzzles to sharpen your Python skills, or check out Project Euler if you want to tackle more math-oriented programming challenges with any language. The point is - there is no lack of free resources to help you get started or get better at programming.

Slate - You Need to Learn How to Program

Thursday, December 15, 2011

Query a MySQL Database from R using RMySQL

I use this all the time, and the setup is dead simple. Follow the code below to load the RMySQL package, connect to a database (here the UCSC genome browser's public MySQL instance), set up a function to make querying easier, and query the database to return results as a data frame.
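Here is a minimal sketch of those steps (not the original code from this post). It assumes the RMySQL package is installed and that UCSC's public server is reachable at genome-mysql.soe.ucsc.edu with the read-only user "genome" and no password. The host name, the hg19 database, and the knownGene query are assumptions chosen for illustration, so check UCSC's current documentation before relying on them.

```r
library(RMySQL)

# Connect to the UCSC public MySQL server (assumed host, read-only user)
con <- dbConnect(MySQL(),
                 user = "genome",
                 host = "genome-mysql.soe.ucsc.edu",
                 dbname = "hg19")

# Small helper so each query is a single short call
query <- function(sql) dbGetQuery(con, sql)

# Example query: the results come back as a data frame
genes <- query("SELECT name, chrom, txStart, txEnd
                FROM knownGene
                WHERE chrom = 'chr21'
                LIMIT 10")
print(genes)

# Close the connection when finished
dbDisconnect(con)
```

Wrapping dbGetQuery() in a short helper keeps interactive sessions tidy, since the connection object only has to be named once.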

Galaxy Project Group on CiteULike and Mendeley


The Galaxy Project started using CiteULike to organize papers that are about, use, or reference Galaxy. The Galaxy CiteULike group is open to any CUL user, and once you join, you can add papers to the group, assign tags, and rate papers.

While not a CUL user, I'm a big fan of Mendeley for managing references, PDFs, and creating bibliographies (and so are many of you). I'm happy to hear that the Galaxy folks also set up a Galaxy Mendeley Group, also open to the public for anyone to join. If you join the Galaxy public Mendeley group, all of the group's references will show up in your Mendeley library (and these won't count against your personal quota).

Just one important thing to note: The Mendeley group is a mirror of the CiteULike group, so if you want to add more publications to the Galaxy Group, add them on CiteULike, not Mendeley (it doesn't work the other way around - papers added to Mendeley won't make it to the CUL group).

Galaxy Project Group on CiteULike and Mendeley

Thursday, December 8, 2011

RNA-Seq & ChIP-Seq Data Analysis Course at EBI

I just got this announcement from EMBL-EBI about an RNA-seq/ChIP-seq analysis hands-on course. Find the full details, schedule, and speaker list here.

Title: Advanced RNA-Seq and ChIP-Seq Data Analysis Course
Date: May 1-4 2012
Venue: EMBL-EBI, Hinxton, Nr Cambridge, CB10 1SD, UK
Registration Closing Date: March 6 2012 (12:00 midday GMT)

This course is aimed at advanced PhD students and post-doctoral researchers who are applying or planning to apply high throughput sequencing technologies and bioinformatics methods in their research. The aim of this course is to familiarize the participants with advanced data analysis methodologies and provide hands-on training on the latest analytical approaches.

Lectures will give insight into how biological knowledge can be generated from RNA-seq and ChIP-seq experiments and illustrate different ways of analyzing such data. Practicals will consist of computer exercises that will enable the participants to apply statistical methods to the analysis of RNA-seq and ChIP-seq data under the guidance of the lecturers and teaching assistants. Familiarity with the technology and biological use cases of high throughput sequencing is required, as is some experience with R/Bioconductor.

The course covers data analysis of RNA-Seq and ChIP-Seq experiments.
Topics will include: alignment, data handling and visualisation, region identification, differential expression, data quality assessment and statistical analysis, using R/Bioconductor.

Tuesday, December 6, 2011

An example RNA-Seq Quality Control and Analysis Workflow

I found the slides below on the education page from Bioinformatics & Research Computing at the Whitehead Institute. The first set (PDF) gives an overview of the methods and software available for quality assessment of microarray and RNA-seq experiments using the FASTX-Toolkit and FastQC.



The second set (PDF) gives an example RNA-seq workflow using TopHat, SAMtools, Python/HTSeq, and R/DESeq.



If you're doing any RNA-seq work, these are both really nice resources to help you get a command-line based analysis workflow up and running (if you're not using Galaxy for RNA-seq).

Monday, December 5, 2011

Webinar: Applications of Next-Generation Sequencing in Clinical Care

I just got an email from Illumina about a webinar that looks interesting this Wednesday at 9am PST (noon EST) on clinical applications of next-gen sequencing.

Date: Wednesday, December 7, 2011
Time: 9:00 AM (PST)
Speaker: Rick Dewey, MD, Stanford Center for Inherited Cardiovascular Disease



Next-generation sequencing (NGS) presents both challenges and opportunities for clinical care. Dr. Dewey will share examples from his experience at Stanford, successful and otherwise, in which NGS has been applied to cases of familial cardiomyopathy and other inherited conditions. Bring your questions for a Q&A session. In this webinar, Dr. Dewey will discuss approaches to: data storage and management; error identification and reduction; disease risk encoded in the reference sequence; and variant validation.

The webinar will be recorded and available to you afterwards if you register.

Registration - Applications of Next-Generation Sequencing in Clinical Care
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.