Getting Genetics Done

Wednesday, February 20, 2013

NetGestalt for Data Visualization in the Context of Pathways

Many of you may be familiar with WebGestalt, a wonderful web utility developed by Bing Zhang at Vanderbilt for doing basic gene-set enrichment analyses. Last year, we invited Bing to speak at our annual retreat for the Vanderbilt Graduate Program in Human Genetics, and he did not disappoint! Bing walked us through his new tool called NetGestalt.

NetGestalt provides users with the ability to overlay large-scale experimental data onto biological networks. Data are loaded using continuous and binary tracks that can contain either single or multiple lines of data (called composite tracks). Continuous tracks could be gene expression intensities from microarray data or any other quantitative measure that can be mapped to the genome. Binary tracks are usually insertion/deletion regions, or called regions like ChIP peaks. NetGestalt extends many of the features of WebGestalt, including enrichment analysis for modules within a biological network, and provides easy ways to visualize the overlay of multiple tracks with Venn diagrams.

Netgestalt provides a very nice interface for interacting with data. Extensive documentation on how to use it can be found here. Bing and his colleagues also went the extra mile to create video tutorials on how to use their web tool, and walk you through an analysis of some tumor data.

http://www.netgestalt.org/

Tuesday, February 12, 2013

"Document Design and Purpose, Not Mechanics"

If you ever write code for scientific computing (chances are you do if you're here), stop what you're doing and spend 8 minutes reading this open-access paper:

Wilson et al. Best Practices for Scientific Computing. arXiv:1210.0530 (2012). (Direct link to PDF).

The paper makes a number of good points regarding software as a tool just like any other lab equipment: it should be built, validated, and used as carefully as any other physical instrumentation. Yet most scientists who write software are self-taught, and haven't been properly trained in fundamental software development skills.

The paper outlines ten practices every computational biologist should adopt when writing code for research computing. Most of these are the usual suspects that you'd probably guess - using version control, workflow management, writing good documentation, modularizing code into functions, unit testing, agile development, etc. One that particularly jumped out at me was the recommendation to document design and purpose, not mechanics.

We all know that good comments and documentation is critical for code reproducibility and maintenance, but inline documentation that recapitulates the code is hardly useful. Instead, we should aim to document the underlying ideas, interface, and reasons, not the implementation.

For example, the following commentary is hardly useful:

# Increment the variable "i" by one.

i = i+1

The real recommendation here is that if your code requires such substantial documentation of the actual implementation to be understandable, it's better to spend the time rewriting the code rather than writing a lengthy description of what it does. I'm very guilty of doing this with R code, nesting multiple levels of functions and vector operations:

# It would take a paragraph to explain what this is doing.

# Better to break up into multiple lines of code.

sapply(data.frame(n=sapply(x, function(d) sum(is.na(d)))), function(dd) mean(dd))

It would take much more time to properly document what this is doing than it would take to split the operation into manageable chunks over multiple lines such that the code no longer needs an explanation. We're not playing code golf here - using fewer lines doesn't make you a better programmer.

Best Practices for Scientific Computing

Monday, January 28, 2013

Scotty, We Need More Power! Power, Sample Size, and Coverage Estimation for RNA-Seq

Two of the most common questions at the beginning of an RNA-seq experiments are "how many reads do I need?" and "how many replicates do I need?". This paper describes a web application for designing RNA-seq applications that calculates an appropriate sample size and read depth to satisfy user-defined criteria such as cost, maximum number of reads or replicates attainable, etc. The power and sample size estimations are based on a t-test, which the authors claim, performs no worse than the negative binomial models implemented by popular RNA-seq methods such as DESeq, when there are three or more replicates present. Empirical distributions are taken from either (1) pilot data that the user can upload, or (2) built in publicly available data. The authors find that there is substantial heterogeneity between experiments (technical variation is larger than biological variation in many cases), and that power and sample size estimation will be more accurate when the user provides their own pilot data.

My only complaint, for all the reasons expressed in my previous blog post about why you shouldn't host things like this exclusively on your lab website, is that the code to run this analysis doesn't appear to be available to save, study, modify, maintain, or archive. When lead author Michele Busby leaves Gabor Marth's lab, hopefully the app doesn't fall into the graveyard of computational biology web apps. Update 2/7/13: Michele Busby created a public Github repository for the Scotty code: https://github.com/mbusby/Scotty

tl;dr? There's a new web app that does power, sample size, and coverage calculations for RNA-seq, but it only works well if the pilot or public data you give it closely matches the actual data you'll collect.

Paper: Busby, et al. "Scotty: A Web Tool For Designing RNA-Seq Experiments to Measure Differential Gene Expression." Bioinformatics (2013): 10.1093/bioinformatics/btt015.

Web app: http://euler.bc.edu/marthlab/scotty/scotty.php

Source code: https://github.com/mbusby/Scotty

Monday, January 14, 2013

The Pacific Symposium on Biocomputing 2013

For 18 years now, computational biologists have convened on the beautiful islands of Hawaii to present and discuss research emerging from new areas of biomedicine. PSB Conference Chairs Teri Klein (@teriklein), Keith Dunker, Russ Altman (@Rbaltman) and Larry Hunter (@ProfLHunter) organize innovative sessions and tutorials that are always interactive and thought-provoking. This year, sessions included Computational Drug Repositioning, Epigenomics, Aberrant Pathway and Network Activity, Personalized Medicine, Phylogenomics and Population Genomics, Post-Next Generation Sequencing, and Text and Data Mining. The Proceedings are available online here, and a few of the highlights are:

Cheng et al. examine various analytical methods for processing data from the Connectivity Map, a dataset of gene expression changes due to small molecule treatment. They compare methods for identifying drug-induced gene expression profiles to a benchmark based on the Anatomical Theraputic Chemical (ATC) system with the hope of discovering additional mechanisms of action.

Huang et al. developed a recursive K-means spectral clustering algorithm and applied this method to gene expression data from the Cancer Genome Atlas. It provides better cluster separation than traditional hierarchical clustering, and better execution time than similar K-means approaches.

Schrider et al. used pooled paired-end sequence data from multiple Drosophila melanogaster species along the eastern US coast to identify copy number variants under selective pressure. Many of the CNVs identified contain CYP enzymes likely influencing insecticide resistance. Schrider also pointed out in his talk that human salivary amylase (AMY1) has copy numbers that are differentiated across human populations due to differences in dietary starch content. Cool!

Verspoor et al. presented an awesome application of text mining to identify catalytic protein residues from the biomedical literature. Text mining tasks are always wrought with difficulties such as identifier ambiguity and resolution, or simply identifying the corpus of text needed for the task. Using Literature-Enhanced Automated Prediction of Functional Sites (LEAP-FS) and the Protein Data Bank (with Pubmed references), they compare their text mining approach to the Catalytic Site Atlas as a ‘silver standard’. Despite the difficulty, a simple classifier gives an accuracy around 70% (measured by F-statistic).

Also, my colleague Ting Hu presented her excellent work on statistical epistasis networks which use entropy-based measures to identify high-order interactions in genetic data. And in case you are interested, I’ll end by shamelessly listing our own publications in complex data analysis and rare-variant population structure (with Marylyn Ritchie), and performance of the Illumina Metabochip in Hispanic samples and high-throughput epidemiology (with Dana Crawford).

PSB is always a fantastic meeting – hope to see you in 2014!

Tuesday, January 8, 2013

Stop Hosting Data and Code on your Lab Website

It's happened to all of us. You read about a new tool, database, webservice, software, or some interesting and useful data, but when you browse to http://instititution.edu/~home/professorX/lab/data, there's no trace of what you were looking for.

THE PROBLEM

This isn't an uncommon problem. See the following two articles:

Schultheiss, Sebastian J., et al. "Persistence and availability of web services in computational biology." PLoS one 6.9 (2011): e24914.

Wren, Jonathan D. "404 not found: the stability and persistence of URLs published in MEDLINE." Bioinformatics 20.5 (2004): 668-672.

The first gives us some alarming statistics. In a survey of nearly 1000 web services published in the Nucleic Acids Web Server Issue between 2003 and 2009:

Only 72% were still available at the published address.
The authors could not test the functionality for 33% because there was no example data, and 13% no longer worked as expected.
The authors could only confirm positive functionality for 45%.
Only 274 of the 872 corresponding authors answered an email.
Of these 78% said a service was developed by a student or temporary researcher, and many had no plan for maintenance after the researcher had moved on to a permanent position.

The Wren et al. paper found that of 1630 URLs identified in Pubmed abstracts, only 63% were consistently available. That rate was far worse for anonymous login FTP sites (33%).

OpenHelix recently started this thread on Biostar as an obituary section for bioinformatics tools and resources that have vanished.

It's a fact that most of us academics move around a fair amount. Often we may not deem a tool we developed or data we collected and released to be worth transporting and maintaining. After some grace period, the resource disappears without a trace.

SOFTWARE

I won't spend much time here because most readers here are probably aware of source code repositories for hosting software projects. Unless you're not releasing the source code to your software (aside: starting an open-source project is a way to stake a claim in a field, not a real risk for getting yourself scooped), I can think of no benefit for hosting your code on your lab website when there are plenty of better alternatives available, such as Sourceforge, GitHub, Google Code, and others. In addition to free project hosting, tools like these provide version control, wikis, bug trackers, mailing lists and other services to enable transparent and open development with the end result of a better product and higher visibility. For more tips on open scientific software development, see this short editorial in PLoS Comp Bio:

Prlić A, Procter JB (2012) Ten Simple Rules for the Open Development of Scientific Software. PLoS Comput Biol 8(12): e1002802.

Casey Bergman recently analyzed where bioinformaticians are hosting their code, where he finds that the growth rate of Github is outpacing both Google Code and Sourceforge. Indeed, Github hosts more repositories than there are articles in Wikipedia, and has an excellent tutorial and interactive learning modules to help you learn how to use it. However, Bergman also points out how easy it is to delete a repository from Github and Google Code, where repositories are published by individuals who hold the keys to preservation (as opposed to Sourceforge, where it is extremely difficult to remove a project once it's been released).

DATA, FIGURES, SLIDES, WEB SERVICES, OR ANYTHING ELSE

For everything else there's Figshare. Figshare lets you host and publicly share unlimited data (or store data privately up to 1GB). The name suggests a site for sharing figures, but Figshare allows you to permanently store and share any research object. That can be figures, slides, negative results, videos, datasets, or anything else. If you're running a database server or web service, you can package up the source code on one of the repositories mentioned above, and upload to Figshare a virtual machine image of the server running it, so that the service will be available to users long after you've lost the time, interest, or money to maintain it.

Research outputs stored at Figshare are archived in the CLOCKSS geographically and geopolitically distributed network of redundant archive nodes, located at 12 major research libraries around the world. This means that content will remain available indefinitely for everyone after a "trigger event," and ensures this work will be maximally accessible and useful over time. Figshare is hosted using Amazon Web Services to ensure the highest level of security and stability for research data.

Upon uploading your data to Figshare, your data becomes discoverable, searchable, shareable, and instantly citable with its own DOI, allowing you to instantly take credit for the products of your research.

To show you how easy this is, I recently uploaded a list of "consensus" genes generated by Will Bush where Ensembl refers to an Entrez-gene with the same coordinates, and that Entrez-gene entry refers back to the same Ensembl gene (discussed in more detail in this previous post).

Create an account, and hit the big upload link. You'll be given a screen to drag and drop anything you'd like here (there's also a desktop uploader for larger files).

Once I dropped in the data I downloaded from Vanderbilt's website linked from the original blog post, I enter some optional metadata, a description, a link back to the original post:

I then instantly receive a citeable DOI where the data is stored permanently, regardless of Will's future at Vanderbilt:

Ensembl/Entrez hg19/GRCh37 Consensus Genes. Stephen Turner. figshare. Retrieved 21:31, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.103113

There are also links to the side that allow you to export that citation directly to your reference manager of choice.

Finally, as an experiment, I also uploaded this entire blog post to Figshare, which is now citeable and permanently archived at Figshare:

Stop Hosting Data and Code on your Lab Website. Stephen Turner. figshare. Retrieved 22:51, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.105125.

Friday, January 4, 2013

Twitter Roundup, January 4 2013

I've said it before: Twitter makes me a lazy blogger. Lots of stuff came across my radar this week that didn't make it into a full blog post. Here's a quick recap:

PLOS Computational Biology: Chapter 1: Biomedical Knowledge Integration

Assuring the quality of next-generation sequencing in clinical laboratory practice : Nature Biotechnology

De novo genome assembly: what every biologist should know : Nature Methods

How deep is deep enough for RNA-Seq profiling of bacterial transcriptomes?

Silence | Abstract | Strand-specific libraries for high throughput RNA sequencing (RNA-Seq) prepared without poly(A) selection

BMC Genomics | Abstract | Comparison of metagenomic samples using sequence signatures

Peak identification for ChIP-seq data with no controls.

TrueSight: a new algorithm for splice junction detection using RNA-seq

DiffCorr: An R package to analyze and visualize differential correlations in biological networks.

PLOS ONE: Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons

Delivering the promise of public health genomics | Global Development Professionals Network

Metagenomics and Community Profiling: Culture-Independent Techniques in the Clinical Laboratory

PLOS ONE: A Model-Based Clustering Method for Genomic Structural Variant Prediction and Genotyping Using Paired-End Sequencing Data

InnoCentive - Metagenomics Challenge

Wednesday, January 2, 2013

Computing for Data Analysis, and Other Free Courses

Coursera's free Computing for Data Analysis course starts today. It's a four week long course, requiring about 3-5 hours/week. A bit about the course:

In this course you will learn how to program in R and how to use R for effective data analysis. You will learn how to install and configure software necessary for a statistical programming environment, discuss generic programming language concepts as they are implemented in a high-level statistical language. The course covers practical issues in statistical computing which includes programming in R, reading data into R, creating informative data graphics, accessing R packages, creating R packages with documentation, writing R functions, debugging, and organizing and commenting R code. Topics in statistical data analysis and optimization will provide working examples.

There are also hundreds of other free courses scheduled for this year. While the Computing for Data Analysis course is more about using R, the Data Analysis course is more about the methods and experimental designs you'll use, with a smaller emphasis on the R language. There are also courses on Scientific Computing, Algorithms, Health Informatics in the Cloud, Natural Language Processing, Introduction to Data Science, Scientific Writing, Neural Networks, Parallel Programming, Statistics 101, Systems Biology, Data Management for Clinical Research, and many, many others. See the link below for the full listing.

Free Courses on Coursera

Monday, December 31, 2012