Thursday, February 20, 2014

Data Analysis for Genomics MOOC

Last month I told you about Coursera's specializations in data science, systems biology, and computing. Today I was reading Jeff Leek's blog post defending p-values and found a link to HarvardX's Data Analysis for Genomics course, taught by Rafael Irizarry and Mike Love. Here's the course description:

Data Analysis for Genomics will teach students how to harness the wealth of genomics data arising from new technologies, such as microarrays and next generation sequencing, in order to answer biological questions, both for basic cell biology and clinical applications.

The purpose of this course is to enable students to analyze and interpret data generated by modern genomics technology, specifically microarray data and next generation sequencing data. We will focus on applications common in public health and biomedical research: measuring gene expression differences between populations, associating genomic variants with disease, measuring epigenetic marks such as DNA methylation, and transcription factor binding sites.

The course covers the necessary statistical concepts needed to properly design experiments and analyze the high dimensional data produced by these technologies. These include estimation, hypothesis testing, multiple comparison corrections, modeling, linear models, principal component analysis, clustering, nonparametric and Bayesian techniques. Along the way, students will learn to analyze data using the R programming language and several packages from the Bioconductor project.

Currently, biomedical research groups around the world are producing more data than they can handle. The training and skills acquired by taking this course will be of significant practical use for these groups. The learning that will take place in this course will allow for greater success in making biological discoveries and improving individual and population health.


If you've ever wanted to get started with data analysis in genomics and learn R along the way, this looks like a great place to begin. The course is scheduled to start April 7, 2014.

HarvardX: Data Analysis for Genomics

Tuesday, February 11, 2014

There is no Such Thing as Biomedical "Big Data"

At the moment, the world is obsessed with “Big Data”, yet it sometimes seems that people who use this phrase don’t have a good grasp of its meaning.  Like most good buzzwords, “Big Data” sparks the idea of something grand and complicated, while sounding ordinary enough that listeners feel like they have an intuitive understanding of the concept.  However, “Big Data” actually carries a specific technical meaning, which is getting lost as the term becomes more popular.

The phrase's predecessor, "Data Mining", was equally misunderstood.  Originally called "database mining" (a subsequently trademarked term), the term "Data Mining" became common during the 1990s as many businesses rapidly adopted relational database management systems (RDBMS) such as Oracle.  An RDBMS stores, optimizes, and manages large amounts of data on physical disks for the purpose of rapid search, retrieval, and update.  These large collections of data enabled businesses to extract new knowledge useful to their business practices by examining patterns within their data.  Data mining refers to a collection of algorithms that attempt to extract knowledge (in the form of rules or associations) from large amounts of data by processing it in place on the disk, either within the RDBMS or within large flat files.  This is an important distinction, as the optimization and speed of algorithms that access data from the disk can be quite different from those that examine data within active memory.

A great example of data mining in practice is the use of frequent itemset mining to target customers with coupons and other discounts.  Ever wonder why you get a yogurt coupon at the register when you check out?  That’s because across thousands of other customers, a subgroup of people with shopping habits similar to yours consistently buys yogurt, and perhaps with a little prompting the grocery vendor can get you to consistently buy yogurt too.
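To make the yogurt example concrete, here is a minimal sketch of the idea behind frequent itemset mining. The baskets are made-up data, `frequent_pairs` and `min_support` are hypothetical names, and the brute-force pair counting below runs entirely in memory, so it illustrates only the counting logic and not the disk-oriented algorithms a real data mining system would use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (made-up data for illustration only).
baskets = [
    {"granola", "yogurt", "bananas"},
    {"granola", "yogurt"},
    {"granola", "milk"},
    {"yogurt", "bananas"},
    {"granola", "yogurt", "milk"},
]


def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort the items so each pair is counted under one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(baskets, min_support=3)
print(pairs)
```

A shopper who buys granola but not yogurt would then be a natural target for the yogurt coupon, because the two items co-occur often among other customers.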

As random access memory (RAM) prices dropped (and virtual memory procedures within operating systems improved), it became much more feasible to process even extremely large datasets within active memory, reducing the need for algorithmic refinements necessary for disk-based processing. But by then, "data mining" (marketed as a way to increase business profits) had become such a popular buzz-word, it was used to refer to any type of data analysis. 

 
In fact, it has been tacked on to numerous books and publications about machine learning methods purely for marketing purposes.  Supposedly even some publishers have modified the titles of machine learning books to include the phrase “data mining” with the hope that it will improve sales.  As a result, the colloquial meaning of data mining has become "a vaguely defined way to discover patterns in data".  To me, this is a tragedy because we have lost a degree of specificity in our language simply because “data mining” sounds cool and profitable.   

Zoom forward to the present day and we see history repeating with the phrase “big data”.  With the increasing popularity of cell phone technologies and the internet, the last decade has seen a dramatic growth in data generated by commercial transactions and websites.  Many large companies (think Google's search indices or Facebook and LinkedIn network data) generate data on such a large scale that it cannot be managed within traditional database systems.  These groups have instead turned to large computing clusters that distribute the data over many, many separate machines and file systems.  Partitioning data in this way requires a new class of algorithms that can take advantage of the fact that individual processing units (or nodes of a computing cluster) house their own subset of the overall data.  This is a fundamental paradigm of “Big Data” algorithms, making them distinct from other machine learning and data mining techniques.

A great example of a “Big Data” programming model is the MapReduce framework developed by Google.  The basic idea is that any data manipulation step has a Map function that can be distributed over many, many nodes on a computing cluster that then filter and sort their own portion of the data.  A Reduce function is then performed that combines the selected and sorted data entries into a summary value.  This model is implemented by the popular Big Data system Apache Hadoop.
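The Map and Reduce steps described above can be sketched in a few lines. This is a single-process toy, not Hadoop or Google's implementation: the function names (`map_words`, `shuffle`, `reduce_counts`) and the text chunks are assumptions for illustration, and each string in `chunks` stands in for the portion of data held by one node.

```python
from collections import defaultdict


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one node's chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key, as a framework would between the steps."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a summary value."""
    return key, sum(values)


# Each string plays the role of the data stored on a separate node (made up).
chunks = ["big data big hype", "data mining", "big cluster"]

mapped = [map_words(chunk) for chunk in chunks]  # would run in parallel on a cluster
word_counts = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

The point of the design is that `map_words` never needs to see another node's data, which is what lets the work be spread across many machines.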

All of the fuss over “Big Data” is driven by these massive producers of data (on the order of hundreds of terabytes a day), yet the ideas behind “Big Data” are being applied on much smaller datasets even when they are not necessary.  In fact, a rather amusing read from Microsoft Research describes the overhype of “Big Data” algorithms and the surprisingly few analytic operations that truly need these approaches.  The hype is alive and well in the medical and biological research community as well.  In fact, there is an NIH initiative to fund “Big Data to Knowledge”.  I'm the first to cheer for projects dedicated to large-scale data analysis, but by nearly any definition, right now there is no such thing as biomedical Big Data.     

There are certainly processes in biomedical research that produce large amounts of data – first among them is next generation sequencing technology.  In sequencing studies, the raw data from sequencers is aligned and processed to extract the meaningful information (e.g., SNP and CNV calls).  After processing, a full human genome will nearly fit on a floppy disk, which hardly qualifies as "Big Data".  While there may be some interest in storing the raw underlying data (sequence reads), it may prove much more cost-effective to simply regenerate the data.  Based on an excellent analysis by Glenn Lockwood, storing four weeks' worth of HiSeq X10 raw data may cost nearly $10,000 a month.  If instead we store derived features from the raw data, data storage and manipulation is on the order of a typical imputed GWAS.  There will undoubtedly be a desire to reprocess this raw sequence data with new algorithms, but unless storage prices drop rapidly, regenerating the data will be more cost-effective than storage.  Therefore, in my opinion, right now the closest thing to qualifying as Big Data would be large multi-center electronic medical record systems, yet even these are typically managed by large-scale relational database systems.

So in practice, our grants will be filled with mentions of Big Data, Web 3.0, "thinking outside the box", value added, hype/innovation, but in reality biomedical sciences are nowhere near approaching the scale needed for real Big Data approaches.    

Thanks to Alex Fish for her thoughtful edits.

Thursday, January 30, 2014

GNU Screen

This is one of those things I picked up years ago while in graduate school that I just assumed everyone else already knew about. GNU screen is a great utility built into most Linux installations for remote session management. Typing 'screen' at the command line starts a new screen session. Once launched, you can start processes in the screen session, detach the session by pressing Ctrl-a followed by d, then reattach at a later point with 'screen -r' and resume where you left off. See this screencast I made below:

Wednesday, January 22, 2014

Coursera Specializations: Data Science, Systems Biology, Python Programming

I first mentioned Coursera about a year ago, when I hired a new analyst in my core. This new hire came in as a very competent Python programmer with a molecular biology and microbial ecology background, but with very little experience in statistics. I got him to take Roger Peng's Computing for Data Analysis course and Jeff Leek's Data Analysis course, and four weeks later he was happily doing statistical analysis in R for gene expression experiments for several of our clients.

Today, Coursera announced Specializations: sequences of courses offered by the same institution, with the option of earning a specialization certificate from the university teaching the courses upon successful completion.

Among others, several specializations that look particularly interesting are:

Johns Hopkins University's Data Science Specialization

This specialization, one of the longer ones, is taught by Brian Caffo, Roger Peng, and Jeff Leek at Johns Hopkins. The courses in the specialization include:


  • The Data Scientist’s Toolbox
  • R Programming
  • Getting and Cleaning Data
  • Exploratory Data Analysis
  • Reproducible Research
  • Statistical Inference
  • Regression Models
  • Practical Machine Learning
  • Developing Data Products
  • A final Capstone Project



  • Systems Biology (Icahn School of Medicine at Mount Sinai)

    Courses include:


  • Introduction to Systems Biology
  • Network Analysis in Systems Biology
  • Dynamical Modeling Methods for Systems Biology
  • Integrated Analysis in Systems Biology
  • A final Capstone Project



  • Fundamentals of Computing (Rice University)

    Courses include:


  • An Introduction to Interactive Programming in Python
  • Principles of Computing
  • Algorithmic Thinking
  • A final Capstone Project



Check out the Coursera Specializations page for other Coursera series.

    Monday, January 13, 2014

    How To Install BioPerl Without Root Privileges

    I've seen this question asked and partially answered all around the web. As with anything related to Perl, I'm sure there is more than one way to do it. Here's how I do it with Perl 5.10.1 on CentOS 6.4.

    First, install local::lib with the bootstrapping method as described here.





    Next, put this in your .bashrc so that it's executed every time you log in:



    Log out then log back in, then download and install BioPerl, answering "yes" to any question asking you to download and install dependencies when necessary:



    Tuesday, December 31, 2013

    Jeff Leek's non-comprehensive list of awesome things other people did in 2013

    Jeff Leek, biostats professor at Johns Hopkins and instructor of the Coursera Data Analysis course, recently posted on Simply Statistics this list of awesome things other people accomplished in 2013 in genomics, statistics, and data science.

    At risk of sounding too meta, I'll say that this list itself is one of the awesome things that was put together in 2013. You should go browse the entire post for yourself, but I'll highlight a few that I saved to my reading list:


    This is only a sample of what's posted on Jeff's blog. Go read the full post below.

    Simply Statistics: A non-comprehensive list of awesome things other people did this year.

    Wednesday, December 18, 2013

    Curoverse raises $1.5M to develop & support an open-source bioinformatics data analysis platform



    Boston-based startup Curoverse has announced $1.5 million in funding to develop and support the open-source Arvados platform for cloud-based bioinformatics & genomics data analysis.

    The Arvados platform was developed in George Church's lab by scientists and engineers led by Alexander Wait Zaranek, now scientific director at Curoverse. According to the Arvados wiki:

    Arvados is a platform for storing, organizing, processing, and sharing genomic and other biomedical big data. The platform is designed to make it easier for bioinformaticians to develop analyses, developers to create genomic web applications and IT administers to manage large-scale compute and storage genomic resources. The platform is designed to run on top of "cloud operating systems" such as Amazon Web Services and OpenStack. Currently, there are implementations that work on AWS and Xen+Debian/Ubuntu. ... A set of relatively low-level compute and data management functions are consistent across a wide range of analysis pipelines and applications that are being built for genomic data. Unfortunately, every organization working with these data have been forced to build their own custom systems for these low level functions. At the same time, there are proprietary platforms emerging that seek to solve these same problems. Arvados was created to provide a common solution across a wide range of applications that would be free and open source.

    A few questions should be apparent: What value does Arvados provide over other more widely used platforms (namely, Galaxy) that also aim to enable reproducibility, transparency, sharing, collaboration, and data/workflow management with biological big data? And how does Curoverse distinguish itself from other cloud-based bioinformatics services like Seven Bridges, DNAnexus, and the next implement-GATK-on-Amazon-and-sell-it-back-to-me service provider that pops up? I understand that there are real costs with free software, but will the service that Curoverse provides be valuable and cost-effective enough to overcome the activation energy and make up for the "switching costs" that the average bioinformatician faces on adopting a new way of doing things? While the platform and the support model sound potentially very useful, these are all questions that the Curoverse team will need to carefully consider in attracting new users.

    Arvados open-source bioinformatics analysis platform: https://arvados.org/

    Curoverse: https://curoverse.com/

    Press Release: http://www.prweb.com/releases/2013/12/prweb11424292.htm

    Monday, December 9, 2013

    Biostar Tutorial: Cheat sheet for one-based vs zero-based coordinate systems

    Obi Griffith over at Biostar put together this excellent cheat sheet for dealing with one-based and zero-based genomic coordinate systems. The cheat sheet visually explains the difference between zero and one-based coordinate systems, as well as how to indicate a position, SNP, range, or indel using both coordinate systems.
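As a small companion to the cheat sheet, here is a sketch of converting a range between the two systems. It assumes the common convention that one-based ranges are closed (both ends included, as in GFF) and zero-based ranges are half-open (end excluded, as in BED); the function names are hypothetical and the sequence is made up.

```python
def one_to_zero_based(start, end):
    """Convert a one-based, closed range to a zero-based, half-open range."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts: the closed end at N equals the half-open end at N.
    return start - 1, end


def zero_to_one_based(start, end):
    """Convert a zero-based, half-open range to a one-based, closed range."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


seq = "ACGTACGT"  # made-up sequence for illustration

# Bases 3 through 5 in one-based coordinates...
start0, end0 = one_to_zero_based(3, 5)
# ...map directly onto Python's zero-based, half-open slicing.
print(start0, end0, seq[start0:end0])
```

Under this convention the length of a range is `end - start` in zero-based coordinates and `end - start + 1` in one-based coordinates, which is a frequent source of off-by-one errors when mixing file formats.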

    Biostar Tutorial: Cheat sheet for one-based vs zero-based coordinate systems


    Creative Commons License
    Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.