Tuesday, February 11, 2014

There is no Such Thing as Biomedical "Big Data"

At the moment, the world is obsessed with “Big Data” yet it sometimes seems that people who use this phrase don’t have a good grasp of its meaning.  Like most good buzz-words, “Big Data” sparks the idea of something grand and complicated, while sounding ordinary enough that listeners feel like they have an intuitive understanding of the concept.  However "Big Data" actually carries a specific technical meaning which is getting lost as the term becomes more popular.    

The phrase's predecessor, "Data Mining" was equally misunderstood.  Originally called "database mining" (a subsequently trademarked term), the term "Data Mining" became common during the 1990s as many businesses rapidly adopted the use of relational database management systems (RDMS) such as Oracle.  RDMS store, optimize, and manage large amounts of data on physical disks for the purpose of rapid search, retrieval and update.  These large collections of data enabled businesses to extract new knowledge useful to their business practices by examining patterns within their data.  Data mining refers to a collection of algorithms that attempt to extract knowledge (in the form of rules or associations) from large amounts of data by processing it in place on the disk, either within the RDMS or within large flat files.  This is an important distinction, as the optimization and speed of algorithms that access data from the disk can be quite different from those which examine data within active memory.

A great example of data mining in practice is the use of frequent itemset mining to target customers with coupons and other discounts.  Ever wonder why you get a yogurt coupon at the register when you check out?  That’s because across thousands of other customers, a subgroup of people with shopping habits similar to yours consistently buys yogurt, and perhaps with a little prompting the grocery vendor can get you to consistently buy yogurt too.

As random access memory (RAM) prices dropped (and virtual memory procedures within operating systems improved), it became much more feasible to process even extremely large datasets within active memory, reducing the need for algorithmic refinements necessary for disk-based processing. But by then, "data mining" (marketed as a way to increase business profits) had become such a popular buzz-word, it was used to refer to any type of data analysis. 

In fact, it has been tacked on to numerous books and publications about machine learning methods purely for marketing purposes.  Supposedly even some publishers have modified the titles of machine learning books to include the phrase “data mining” with the hope that it will improve sales.  As a result, the colloquial meaning of data mining has become "a vaguely defined way to discover patterns in data".  To me, this is a tragedy because we have lost a degree of specificity in our language simply because “data mining” sounds cool and profitable.   

Zoom forward to the present day and we see history repeating with the phrase “big data”.  With the increasing popularity of cell phone technologies and the internet, the last decade has seen a dramatic growth in data generated by commercial transactions and online websites.   Many of these large companies (think Google's Search Indices or Facebook and LinkedIn network data) generate data on such a large scale that it cannot be managed within traditional database systems.  These groups have instead turned to large computing clusters that distribute the data over many, many separate machines and file systems.  Partitioning data in this way requires a new class of algorithms that can take advantage of the fact that individual processing units (or nodes of a computing cluster) house their own subset of the overall data.  This is a fundamental paradigm of “Big Data” algorithms, making it distinct from other machine learning and data mining techniques.

A great example of a “Big Data” programming model is the MapReduce framework developed by Google.  The basic idea is that any data manipulation step has a Map function that can be distributed over many, many nodes on a computing cluster that then filter and sort their own portion of the data.  A Reduce function is then performed that combines the selected and sorted data entries into a summary value.  This model is implemented by the popular Big Data system Apache Hadoop.

All of the fuss over “Big Data” is driven by these massive producers of data (on the order of hundreds of terabytes a day), yet the ideas behind “Big Data” are being applied on much smaller datasets even when they are not necessary.  In fact, a rather amusing read from Microsoft Research describes the overhype of “Big Data” algorithms and the surprisingly few analytic operations that truly need these approaches.  The hype is alive and well in the medical and biological research community as well.  In fact, there is an NIH initiative to fund “Big Data to Knowledge”.  I'm the first to cheer for projects dedicated to large-scale data analysis, but by nearly any definition, right now there is no such thing as biomedical Big Data.     

There are certainly processes in biomedical research that produce large amounts of data – first among them is next generation sequencing technology.  In sequencing studies, the raw data from sequencers is aligned and processed to extract the meaningful information (i.e. SNP and CNV calls).  After processing, a full human genome will nearly fit on a floppy disk, which hardly qualifies as "Big Data".  While there may be some interest in storing the raw underlying data (sequence reads), it may prove much more cost effective to simply regenerate the data.  Based on an excellent analysis by Glenn Lockwood, storing four weeks worth of HiSeq X10 raw data may cost nearly $10,000 a month.  If instead we store derived features from the raw data, data storage and manipulation is on the order of typical imputed GWAS.  There will undoubtedly be a desire to reprocess this raw sequence data with new algorithms, but unless storage prices drop rapidly, regenerating the data will be more cost-effective than storage.  Therefore, in my opinion right now the closest thing to qualifying as Big Data would be large multi-center electronic medical record systems, yet even these are typically managed by large-scale relational database systems.

So in practice, our grants will be filled with mentions of Big Data, Web 3.0, "thinking outside the box", value added, hype/innovation, but in reality biomedical sciences are nowhere near approaching the scale needed for real Big Data approaches.    

Thanks to Alex Fish for her thoughtful edits.


  1. You might want to look this site "http://freeman-lab.github.io/thunder/" for real time many tera-byte-level neural signal processing, which is biomedical "big data". Surely those data can be processed offline with a single computer over hours to months. But that's too slow. Scientists needs real-time processing to do efficient experiment design.

    1. BTW, this is not rare situation. This field is exploding.

  2. There's more to biomedical data than mere SNP interpretation. The human microbiome for example -- and realize that while right now people are mostly focusing on the bacterial component, there are many small eukaryotes living in and on us as well -- and many of them (amoebas for example), actually have larger genomes than humans...

  3. A couple of comments on this:

    1). It is always useful to know where certain terms come from and their original meaning. However, as we are all aware, English is a living language and is constantly changing and evolving. So original meanings are often lost when a term becomes widely used. Trying to revert usage to the original is a lost cause; we just have to go with the flow on that. So, as much as it pains me to hear people "flush out an idea" when they really mean to "flesh out an idea", I'm not going to change that unfortunate trend.

    2). As for the larger issue of biomedical data not really being big data, I think this is truly arguable. As is pointed out, one can distill much of a genome sequence into a quite small file, but this does eliminate a tremendous amount of potential additional information. The alternative of regenerating the raw data has its potential advantages, but may not be possible for many reasons (loss of original sample, high preparation costs, etc.); in which case storing the original data at whatever cost is appropriate.

    It is also becoming more clear that any single person does not have a single genome. There are a large number of somatic mutations throughout the body, which we are only now beginning to appreciate. So, how many whole genomes do we need to do for any one person? When you start talking about multiple genomes per person and extending this to whole populations, I think you get to big data in the original sense. Additionally, other type of biomedical data (e.g. gene expression profiles, proteomic profiles, microbiome profiles) are time, tissue, and perhaps even cell specific). Add that to the multiple genomes and you are really into the realm of big data. True, much of this is not being done today, but it is coming and we might as well prepare for it...e.g. be thinking in big data terms.


Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.