Wednesday, January 22, 2014

Coursera Specializations: Data Science, Systems Biology, Python Programming

I first mentioned Coursera about a year ago, when I hired a new analyst in my core. This new hire came in as a very competent Python programmer with a molecular biology and microbial ecology background, but with very little experience in statistics. I got him to take Roger Peng's Computing for Data Analysis course and Jeff Leek's Data Analysis course, and four weeks later he was happily doing statistical analysis in R for gene expression experiments for several of our clients.

Today, Coursera announced Specializations - sequences of courses offered by the same institution, with the option of earning a specialization certificate from the university teaching the courses upon successful completion.

Among others, these specializations look particularly interesting:

Johns Hopkins University's Data Science Specialization

This specialization, one of the longer sequences, is taught by Brian Caffo, Roger Peng, and Jeff Leek at Johns Hopkins. The courses in the specialization include:


  • The Data Scientist’s Toolbox
  • R Programming
  • Getting and Cleaning Data
  • Exploratory Data Analysis
  • Reproducible Research
  • Statistical Inference
  • Regression Models
  • Practical Machine Learning
  • Developing Data Products
  • A final Capstone Project



  • Systems Biology (Icahn School of Medicine at Mount Sinai)

    Courses include:


  • Introduction to Systems Biology
  • Network Analysis in Systems Biology
  • Dynamical Modeling Methods for Systems Biology
  • Integrated Analysis in Systems Biology
  • A final Capstone Project



  • Fundamentals of Computing (Rice University)

    Courses include:


  • An Introduction to Interactive Programming in Python
  • Principles of Computing
  • Algorithmic Thinking
  • A final Capstone Project



  • Check out the Coursera Specializations page for other Coursera series.

    Monday, January 13, 2014

    How To Install BioPerl Without Root Privileges

    I've seen this question asked and partially answered all around the web. As with anything related to Perl, I'm sure there is more than one way to do it. Here's how I do it with Perl 5.10.1 on CentOS 6.4.

    First, install local::lib with the bootstrapping method as described here.
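A sketch of that bootstrap, assuming a hypothetical local::lib tarball version (substitute the current release from CPAN):

```shell
# Download and unpack local::lib (the version number here is hypothetical;
# grab the current release from CPAN), then bootstrap it into ~/perl5.
wget https://cpan.metacpan.org/authors/id/H/HA/HAARG/local-lib-2.000029.tar.gz
tar xzf local-lib-2.000029.tar.gz
cd local-lib-2.000029
perl Makefile.PL --bootstrap   # defaults to installing under ~/perl5
make test && make install
```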





    Next, put this in your .bashrc so that it's executed every time you log in:
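The standard line from the local::lib documentation (adjust the path if you bootstrapped somewhere other than ~/perl5):

```shell
# Set PERL5LIB, PATH, and related variables so Perl can find
# modules installed under ~/perl5
eval "$(perl -I$HOME/perl5/lib/perl5 -Mlocal::lib)"
```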



    Log out then log back in, then download and install BioPerl, answering "yes" to any question asking you to download and install dependencies when necessary:
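One way to do this step is through the CPAN shell; the module name Bio::Perl pulls in the BioPerl distribution, and with local::lib active everything lands under ~/perl5:

```shell
# Start the CPAN shell; with local::lib active, installs go into ~/perl5.
perl -MCPAN -e shell
# Then, at the cpan> prompt:
#   install Bio::Perl
# Answer "yes" when asked whether to follow dependencies.
```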



    Tuesday, December 31, 2013

    Jeff Leek's non-comprehensive list of awesome things other people did in 2013

    Jeff Leek, biostatistics professor at Johns Hopkins and instructor of the Coursera Data Analysis course, recently posted on Simply Statistics this list of awesome things other people accomplished in 2013 in genomics, statistics, and data science.

    At risk of sounding too meta, I'll say that this list itself is one of the awesome things that was put together in 2013. You should go browse the entire post for yourself, but I'll highlight a few that I saved to my reading list:


    This is only a sample of what's posted on Jeff's blog. Go read the full post below.

    Simply Statistics: A non-comprehensive list of awesome things other people did this year.

    Wednesday, December 18, 2013

    Curoverse raises $1.5M to develop & support an open-source bioinformatics data analysis platform



    Boston-based startup Curoverse has announced $1.5 million in funding to develop and support the open-source Arvados platform for cloud-based bioinformatics & genomics data analysis.

    The Arvados platform was developed in George Church's lab by scientists and engineers led by Alexander Wait Zaranek, now scientific director at Curoverse. According to the Arvados wiki:

    Arvados is a platform for storing, organizing, processing, and sharing genomic and other biomedical big data. The platform is designed to make it easier for bioinformaticians to develop analyses, developers to create genomic web applications and IT administrators to manage large-scale compute and storage genomic resources. The platform is designed to run on top of "cloud operating systems" such as Amazon Web Services and OpenStack. Currently, there are implementations that work on AWS and Xen+Debian/Ubuntu. ... A set of relatively low-level compute and data management functions are consistent across a wide range of analysis pipelines and applications that are being built for genomic data. Unfortunately, every organization working with these data has been forced to build its own custom systems for these low level functions. At the same time, there are proprietary platforms emerging that seek to solve these same problems. Arvados was created to provide a common solution across a wide range of applications that would be free and open source.

    A few questions should be apparent: What value does Arvados provide over other more widely used platforms (namely, Galaxy) that also aim to enable reproducibility, transparency, sharing, collaboration, and data/workflow management with biological big data? And how does Curoverse distinguish itself from other cloud-based bioinformatics services like Seven Bridges, DNA Nexus, and the next implement-GATK-on-Amazon-and-sell-it-back-to-me service provider that pops up? I understand that there are real costs with free software, but will the service that Curoverse provides be valuable and cost-effective enough to overcome the activation energy and make up for the "switching costs" that the average bioinformatician faces on adopting a new way of doing things? While the platform and the support model sound potentially very useful, these are all questions that the Curoverse team will need to carefully consider in attracting new users.

    Arvados open-source bioinformatics analysis platform: https://arvados.org/

    Curoverse: https://curoverse.com/

    Press Release: http://www.prweb.com/releases/2013/12/prweb11424292.htm

    Monday, December 9, 2013

    Biostar Tutorial: Cheat sheet for one-based vs zero-based coordinate systems

    Obi Griffith over at Biostar put together this excellent cheat sheet for dealing with one-based and zero-based genomic coordinate systems. The cheat sheet visually explains the difference between zero and one-based coordinate systems, as well as how to indicate a position, SNP, range, or indel using both coordinate systems.
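As a one-line sanity check of the rules the cheat sheet covers: BED intervals are zero-based and half-open, while one-based systems are fully closed, so converting a BED interval to one-based coordinates means adding 1 to the start and leaving the end alone:

```shell
# Convert a BED interval (0-based, half-open) to 1-based, fully-closed
# coordinates: add 1 to the start; the end is unchanged.
printf 'chr1\t0\t10\n' | awk 'BEGIN{OFS="\t"} {print $1, $2+1, $3}'
# chr1    1       10
```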

    Biostar Tutorial: Cheat sheet for one-based vs zero-based coordinate systems


    Wednesday, November 20, 2013

    Using Database Joins to Compare Results Sets

    One of the most powerful tools you can learn to use in genomics research is a relational database system, such as MySQL.  These systems are fairly easy to set up and use, and provide users the ability to organize and manipulate data and statistical results with simple commands.  As a graduate student (during the height of GWAS), this single skill quickly turned me into an “expert”.  By storing the SNP lists for common GWAS platforms and some simple annotations from the UCSC and ENSEMBL databases, I could quickly provide lists of SNPs within a gene or collection of genes, or pull a list of SNPs that overlap two genotyping platforms.  We even developed database modules that allowed us to easily define LD blocks within a database query (called LD-Spline).

    Once you learn the basics of defining tables and loading data, you can start to join tables together, matching them on a common field.  This is where the true power of a database system lies.  Suppose you have two sets of results from a PLINK analysis, one from a discovery dataset and another from a replication.  Rather than clumsily matching two sets of results within a spreadsheet application, a few simple queries within MySQL will tell you which SNPs are in common between the two sets, which were not found in the replication set, which SNPs were significant in the first set but not the second, etc.  

    The concept that makes these operations work is the idea of a primary key.  A primary key is some field of a dataset that uniquely identifies each row of the table/dataset.  In the above example of PLINK results, a good primary key might be the RS number of the SNP.  You can also uniquely identify rows based on two columns, a concept known as a composite key – for example, the chromosome AND position of a SNP.  Establishing a primary key allows MySQL to keep data stored in a sorted order and allows the matching operations for table joins to be performed much faster. 

    Having this sorted order from a primary key prevents MySQL from having to scan an entire table to find a specific value.  Much like the index of a book, a primary key lets MySQL find a value within a table very quickly.  If a table is small, having a primary key is not as critical; the computer can quickly scan the entire contents of the table for any query.  If the table is large, however, a full scan of the entire table could be a costly operation, and the number of table scans required increases when doing a join.  For example, if we join tables for our discovery and replication results sets, the database system will take the RS number for each entry from the discovery table and attempt to find a matching RS number in the replication table.  If the replication table has the RS number as a primary key, the database system can very quickly find this entry. There is a fantastic post on the various types of database joins here.

    Let's start by creating our database tables.  A typical PLINK association output contains 12 columns (CHR, SNP, BP, A1, TEST, NMISS, OR, SE, L95, U95, STAT, P).  In these tables, we've established the SNP column as the primary key.  Recall that the primary key must uniquely identify each row of the table, but sometimes PLINK will report multiple TEST rows per SNP.  If this is the case, we may need to either establish a composite key using PRIMARY KEY (`snp`,`test`), or simply eliminate the extra rows from the data file using an AWK command.
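For example, assuming the standard PLINK logistic output where TEST is the fifth column, a one-liner like this (the output filename is arbitrary) keeps the header plus only the additive-model rows:

```shell
# Keep the header (line 1) plus rows whose TEST field (column 5) is "ADD"
awk 'NR == 1 || $5 == "ADD"' input-file.assoc.logistic > discovery.assoc
```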

    CREATE TABLE `discovery` (
        `chr` varchar(2),
        `snp` varchar(32),
        `bp` int,
        `a1` varchar(1),
        `test` varchar(8),
        `nmiss` int,
        `or` float,
        `se` float,
        `l95` float,
        `u95` float,
        `stat` float,
        `p` float,
        PRIMARY KEY (`snp`)
    );
    
    CREATE TABLE `replication` (
        `chr` varchar(2),
        `snp` varchar(32),
        `bp` int,
        `a1` varchar(1),
        `test` varchar(8),
        `nmiss` int,
        `or` float,
        `se` float,
        `l95` float,
        `u95` float,
        `stat` float,
        `p` float,
        PRIMARY KEY (`snp`)
    );
    

    Before loading our data into these tables, a little pre-processing is helpful.  To ensure that results are easy to read on the screen, PLINK's developers used leading spaces in the column format for many PLINK outputs.  These make loading the results into a database difficult.  We can resolve this by running a simple sed command:
    sed -r -e 's/\s+/\t/g' -e 's/^\t//' input-file.assoc.logistic > discovery.load
    This will convert each run of whitespace to a single tab, eliminate the leading tab, and write the results to a new file, discovery.load.  Now let's load this file into our table, and repeat the procedure for our replication file.

    LOAD DATA LOCAL INFILE '{PathToFile}/discovery.load' INTO TABLE 
    discovery FIELDS TERMINATED BY '\t' IGNORE 1 LINES;
    Now we should have two MySQL database tables with the discovery and replication results sets loaded into them.  We can view their contents with a simple SELECT statement.  Then, finally, we can join these two tables to easily compare the results from the discovery and replication analyses.


    SELECT * FROM discovery INNER JOIN replication ON 
    discovery.snp = replication.snp;
    The syntax is simple: select a set of fields -- in this case all of them (represented by the *) -- from the first table (discovery), matching each row from this table to a row in the second table (replication) where the discovery SNP equals the replication SNP.  MySQL also supports table aliases, which can make these queries a bit easier to write.  An alias is simply a label specified after a table name that can be used in the rest of the query in place of the full table name.  For example, in the query below, we use a for the discovery table and b for the replication table.

    SELECT * FROM discovery a INNER JOIN replication b ON 
    a.snp = b.snp;
    With practice and additional data, join operations can be used to annotate results by gene or region, and to match these to results from other studies, such as the NHGRI GWAS catalog.
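The post also mentions finding SNPs that were not found in the replication set; that is a LEFT JOIN plus an IS NULL filter. The sketch below uses SQLite rather than MySQL purely so the example is self-contained, with toy data; the JOIN syntax is the same.

```shell
# Toy demo: which discovery SNPs have no matching row in replication?
# (A LEFT JOIN keeps every discovery row; missing matches come back NULL.)
rm -f demo.db
sqlite3 demo.db <<'EOF'
CREATE TABLE discovery   (snp TEXT PRIMARY KEY, p REAL);
CREATE TABLE replication (snp TEXT PRIMARY KEY, p REAL);
INSERT INTO discovery   VALUES ('rs1', 1e-8), ('rs2', 0.04);
INSERT INTO replication VALUES ('rs1', 2e-6);
SELECT a.snp FROM discovery a
  LEFT JOIN replication b ON a.snp = b.snp
  WHERE b.snp IS NULL;
EOF
# rs2
```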


    Wednesday, November 6, 2013

    A Mitochondrial Manhattan Plot

    Manhattan plots have become the standard way to visualize results of genetic association studies, allowing the viewer to instantly see significant results in the rough context of their genomic position.  Manhattan plots are typically shown on a linear X-axis (although the circos package can be used for radial plots), consistent with the linear representation of the genome in online genome browsers.  Many genetic studies overlook the other genome within all our cells: the mitochondrial genome. This circular molecule has been shown to be associated (albeit inconsistently) with many disease traits, and functional variants from this genome are now included in most genotyping platforms.

    Thanks to the clever work of several graduate students in the Vanderbilt Center for Human Genetics Research (most notably Ben Grady), mitochondrial genetic associations can be visualized in the context of key regions of the mitochondrial genome using a radial plot in the R package ggplot2.

    To make this plot, download the code and run the script (alternatively open the script in R and run interactively):

    On the command line:

    git clone https://github.com/stephenturner/solarplot.git
    cd solarplot
    R CMD BATCH solarplot.R


    GitHub: Mitochondrial solar plot code and data

    Monday, November 4, 2013

    Archival and analysis of #GI2013 Tweets

    I archived and analyzed all Tweets containing #GI2013 from the recent Cold Spring Harbor Genome Informatics meeting, using my previously described code.

    Friday was the most Tweeted day. Perhaps this was due to Lior Pachter's excellent keynote, "Stories from the Supplement."

    There was clearly a bimodal distribution in the Tweets by hour, corresponding to the morning and evening sessions:

    Top hashtags used other than #GI2013:

    Most prolific users:

    And of course, the much-maligned word cloud:

    Archive of all #GI2013 Tweets

    R code for analysis
    Creative Commons License
    Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.