
Friday, April 10, 2015

R User Group Recap: Heatmaps and Using the caret Package

At our most recent R user group meeting we were delighted to have presentations from Mark Lawson and Steve Hoang, both bioinformaticians at Hemoshear. All of the code used in both demos is in our Meetup’s GitHub repo.

Making heatmaps in R

Steve started with an overview of making heatmaps in R. Using the iris dataset, Steve demonstrated making heatmaps of the continuous iris data using the heatmap.2 function from the gplots package, the aheatmap function from NMF, and the hard way using ggplot2. The "best in class" method used aheatmap to draw an annotated heatmap, plotting column z-scores rather than raw values, with annotated rows and the Pearson correlation (rather than Euclidean distance) as the distance metric.
library(dplyr)
library(NMF)
library(RColorBrewer)

# prep iris data for plotting: move Species into the rownames,
# keeping a purely numeric matrix for the heatmap
iris2 <- iris
rownames(iris2) <- make.names(iris2$Species, unique = TRUE)
iris2 <- iris2 %>% select(-Species) %>% as.matrix()

# annotated heatmap: column z-scores, Pearson correlation distance
aheatmap(iris2, color = "-RdBu:50", scale = "col", breaks = 0,
         annRow = iris["Species"], annColors = "Set2",
         distfun = "pearson", treeheight = c(200, 50),
         fontsize = 13, cexCol = 0.7,
         filename = "heatmap.png", width = 8, height = 16)

Classification and regression using caret

Mark wrapped up with a gentle introduction to the caret package for classification and regression training. This demonstration used the caret package to split data into training and testing sets, and run repeated cross-validation to train random forest and penalized logistic regression models for classifying Fisher’s iris data.
First, get a look at the data with the featurePlot function in the caret package:
library(caret)
set.seed(42)
data(iris)

# pairwise scatterplot matrix of all four predictors, colored by species
featurePlot(x = iris[, 1:4],
            y = iris$Species,
            plot = "pairs",
            auto.key = list(columns = 3))

Next, Mark split the data into training and testing sets and used caret to train and evaluate both random forest and partial least squares (PLS) models with repeated 10-fold cross-validation (see the code in the Meetup repo; a sketch also follows the results below). It turns out random forest outperforms PLS in this case, and performs fairly well overall:
                     setosa  versicolor  virginica
Sensitivity            1.00        1.00       0.00
Specificity            1.00        0.50       1.00
Pos Pred Value         1.00        0.50        NaN
Neg Pred Value         1.00        1.00       0.67
Prevalence             0.33        0.33       0.33
Detection Rate         0.33        0.33       0.00
Detection Prevalence   0.33        0.67       0.00
Balanced Accuracy      1.00        0.75       0.50
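
The full code is in the Meetup repo linked above, but the workflow looks roughly like this (a sketch, not the exact demo code; the 80/20 split and the number of repeats are my assumptions):

# Sketch of the caret workflow described above
library(caret)
set.seed(42)
data(iris)

# Stratified 80/20 split into training and testing sets
inTrain  <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# Repeated 10-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# Train random forest and partial least squares with the same resampling
rfFit  <- train(Species ~ ., data = training, method = "rf",  trControl = ctrl)
plsFit <- train(Species ~ ., data = training, method = "pls", trControl = ctrl)

# Evaluate both on the held-out testing set
confusionMatrix(predict(rfFit,  testing), testing$Species)
confusionMatrix(predict(plsFit, testing), testing$Species)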
A big thanks to Mark and Steve at Hemoshear for putting this together!

Wednesday, March 27, 2013

Evolutionary Computation and Data Mining in Biology

For over 15 years, members of the computer science, machine learning, and data mining communities have gathered in a beautiful European location each spring to share ideas about biologically-inspired computation. Stemming from the work of John Holland, who pioneered the field of genetic algorithms, multiple approaches have been developed that exploit the dynamics of natural systems to solve computational problems. These algorithms have been applied in a wide variety of fields, and to celebrate and cross-pollinate ideas from these various disciplines, the EvoStar event co-locates five conferences at the same venue, covering genetic programming (EuroGP), combinatorial optimization (EvoCOP), music, art, and design (EvoMUSART), multidisciplinary applications (EvoApplications), and computational biology (EvoBIO). EvoStar 2013 will be held in Vienna, Austria on April 3-5, and is always expertly coordinated by the wonderful Jennifer Willies from Napier University, UK. Multiple research groups from the US and Europe will attend to present their exciting work in these areas.

Many problems in bioinformatics and statistical analysis use what are considered "greedy" algorithms to fit parameters to data – that is, they settle on a nearby, locally optimal set of parameters and can miss the globally best solution. This problem is well known in the computer science community from toy problems like bin packing and the knapsack problem. In human genetics, related problems include partitioning complex pedigrees and selecting maximally unrelated individuals from a dataset, and the same issue can appear when maximizing likelihood equations.


EvoBIO focuses on using biologically-inspired algorithms (like genetic algorithms) to improve performance for many bioinformatics tasks.  For example, Stephen and I have both applied these methods for analysis of genetic data using neural networks, and for forward-time genetic data simulation (additional details here).
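
To make the idea concrete, here is a toy genetic algorithm in base R (purely illustrative; the fitness function and all parameters are made up for this example):

# Maximize a bumpy function on [0, 5] with a simple real-valued GA
f <- function(x) sin(10 * x) * x + cos(2 * x) * x

ga <- function(fitness, lower, upper, pop_size = 50, generations = 100,
               mutation_sd = 0.1) {
  pop <- runif(pop_size, lower, upper)      # random initial population
  for (gen in 1:generations) {
    fit <- sapply(pop, fitness)
    # Selection: sample parents with probability proportional to (shifted) fitness
    prob <- (fit - min(fit)) + 1e-9
    parents <- sample(pop, pop_size, replace = TRUE, prob = prob)
    # Crossover: average random pairs of parents
    mates <- sample(parents)
    offspring <- (parents + mates) / 2
    # Mutation: add Gaussian noise, then clamp to the search bounds
    offspring <- pmin(pmax(offspring + rnorm(pop_size, 0, mutation_sd),
                           lower), upper)
    pop <- offspring
  }
  pop[which.max(sapply(pop, fitness))]      # best individual found
}

ga(f, 0, 5)

Unlike a greedy hill-climber, the population-based search can hop between the local peaks of f rather than getting stuck on the first one it finds.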


EvoBIO is very pleased to be sponsored by BMC's BioData Mining, a natural partner for this conference. I recently wrote a blog post for BioMed Central about EvoBIO as well. Thanks to their sponsorship, the winner of the EvoBIO best paper award will receive free publication in BioData Mining, and runners-up will receive a 25% discount on the article processing charge.

So, if you are in the mood for a new conference and would like to see and influence some of these creative approaches to data analysis, consider attending EvoStar -- we'd love to see you there!

Tuesday, October 18, 2011

My thoughts on ICHG 2011

I'm a bit exhausted from a week of excellent science at ICHG. First, let me say that Montreal is a truly remarkable city with fantastic food and a fascinating blend of architectural styles, all of which made the meeting a fun place to be. Now on to the genomics – I'll recap a few of the most exciting sessions I attended. You can find a live stream of tweets from the meeting by searching the #ICHG2011 and #ICHG hashtags.


On Wednesday, Marylyn Ritchie (@MarylynRitchie) and Nancy Cox organized "Beyond Genome-wide Association Studies". Nancy Cox presented some ideas on how to integrate multiple "intermediate" associations for SNPs, such as expression QTLs and newly discovered protein QTLs (more on pQTLs later). This approach, which she called Functional Unit Analysis, would group signals together based on the genes they influence. Nicholas Schork presented some nice examples of the pros and cons of sequence-level annotation algorithms. Trey Ideker gave a very nice talk illustrating some of the properties of epistasis in yeast protein interaction networks. One of the more striking points he made was that epistasis tends to occur between full protein complexes rather than within elements of the complexes themselves. Marylyn Ritchie presented the ideas behind her ATHENA software for machine learning analysis of genetic data, and Manuel Mattheisen from Tim Becker's group presented the methods in their INTERSNP software for large-scale interaction analysis. What was most impressive about this session was the clear attempt to incorporate underlying biological complexity into data analysis.

On Thursday, I attended the second statistical genetics session, "Expanding Genome-wide Association Studies", organized by Saurabh Ghosh and Daniel Shriner. Having recently attended IGES, I feel pretty "up" on newer analysis techniques, but this session had a few talks that sparked my interest. The first three talks were related to haplotype phasing and the issues surrounding computational accuracy and speed. The basic goal of all these methods is to efficiently estimate genotypes for a common set of loci across all samples of a study using a set of reference haplotypes, usually from the HapMap or 1000 Genomes data. Despite these advances, it seems that phasing haplotypes for thousands of samples is still a massive undertaking that requires a high-performance computing cluster. There were several talks about ongoing epidemiological studies, including the Kaiser Permanente/UCSF cohort. Neil Risch presented an elegant study design implementing four custom GWAS chips for the four targeted populations. It looks like the data haven't started to flow from this yet, but when they do we're sure to learn about lots of interesting ethnicity-specific disease effects. My good friend and colleague Dana Crawford presented an in silico GWAS study of hypothyroidism. In her best NPR voice, Dana showed how electronic medical records with GWAS data in the eMERGE network can be re-used to construct entirely new studies nested within data collected for other disease-specific purposes. Her excellent postdoc, Logan Dumitrescu, presented several gene-environment interactions between lipid levels and vitamins A and E from Dana's EAGLE study. Finally, Paul O'Reilly presented a cool new way to look at multiple phenotypes by essentially flipping a typical regression equation around, estimating coefficients that relate each phenotype in a study to a single SNP genotype as the outcome. This rather clever approach, called MultiPhen, is similar to log-linear models I've seen used for transmission-based analysis, and it allows you to model the "interaction" among phenotypes in much the same way you would look at SNP interactions.
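
To sketch that flipped-regression idea in R: treat the ordinal genotype as the outcome and the phenotypes as joint predictors, e.g. with a proportional-odds model. (A toy example on simulated data using MASS::polr, not the MultiPhen package itself.)

library(MASS)
set.seed(1)
n <- 500
dat <- data.frame(
  geno = factor(sample(0:2, n, replace = TRUE), ordered = TRUE),  # SNP coded 0/1/2
  bmi  = rnorm(n),
  ldl  = rnorm(n),
  sbp  = rnorm(n)
)
# Proportional-odds regression: genotype as outcome, phenotypes as predictors
fit <- polr(geno ~ bmi + ldl + sbp, data = dat, Hess = TRUE)
summary(fit)  # coefficients jointly relate all phenotypes to the one SNP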

By far the most interesting talks of the meeting (for me) were in the genomics session on gene expression, organized by Tomi Pastinen and Mark Corbett. Chris Mason started the session off with a fantastic demonstration of the power of RNA-seq. Examining transcriptomes of 14 non-human primate species, his group validated many of the computational predictions in the AceView gene build and illustrated that most "exome" sequencing probably examines less than half of all transcribed sequences. Rupali Patwardhan talked about a system for examining the impact of promoter and enhancer mutations in whole mice, essentially using mutagenesis screens to localize these regions. Ron Hause presented work on the protein QTLs that Nancy Cox alluded to earlier in the conference. Using a high-throughput form of western blotting, they systematically examined levels of over 400 proteins in the Yoruba HapMap cell lines. They also illustrated that only about 50% of eQTLs identified in these lines actually alter protein levels. Stephen Montgomery spoke about the impact of rare genetic variants within a transcript on transcript levels. Essentially, he showed an epistatic effect on expression, where transcripts carrying deleterious alleles are less likely to be expressed – an intuitive and fascinating finding, especially for those considering rare-variant analysis. Athma Pai presented a new class of QTL that influences mRNA decay rates. By measuring multiple time points with RNA-seq, she found individual-level variants that alter decay, which she calls dQTLs. Veronique Adoue looked at cis-eQTLs relative to transcription factor binding sites using ChIP, and Alfonso Buil showed how genetic variants influence gene expression networks (i.e., correlation among gene expression) across tissue types.

I must say that despite all the awesome work presented in this session, Michael Snyder stole the show with his talk on the "Snyderome" – his own personal -omics profile collected over 21 months. His whole genome was sequenced by Complete Genomics and processed using Rong Chen and Atul Butte's risk-o-gram to quantify his disease risk. His profile predicted increased risk of T2D, so he began collecting glucose measures, and lo and behold, he saw a sustained spike in blood glucose levels a few days after a common cold. His interpretation was that an environmental stress knocked him into a pseudo-diabetic state, and his transcriptome and proteome results corroborated this idea. Granted, this is an N of 1, and there is still lots of work to be done before this type of analysis revolutionizes medicine, but the take-home message is salient – multiple -omics are better than one, and everyone's manifestation of a complex disease is different. This was truly thought-provoking work, and it nicely closed an entire session devoted to understanding the intermediate impact of genetic variants to better understand disease complexity.

This is just my take of a really great meeting -- I'm sure I missed lots of excellent talks.  If you saw something good please leave a comment and share!

Tuesday, March 8, 2011

Splitting a Dataset Revisited: Keeping Covariates Balanced Between Splits

In my previous post I showed you how to randomly split up a dataset into training and testing datasets. (Thanks to all those who emailed me or left comments letting me know that this could be done using other means. As things go with R, it's sometimes easier to write a new function yourself than it is to hunt down the function or package that already exists.)

What if you wanted to split a dataset into training/testing sets but ensure that there are no significant differences between a variable of interest across the two splits?

For example, if we use the splitdf() function from last time to split up the iris dataset, setting the random seed to 44, it turns out the outcome variable of interest, Sepal.Length, differs significantly between the two splits.

splitdf <- function(dataframe, seed=NULL) {
    if (!is.null(seed)) set.seed(seed)
    index <- 1:nrow(dataframe)
    trainindex <- sample(index, trunc(length(index)/2))
    trainset <- dataframe[trainindex, ]
    testset <- dataframe[-trainindex, ]
    list(trainset=trainset,testset=testset)
}

data(iris)
s44 <- splitdf(iris, seed=44)
train <- s44$trainset
test <- s44$testset
t.test(train$Sepal.Length, test$Sepal.Length)

What if we wanted to ensure that the means of Sepal.Length, as well as the other continuous variables in the dataset, do not differ between the two splits?

Again, this is probably something that's already available in an existing package, but I quickly wrote another function to do it. It's called splitdf.randomize(), and it depends on splitdf() from before. You give splitdf.randomize() the data frame you want to split and a character vector naming all the columns you want to keep balanced between the two splits. The function is a wrapper for splitdf(): it randomly makes a split and runs a t-test on each column you specify. If the p-value on any of those t-tests is less than 0.5 (yes, 0.5, not 0.05), the loop restarts and splits the dataset again. (Currently this only works with continuous variables, but if you wanted to extend it to categorical variables, it wouldn't be hard to throw a Fisher's exact test into the loop.)

For each iteration, the function prints out the p-value for the t-test on each of the variable names you supply. As you can see in this example, it took four iterations to ensure that all of the continuous variables were evenly distributed among the training and testing sets. Here it is in action:
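
Here's a sketch of what splitdf.randomize() could look like, reconstructed from the description above (the original may differ in details):

splitdf.randomize <- function(dataframe, ttestcolnames) {
    repeat {
        d <- splitdf(dataframe)  # random split, no fixed seed
        # t-test each named column between the training and testing sets
        pvals <- sapply(ttestcolnames, function(col)
            t.test(d$trainset[[col]], d$testset[[col]])$p.value)
        print(round(pvals, 3))
        # accept the split only if every p-value is at least 0.5
        if (all(pvals >= 0.5)) return(d)
    }
}

# Example: keep all four continuous iris variables balanced between splits
data(iris)
s <- splitdf.randomize(iris, names(iris)[1:4])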

Thursday, February 24, 2011

Split a Data Frame into Testing and Training Sets in R

I recently analyzed some data trying to find a model that would explain body fat distribution as predicted by several blood biomarkers. I had more predictors than samples (p>n), and I didn't have a clue which variables, interactions, or quadratic terms made biological sense to put into a model.

I then turned to a few data mining procedures that I learned about during grad school but never really used (LASSO, Random Forest, support vector machines, etc.). So far, Random Forest is working unbelievably well. The bootstrapping and aggregation ("bagging," i.e. the random component of Random Forest) avoids overfitting so well that I'm able to explain about 80% of the variation in an unseen sample using a model derived from only 30 training samples. (This paper offers the best explanation of Random Forest I've come across.)

While doing this I needed to write an R function to split up a dataset into training and testing sets so I could train models on one half and test them on unseen data. I'm sure a function already exists to do something similar, but it was trivial enough to write a function to do it myself.

This function takes a data frame and returns two dataframes (as a list), one called trainset, one called testset.

splitdf <- function(dataframe, seed=NULL) {
    if (!is.null(seed)) set.seed(seed)  # optional seed for reproducibility
    index <- 1:nrow(dataframe)
    # sample half the row indices for the training set
    trainindex <- sample(index, trunc(length(index)/2))
    trainset <- dataframe[trainindex, ]
    testset <- dataframe[-trainindex, ]  # everything else is the test set
    list(trainset=trainset, testset=testset)
}

In R, you can generally fit a model with something like this, where method() stands in for a modeling function such as lm() or randomForest():

mymodel <- method(y~x, data=mydata)
...and then predict the outcome for new data using the generic predict function:

predvals <- predict(mymodel, newdataframe)

Here's some R code that uses the built-in iris data, splits the dataset into training and testing sets, and develops a model to predict sepal length from every other variable in the dataset using Random Forest:
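
Something like the following (a sketch of the workflow, assuming the randomForest package; the seed is arbitrary):

library(randomForest)

data(iris)
s <- splitdf(iris, seed = 808)  # arbitrary seed for reproducibility
training <- s$trainset
testing  <- s$testset

# Predict Sepal.Length from every other variable
fit <- randomForest(Sepal.Length ~ ., data = training, ntree = 500)

# How well does the model do on the unseen half?
predvals <- predict(fit, testing)
cor(predvals, testing$Sepal.Length)^2  # squared correlation on the test set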

*Edit 2011-02-25* Thanks for all the comments. Clearly the split() function does something very similar to this, and the createDataPartition() function in the caret package does this too.

Friday, April 23, 2010

Top 10 Algorithms in Data Mining

The authors here invited ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners to each nominate up to 10 best-known algorithms in data mining, including the algorithm name, a justification for the nomination, and a representative publication reference. The list was then voted on by other IEEE and ACM award winners to narrow it down to a top 10. These algorithms are used for association analysis, classification, clustering, statistical learning, and much more. You can read the paper here.

Here are the winners:
  1. C4.5
  2. The k-Means algorithm
  3. Support Vector Machines
  4. The Apriori algorithm
  5. Expectation-Maximization
  6. PageRank
  7. AdaBoost
  8. k-Nearest Neighbor Classification
  9. Naive Bayes
  10. CART (Classification and Regression Trees)
The 2007 paper gives a brief overview of what each method is commonly used for and how it works, along with lots of references. It also has a much more detailed description of how the winners were selected than what I've given here.

The exciting thing is that I've seen nearly all of these algorithms used for mining genetic data for complex patterns of genetic and environmental exposures that influence complex disease. See some recent papers at EvoBIO and PSB. Further, lots of these methods are implemented in several R packages.
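
As a quick taste, several of the top 10 are one- or two-liners in R (an illustrative sketch; rpart, e1071, and class are all on CRAN):

data(iris)

km <- kmeans(iris[, 1:4], centers = 3)           # 2. k-means
table(km$cluster, iris$Species)                  # clusters vs. true species

library(rpart)
tree <- rpart(Species ~ ., data = iris)          # 10. CART decision tree

library(e1071)
svmfit <- svm(Species ~ ., data = iris)          # 3. support vector machine
nb     <- naiveBayes(Species ~ ., data = iris)   # 9. naive Bayes

library(class)
knnpred <- knn(iris[, 1:4], iris[, 1:4], iris$Species, k = 5)  # 8. k-NN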

Top 10 Algorithms in Data Mining (PDF)

Tuesday, December 1, 2009

Get Started with Machine Learning in R

A Beautiful WWW put together a great set of resources for getting started with machine learning in R. First, they recommend the previously mentioned free book, The Elements of Statistical Learning. Then there's a link to a list of dozens of machine learning and statistical learning packages for R. Next, you'll need data. Hundreds of free real datasets are available at the UCI Machine Learning Repository. Each dataset, such as this breast cancer dataset from Wisconsin, has its own page giving a summary, links to publications of major findings, and detailed descriptions of the variables in the data. If you want to simulate genetic data, check out our software, genomeSIMLA, capable of simulating gene-gene interactions in case-control and family-based GWAS-sized datasets with realistic patterns of linkage disequilibrium. If you're interested, check out the genomeSIMLA paper. Finally, if time is not an issue, consider taking MIT's OpenCourseWare machine learning course. Alternatively, check out Stanford Engineering professor Andrew Ng's course - all his lectures are available on YouTube. Here's the first lecture.


For more, check out the link below.

A beautiful WWW: Guide to Getting Started in Machine Learning

Wednesday, October 14, 2009

Free Book: Elements of Statistical Learning


The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, one of the best books on data mining and machine learning, is now available free in PDF format.

Download it here or view it online here.

Thursday, September 10, 2009

Machine Learning in R

The Revolutions blog recently posted a link to R code by Joshua Reich with self-contained examples of using machine learning techniques in R, including various clustering methods (k-means, nearest neighbor, and kernel), recursive partitioning (CART), principal components analysis, linear discriminant analysis, and support vector machines. The post also links to some slides that cover the basics of machine learning. It looks like a good place to start learning about ML before hand-rolling your own code.

Be sure to check out one of Will's previous posts on hierarchical clustering in R.

Revolutions: Machine learning in R, in a nutshell

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.