Monday, February 28, 2011

RStudio: New free IDE for R

Just saw the announcement of the availability of Rstudio, a new (free & open source) integrated development environment for R that works on Windows, Mac, and Linux. Judging from the screenshots, it looks like Rstudio supports syntax highlighting for Sweave & easy PDF creation from Sweave code, which is something I haven't seen anywhere else (on Windows at least). It also has all the features you'd expect of any IDE: code syntax highlighting, code completion, an object and history browser, and a tabbed interface for editing R scripts. I've been happily using NppToR for some time now, and although it works well for me, it's no real IDE solution. I'll have to give this one a shot.

RStudio - an integrated development environment (IDE) for R

Thursday, February 24, 2011

Split a Data Frame into Testing and Training Sets in R

I recently analyzed some data trying to find a model that would explain body fat distribution as predicted by several blood biomarkers. I had more predictors than samples (p>n), and I didn't have a clue which variables, interactions, or quadratic terms made biological sense to put into a model.

I then turned to a few data mining procedures that I learned about during grad school but never really used (LASSO, Random Forest, support vector machines, etc). So far, Random Forest is working unbelievably well. The boostrapping and aggregation ("bagging," i.e. the random component of Random Forest) avoids overfitting so well that I'm able to explain about 80% of the variation in an unseen sample using a model derived from only 30 training samples. (This paper offers the best explanation of Random Forest I've come across).

While doing this I needed to write an R function to split up a dataset into training and testing sets so I could train models on one half and test them on unseen data. I'm sure a function already exists to do something similar, but it was trivial enough to write a function to do it myself.

This function takes a data frame and returns two dataframes (as a list), one called trainset, one called testset.

splitdf <- function(dataframe, seed=NULL) {
    if (!is.null(seed)) set.seed(seed)
    index <- 1:nrow(dataframe)
    trainindex <- sample(index, trunc(length(index)/2))
    trainset <- dataframe[trainindex, ]
    testset <- dataframe[-trainindex, ]

In R, you can generally fit a model doing something like this:

mymodel <- method(y~x, data=mydata)
...and then predict the outcome for new data using the generic predict function:

predvals <- predict(mymodel, newdataframe)

Here's some R code that uses the built in iris data, splits the dataset into training and testing sets, and develops a model to predict sepal length based on every other variable in the dataset using Random Forest.

*Edit 2011-02-25* Thanks for all the comments. Clearly the split() function does something very similar to this, and the createDataPartition() function in the caret package does this too.

Tuesday, February 22, 2011

Get all your Questions Answered

When I have a question I usually ask the internet before bugging my neighbor. Yet it seems like Google's search results have become increasingly irrelevant over the last few years, and this is especially true for searching anything related to R (and previously mentioned doesn't really do the job I would expect it to do either).

The last few years has seen the development of several community-powered Q&A websites, and I'm not talking about Yahoo Answers. Here are a few that come to mind that I've used and found extremely helpful.

Biostar ( - a Q&A site for bioinformatics. The site's focus is bioinformatics, computational genomics and biological data analysis. A few of my favorite threads from this site are one on mapping SNPs to pathways, and another on mapping SNPs to genes using tools like the UCSC public MySQL server.

CrossValidated ( - a Q&A site for for statisticians, data miners, and anyone else doing data analysis. This one's relatively new but already has many very talented and extremely helpful users. Last week I asked a question about R², about the difference between variance explained and variation explained, and how that related to Random Forests. The question was answered merely a few hours later.

Finally, there's Quora ( Quora's a little different from the others, and you can ask just about anything you want here. Quora's also still young, but seems to have lots of science/tech geeks like us using it. I recently asked a question, requesting a lay explanation of how Random Forest works, and got a great answer. There was also a good thread about whether current customers found 23andMe to be worth buying.

There's an FAQ on all of these sites that explains how to ask a good question. You might even try answering a few questions yourself and find it rewarding. It's a lot like playing a game, with rather odd goals. You get reputation points and "badges" for answering questions, having your answers voted on, commenting on others' answers, etc. You'll also find that as your own reputation increases by providing good answers to others' questions, your own questions will be answered more quickly. If none of these are quite what you're looking for, check out the stackexchange directory. You'll find Q&A sites that all use the same engine dedicated to topics from photography or cooking to programming and web development.

*Edit 2011-02-22* Thanks to two commenters for pointing this out. There's also a good Q&A community for next generation sequencing, including a forum ( and a StackExchange site (

Thursday, February 17, 2011

R: Given column name in a Data Frame, Get the Index

Had a mental block today trying to figure out how to get the indices of columns in a data frame given their names. Simple task but difficult to search Google for an answer. Thanks to jashapiro, Matt, and Vince for giving me a heads up on the which() function. The which() function returns the indices of TRUE values in a logical vector.

If you're looking at the iris data:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

And you needed to know which column number "Sepal.Width" and "Species" were, use the which() function:

mycols <- c("Sepal.Width","Species")
which(names(iris) %in% mycols)
[1] 2 5
[1] 3


Wednesday, February 16, 2011

Summarize Missing Data for all Variables in a Data Frame in R

Something like this probably already exists in an R package somewhere out there, but I needed a function to summarize how much missing data I have in each variable of a data frame in R. Pass a data frame to this function and for each variable it'll give you the number of missing values, the total N, and the proportion missing.

propmiss <- function(dataframe) lapply(dataframe,function(x) data.frame(nmiss=sum(, n=length(x), propmiss=sum(

Let's try it out.

#simulate some fake data

   var1 var2
1     1   11
2     2   NA
3    NA   NA
4     4   14
5    NA   NA
6     6   16
7     7   17
8     8   NA
9     9   19
10   10   NA

# summarize the missing data
  nmiss  n propmiss
1     2 10      0.2

  nmiss  n propmiss
1     5 10      0.5

Running that function returns a list of data.frame objects. You can access the proportion missing for var1 by running propmiss(fakedata)$var1$propmis.

*Edit 2011-02-23*

Commenter A. Friedman asked for a version of this function that gives you the output as a data frame. The function's a bit uglier because something was being coerced as a list, but this does the trick:

Tuesday, February 15, 2011

Results from Reference Management Poll

A while back I asked you what reference management software you used, and how well you liked it. I received 180 responses, and here's what you said.

Out of the choices on the poll, most of you used Mendeley (30%), followed by EndNote (23%) and Zotero (15%). Out of those of you who picked "other," it was mostly Papers or Qiqqa. There were even a few brave souls managing references caveman-style, manually.
As for how much you are satisfied with the software you used, there wasn't much variation. If anything, users of EndNote or RefMan ranked their satisfaction slightly lower on average, but my n here is starting to get a little low.

I've been using Mendeley for a few weeks now and like it so far as a replacement for RefMan. The MS Word integration works well, it can use all the EndNote formatting styles, and the online/social features are nice, even though I don't use them very often. Thanks for filling out the poll!

Friday, February 11, 2011

Shellfish for Parallel PCA on GWAS data (Alternative to Eigenstrat)

Recently I tried compiling Eigensoft on my Ubuntu 10.10 Linux system running in Virtualbox and had no success. From comments on this blog post, it looks like the newer Ubuntu distros don't have the libg2c0 and related libraries (which were a part of the gcc3) and gcc4 uses gfortran instead. So it looks like Eigensoft won't be compatible with any of the newer Linux distros, at least without some major tweaking that I'm not prepared to bother with.

Ross Lazarus suggested in a comment to try Shellfish as an alternative. I was able to compile Shellfish without a problem on Ubuntu 10.10 but I haven't had a chance to try it out yet, nor make any comparisons with Eigensoft. The documentation on the website shows that it can directly utilize PLINK ped and map files, so this eliminates the burden of using a tool like PLATO to convert between formats.

Has anyone ever used Shellfish (or anything else besides Eigensoft) for PCA on GWAS or AIMs data?

EDIT 2011-02-14: A tip of the hat to Mike Baldwin for pointing out to me that Eigensoft version 4 is now available on Alkes Price's website. A Google search always puts you at Eigensoft version 3 from the Reich lab software page, which is the old version that doesn't play well with newer Linux distros. I had no problem using Eigensoft 4 on my Ubuntu 10.10 system.

Thursday, February 10, 2011

Extracting values from R summary objects

This builds on a previous post from Stephen.

I was recently running a series of ANOVA analyses, and I used the aov() function because it had a few options that I preferred. Much like lm(), the function returns an object that you typically pass to summary() to view and interpret the output.

It took me a bit of playing to figure out how to extract the information I needed. My aov object is called "fullmod", and here is the summary() output:

> summary(fullmod)

Error: sample
Df Sum Sq Mean Sq
Type 1 1.4535 1.4535

Error: sample:treatment
Df Sum Sq Mean Sq
treatment 1 0.0086021 0.0086021

Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 0.0077 0.00769 0.0325 0.857505
Type 3 0.3159 0.10529 0.4443 0.722043
age 1 0.0351 0.03512 0.1482 0.701367
sex 1 2.2166 2.21661 9.3547 0.003122 **
hybridization 1 0.1125 0.11255 0.4750 0.492923
Residuals 72 17.0605 0.23695
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I assigned the summary output to an object s so I could examine it a bit further.

> s <- summary(fullmod)

Then I peeked under the hood using the str() function.

> str(s)
List of 3
$ Error: sample :List of 1
..$ :Classes ‘anova’ and 'data.frame': 1 obs. of 3 variables:
.. ..$ Df : num 1
.. ..$ Sum Sq : num 1.45
.. ..$ Mean Sq: num 1.45
..- attr(*, "class")= chr [1:2] "summary.aov" "listof"
$ Error: sample:treatment:List of 1
..$ :Classes ‘anova’ and 'data.frame': 1 obs. of 3 variables:
.. ..$ Df : num 1
.. ..$ Sum Sq : num 0.0086
.. ..$ Mean Sq: num 0.0086
..- attr(*, "class")= chr [1:2] "summary.aov" "listof"
$ Error: Within :List of 1
..$ :Classes ‘anova’ and 'data.frame': 6 obs. of 5 variables:
.. ..$ Df : num [1:6] 1 3 1 1 1 72
.. ..$ Sum Sq : num [1:6] 0.0077 0.3159 0.0351 2.2166 0.1125 ...
.. ..$ Mean Sq: num [1:6] 0.0077 0.1053 0.0351 2.2166 0.1125 ...
.. ..$ F value: num [1:6] 0.0325 0.4443 0.1482 9.3547 0.475 ...
.. ..$ Pr(>F) : num [1:6] 0.8575 0.72204 0.70137 0.00312 0.49292 ...
..- attr(*, "class")= chr [1:2] "summary.aov" "listof"
- attr(*, "class")= chr "summary.aovlist"

This tells me that s is a list of 3 objects ("Error: sample", "Error: sample:treatment:", and "Error: within"). Each of these objects is a list of 1 element for (some reason!?) with no name or identifier. For "Error: sample" and "Error: sample:treatment:", there are three single numbers. For "Error: Within", each element contains a list of 6 numbers (one for each coefficient in my model).

So, to access the degrees of freedom ("Df") for the "Error: sample", I do the following:

> s[[1]][[1]][[1]]
[1] 1

Or if I prefer to use labels (for the elements that have them)

> s[["Error: sample"]][[1]][["Df"]]
[1] 1

Just to clarify:

s [["Error: sample"]] [[1]] [["Df"]]
^ ^ ^
1st layer 2nd unnamed layer 3rd layer

It is VERY important to use [[ instead of [ here. In the R syntax, [[ retrieves a SINGLE element of the object/array and [ retrieves the full object as a list.

> s["Error: sample"][1]["Df"]

See: that didn't work.

To retrieve the p-values I'm interested in, I'll have to dig one more layer deeper to pull the p-value for the coefficient I want (in this case, the 4th element).

> s[[3]][[1]][[5]][[4]]
[1] 0.003122076


> s[["Error: Within"]][[1]][["Pr(>F)"]][[4]]
[1] 0.003122076

Using this basic strategy, you should be able to pull any value from a summary object, but remember that using offsets as I have can be dangerous. For this to work consistently across many models, they must all have the same number of coefficients in the model.
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.