Monday, December 8, 2014

Importing Illumina BeadArray data into R

A colleague needed some help getting Illumina BeadArray gene expression data loaded into R for data analysis with limma. Hopefully whoever ran your arrays can export the data as text files formatted as described in the code below. If so, you can import those text files directly using the beadarray package. This way you avoid getting bogged down with GenomeStudio, which requires a license (ugh) and only runs on Windows (ughhh). Here's how I do it.

Thursday, November 20, 2014

RNA-seq Data Analysis Course Materials

Last week I ran a one-day workshop on RNA-seq data analysis in the UVA Health Sciences Library. I set up an AWS public EC2 image with all the necessary software installed. Participants logged into AWS, launched the image, and we kicked off the morning session with an introduction to the Unix shell (taught by Jessica Bonnie, a biostatistician here in our genomics group, and a fellow Software Carpentry instructor). I followed with a walkthrough of using FastQC for quality assessment, FASTX toolkit for trimming, TopHat for alignment, and featureCounts to summarize gene expression read counts at the gene level. I started the afternoon session started with an introduction to R, followed by a tutorial on analyzing the count data we generated in the first part using DESeq2 in R.

All of the rendered course material is available here. The source code used to generate this material is all on available on GitHub (go read my post on collaborative lesson development, if you haven't already). Much of the introductory Unix lesson material was adapted from the Software Carpentry and Data Carpentry projects.

I wrote a more thorough blog post about how the course went here on the Software Carpentry blog.

I also compiled a PDF of all the course materials, available on Figshare:

Tuesday, October 14, 2014

Operate on the body of a file but not the header

Sometimes you need to run some UNIX command on a file but only want to operate on the body of the file, not the header. Create a file called body somewhere in your $PATH, make it executable, and add this to it:
IFS= read -r header
printf '%s\n' "$header"
eval $@
Now, when you need to run something but ignore the header, use the body command first. For example, we can create a simple data set with a header row and some numbers:
$ echo -e "header\n1\n5\n4\n7\n3"
We can pipe the whole thing to sort:
$ echo -e "header\n1\n5\n4\n7\n3" | sort
Oops, we don’t want the header to be included in the sort. Let’s use the body command to operate only on the body, skipping the header:
$ echo -e "header\n1\n5\n4\n7\n3" | body sort
Sure, there are other ways to solve the problem with sort, but body will solve many more problems. If you have multiple header rows, just call body multiple times.
Inspired by this post on Stack Exchange.

Tuesday, September 16, 2014

R package to convert statistical analysis objects to tidy data frames

I talked a little bit about tidy data my recent post about dplyr, but you should really go check out Hadley’s paper on the subject.
R expects inputs to data analysis procedures to be in a tidy format, but the model output objects that you get back aren’t always tidy. The reshape2, tidyr, and dplyr are meant to take data frames, munge them around, and return a data frame. David Robinson's broom package bridges this gap by taking un-tidy output from model objects, which are not data frames, and returning them in a tidy data frame format.
(From the documentation): if you performed a linear model on the built-in mtcars dataset and view the object directly, this is what you’d see:
lmfit = lm(mpg ~ wt, mtcars)
lm(formula = mpg ~ wt, data = mtcars)

(Intercept)           wt  
     37.285       -5.344
lm(formula = mpg ~ wt, data = mtcars)

   Min     1Q Median     3Q    Max 
-4.543 -2.365 -0.125  1.410  6.873 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.285      1.878   19.86  < 2e-16 ***
wt            -5.344      0.559   -9.56  1.3e-10 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10
If you’re just trying to read it this is good enough, but if you’re doing other follow-up analysis or visualization, you end up hacking around with str() and pulling out coefficients using indices, and everything gets ugly quick.
But the tidy function in the broom package run on the fit object probably gives you what you were looking for in a tidy data frame:
         term estimate stderror statistic   p.value
1 (Intercept)   37.285   1.8776    19.858 8.242e-19
2          wt   -5.344   0.5591    -9.559 1.294e-10
The tidy() function also works on other types of model objects, like those produced by glm() and nls(), as well as popular built-in hypothesis testing tools like t.test(), cor.test(), or wilcox.test().
View the README on the GitHub page, or install the package and run the vignette to see more examples and conventions.

Thursday, September 11, 2014

UVA / Charlottesville R Meetup

TL;DR? We started an R Users group, awesome community, huge turnout at first meeting, lots of potential.


I've sat through many hours of meetings where faculty lament the fact that their trainees (and the faculty themselves!) are woefully ill-prepared for our brave new world of computing- and data-intensive science. We've started to address this by running annual Software Carpentry bootcamps (March 2013, and March 2014). To make things more sustainable, we're running our own Software Carpentry instructor training here later this month, where we'll train scientists how to teach other scientists basic computing skills like using UNIX, programming in Python or R, version control, automation, and testing. I went through this training course online a few months ago, and it was an excellent introduction to pedagogy and the psychology of learning (let's face it, most research professors were never taught how to teach; it's a skill you learn, not one you inherit with the initials behind your name).

Something that constantly comes up in these conversations is how to promote continued learning and practice after the short bootcamp is over. Students get a whirlwind tour of scientific computing skills over two days, but there's generally very little follow-up that's necessary to encourage continued practice and learning.

At the same time we've got a wide variety of scientists spanning all disciplines including social sciences, humanities, medicine, physics, chemistry, math, and engineering that are doing scientific computing and data analysis on a daily basis who could really benefit from learning from one another.

These things motivated us to start a local R Users Group. So far we have 118 people registered on, and this week we had an excellent turnout at our first meeting, with over 70 people who RSVP'd.

At this first meetup we kicked things off with an introduction to the group, why we started it, and our goals. I then launched into a quick demo of some of the finer features in the dplyr package for effortless advanced data manipulation and analysis. The meetup group has a GitHub repository where all the code from our meetups will be stored. Finally, we concluded with a discussion of topics the group would like to see presented in the future: ggplot2, R package creation, reproducible research, dynamic documentation, and web scraping were a few of the things mentioned. We collectively decided that either talks could be either research talks highlighting how these things were used in an actual research project, or they could be demo/tutorial in nature, like the dplyr talk I gave.

The rich discussion we had at the end of this session really highlighted the potential this community has. In addition to the diversity and relative gender-balance we saw in our first meetup's participants, we had participants from all over UVA as well as representation from local industry and non-profit organizations.

I, for one, am tremendously excited about this new community, and will post the occasional update from the group here on this blog.

Wednesday, September 10, 2014

GNU datamash

GNU datamash is a command-line utility that offers simple calculations (e.g. count, sum, min, max, mean, stdev, string coalescing) as well as a rich set of statistical functions, to quickly assess information in textual input files or from a UNIX pipe. Here are a few examples:
E.g., let’s use the seq command to generate some data, and use datamash sum to add up all the values in the first column:
$ seq 5
$ seq 5 | datamash sum 1
What else? Let’s calculate the mean, 1st quartile, median, 3rd quarile, IQR, sample-standard-deviation, and p-value of Jarque-Bera test for normal distribution, using some data in file.txt:
$ cat file.txt | datamash -H mean 1 q1 1 median 1 q3 1 iqr 1 sstdev 1 jarque 1
mean(x)   q1(x)  median(x)  q3(x)   iqr(x)  sstdev(x)  jarque(x)
45.32     23     37         61.5    38.5    30.4487    8.0113e-09
Go take a look at more examples, specifically the examples using datamash to parse a GTF file. Datamash can very quickly do things like find the number of isoforms per gene by grouping (collapsing) over gene IDs and counting the number of transcripts, find the number of genes with more than 5 isoforms, find genes transcribed from multiple chromosomes, examine variability in exon counts per gene, etc.

Wednesday, August 20, 2014

Do your "data janitor work" like a boss with dplyr

Data “janitor-work”

The New York Times recently ran a piece on wrangling and cleaning data:
Whether you call it “janitor-work,” wrangling/munging, cleaning/cleansing/scrubbing, tidying, or something else, the article above is worth a read (even though it implicitly denigrates the important work that your housekeeping staff does). It’s one of the few “Big Data” pieces that truly appreciates what we spend so much time doing every day as data scientists/janitors/sherpas/cowboys/etc. The article was chock-full of quotable tidbits on the nature of data munging as part of the analytical workflow:
It’s an absolute myth that you can send an algorithm over raw data and have insights pop up…
Data scientists … spend 50-80% of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
But if the value comes from combining different data sets, so does the headache… Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.
But practically, because of the diversity of data, you spend a lot of your time being a data janitor, before you can get to the cool, sexy things that got you into the field in the first place.
As data analysis experts we justify our existence by (accurately) evangelizing that the bottleneck in the discovery process is usually not data generation, it’s data analysis. I clarify that point further with my collaborators: data analysis is usually the easy part — if I give you properly formatted, tidy, and rigorously quality-controlled data, hitting the analysis “button” is usually much easier than the work that went into cleaning, QC’ing, and preparing the data in the first place.
To that effect, I’d like to introduce you to a tool that recently made its way into my data analysis toolbox.


Unless you’ve had your head buried in the sand of the data analysis desert for the last few years, you’ve definitely encountered a number of tools in the “Hadleyverse.” These are R packages created by Hadley Wickham and friends that make things like data visualization (ggplot2), data management and split-apply-combine analysis (plyr, reshape2), and R package creation and documentation (devtools, roxygen2) much easier.
The dplyr package introduces a few simple functions, and integrates functionality from the magrittr package that will fundamentally change the way you write R code.
I’m not going to give you a full tutorial on dplyr because the vignette should take you 15 minutes to go through, and will do a much better job of introducing this “grammar of data manipulation” than I could do here. But I’ll try to hit a few higlights.

dplyr verbs

First, dplyr works on data frames, and introduces a few “verbs” that allow you to do some basic manipulation:
  • filter() filters rows from the data frame by some criterion
  • arrange() arranges rows ascending or descending based on the value(s) of one or more columns
  • select() allows you to select one or more columns
  • mutate() allows you to add new columns to a data frame that are transformations of other columns
  • group_by() and summarize() are usually used together, and allow you to compute values grouped by some other variable, e.g., the mean, SD, and count of all the values of $y separately for each level of factor variable $group.
Individually, none of these add much to your R arsenal that wasn’t already baked into the language in some form or another. But the real power comes from chaining these commands together, with the %>% operator.

The %>% operator: “then”

This dataset and example was taken directly from, and is described more verbosely in the vignette. The hflights dataset has information about more than 200,000 flights that departed Houston in 2011. Let’s say we want to do the following:
  1. Use the hflights dataset
  2. group_by the Year, Month, and Day
  3. select out only the Day, the arrival delay, and the departure delay variables
  4. Use summarize to calculate the mean of the arrival and departure delays
  5. filter the resulting dataset where the arrival delay or the departure delay is more than 30 minutes.
Here’s an example of how you might have done this previously:
      group_by(hflights, Year, Month, DayofMonth),
      Year:DayofMonth, ArrDelay, DepDelay
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  arr > 30 | dep > 30
Notice that the order that we write the code in this example is inside out - we describe our problem as: use hflights, then group_by, then select, then summarize, then filter, but traditionally in R we write the code inside-out by nesting functions 4-deep:
filter(summarize(select(group_by(hflights, ...), ...), ...), ...)
To fix this, dplyr provides the %>% operator (pronounced “then”). x %>% f(y) turns into f(x, y) so you can use it to rewrite multiple operations so you can read from left-to-right, top-to-bottom:
hflights %>%
  group_by(Year, Month, DayofMonth) %>%
  select(Year:DayofMonth, ArrDelay, DepDelay) %>%
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Writing the code this way actually follows the order we think about the problem (use hflights, then group_by, then select, then summarize, then filter it).
You aren’t limited to using %>% to only dplyr functions. You can use it with anything. E.g., instead of head(iris, 10), we could write iris %>% head(10) to get the first ten lines of the built-in iris dataset. Furthermore, since the input to ggplot is always a data.frame, we can munge around a dataset then pipe the whole thing into a plotting function. Here’s a simple example where we take the iris dataset, then group it by Species, then summarize it by calculating the mean of the Sepal.Length, then use ggplot2 to make a simple bar plot.
iris %>%
  group_by(Species) %>%
  summarize(meanSepLength=mean(Sepal.Length)) %>%
  ggplot(aes(Species, meanSepLength)) + geom_bar(stat="identity")
Once you start using %>% you’ll wonder to yourself why this isn’t a core part of the R language itself rather than add-on functionality provided by a package. It will fundamentally change the way you write R code, making it feel more natural and making your code more readable. There's a lot more dplyr can do with databases that I didn't even mention, and if you're interested, you should see the other vignettes on the CRAN package page.
As a side note, I’ve linked to it several times here, but you should really check out Hadley’s Tidy Data paper and the tidyr package, vignette, and blog post.

Monday, July 7, 2014

Introduction to R for Life Scientists: Course Materials

Last week I taught a three-hour introduction to R workshop for life scientists at UVA's Health Sciences Library.

I broke the workshop into three sections:

In the first half hour or so I presented slides giving an overview of R and why R is so awesome. During this session I emphasized reproducible research and gave a demonstration of using knitr + rmarkdown in RStudio to produce a PDF that can easily be recompiled when data updates.

In the second (longest) section, participants had their laptops out with RStudio open coding along with me as I gave an introduction to R data types, functions, getting help, data frames, subsetting, and plotting. Participants were challenged with an exercise requiring them to create a scatter plot using a subset of the built-in mtcars dataset.

We concluded with an analysis of RNA-seq data using the DESeq2 package. We started with a count matrix and a metadata file (the modENCODE pasilla knockout data packaged with DESeq2), imported the data into a DESeqDataSet object, ran the DESeq pipeline, extracted results, and did some basic visualization (MA-plots, PCA, volcano plots, etc). A future day-long course will cover RNA-seq in more detail (intro UNIX, alignment, & quantitation in the morning; intro R, QC, and differential expression analysis in the afternoon).

I wrote the course materials using knitr, rendered using Jekyll, hosted as a GitHub project page. The rendered course materials can be found at the link below, and the source is on GitHub.

Course Materials: Introduction to R for Life Scientists


Cheat Sheet:

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.