Wednesday, August 20, 2014

Do your "data janitor work" like a boss with dplyr

Data “janitor-work”

The New York Times recently ran a piece on wrangling and cleaning data.
Whether you call it “janitor-work,” wrangling/munging, cleaning/cleansing/scrubbing, tidying, or something else, the article above is worth a read (even though it implicitly denigrates the important work that your housekeeping staff does). It’s one of the few “Big Data” pieces that truly appreciates what we spend so much time doing every day as data scientists/janitors/sherpas/cowboys/etc. The article was chock-full of quotable tidbits on the nature of data munging as part of the analytical workflow:
  • “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up…”
  • “Data scientists … spend 50-80% of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
  • “But if the value comes from combining different data sets, so does the headache… Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.”
  • “But practically, because of the diversity of data, you spend a lot of your time being a data janitor, before you can get to the cool, sexy things that got you into the field in the first place.”
As data analysis experts we justify our existence by (accurately) evangelizing that the bottleneck in the discovery process is usually not data generation, it’s data analysis. I clarify that point further with my collaborators: data analysis is usually the easy part — if I give you properly formatted, tidy, and rigorously quality-controlled data, hitting the analysis “button” is usually much easier than the work that went into cleaning, QC’ing, and preparing the data in the first place.
To that effect, I’d like to introduce you to a tool that recently made its way into my data analysis toolbox.

dplyr

Unless you’ve had your head buried in the sand of the data analysis desert for the last few years, you’ve definitely encountered a number of tools in the “Hadleyverse.” These are R packages created by Hadley Wickham and friends that make things like data visualization (ggplot2), data management and split-apply-combine analysis (plyr, reshape2), and R package creation and documentation (devtools, roxygen2) much easier.
The dplyr package introduces a few simple functions, and integrates functionality from the magrittr package that will fundamentally change the way you write R code.
I’m not going to give you a full tutorial on dplyr because the vignette should take you only 15 minutes to go through, and will do a much better job of introducing this “grammar of data manipulation” than I could do here. But I’ll try to hit a few highlights.

dplyr verbs

First, dplyr works on data frames, and introduces a few “verbs” that allow you to do some basic manipulation (a short sketch of each follows this list):
  • filter() filters rows from the data frame by some criterion
  • arrange() arranges rows ascending or descending based on the value(s) of one or more columns
  • select() allows you to select one or more columns
  • mutate() allows you to add new columns to a data frame that are transformations of other columns
  • group_by() and summarize() are usually used together, and allow you to compute values grouped by some other variable, e.g., the mean, SD, and count of y computed separately for each level of a factor variable group.
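Here’s a minimal sketch of each verb using the built-in mtcars dataset (the particular columns and cutoffs are arbitrary, chosen just for illustration):
library(dplyr)

filter(mtcars, cyl == 6)             # keep only rows for 6-cylinder cars
arrange(mtcars, desc(mpg))           # sort rows by descending fuel economy
select(mtcars, mpg, cyl, hp)         # keep only these three columns
mutate(mtcars, kml = mpg * 0.4251)   # add a km-per-liter column computed from mpg
summarize(group_by(mtcars, cyl),     # mean, SD, and count of mpg per cylinder class
          avg = mean(mpg), sd = sd(mpg), n = n())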
Individually, none of these add much to your R arsenal that wasn’t already baked into the language in some form or another. But the real power comes from chaining these commands together, with the %>% operator.

The %>% operator: “then”

This dataset and example were taken directly from, and are described more verbosely in, the vignette. The hflights dataset has information about more than 200,000 flights that departed Houston in 2011. Let’s say we want to do the following:
  1. Use the hflights dataset
  2. group_by the Year, Month, and DayofMonth
  3. select out only the date variables (Year through DayofMonth), the arrival delay, and the departure delay
  4. Use summarize to calculate the mean of the arrival and departure delays
  5. filter the resulting dataset where the mean arrival delay or the mean departure delay is more than 30 minutes.
Here’s an example of how you might have done this previously:
filter(
  summarise(
    select(
      group_by(hflights, Year, Month, DayofMonth),
      Year:DayofMonth, ArrDelay, DepDelay
    ),
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)
Notice that the order in which we write the code in this example is inside-out: we describe our problem as use hflights, then group_by, then select, then summarize, then filter, but traditionally in R we write the code inside-out by nesting functions four deep:
filter(summarize(select(group_by(hflights, ...), ...), ...), ...)
To fix this, dplyr provides the %>% operator (pronounced “then”). x %>% f(y) turns into f(x, y), so you can rewrite a sequence of operations to read left-to-right, top-to-bottom:
hflights %>%
  group_by(Year, Month, DayofMonth) %>%
  select(Year:DayofMonth, ArrDelay, DepDelay) %>%
  summarise(
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Writing the code this way actually follows the order we think about the problem (use hflights, then group_by, then select, then summarize, then filter it).
You aren’t limited to using %>% with dplyr functions; you can use it with anything. E.g., instead of head(iris, 10), we could write iris %>% head(10) to get the first ten lines of the built-in iris dataset. Furthermore, since the first argument to ggplot() is a data frame, we can munge a dataset and then pipe the whole thing into a plotting function. Here’s a simple example where we take the iris dataset, then group it by Species, then summarize it by calculating the mean of the Sepal.Length, then use ggplot2 to make a simple bar plot.
library(dplyr)
library(ggplot2)
iris %>%
  group_by(Species) %>%                                # group rows by species
  summarize(meanSepLength = mean(Sepal.Length)) %>%    # mean sepal length per species
  ggplot(aes(Species, meanSepLength)) +                # pipe the summary straight into ggplot
  geom_bar(stat = "identity")                          # bar heights are the computed means
Once you start using %>% you’ll wonder why it isn’t a core part of the R language itself rather than add-on functionality provided by a package. It will fundamentally change the way you write R code, making it feel more natural and making your code more readable. There’s also a lot more dplyr can do with databases that I didn’t even mention, and if you’re interested, you should see the other vignettes on the CRAN package page.
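As a quick taste of the database support, here’s a minimal sketch using a local SQLite database (this assumes you have the RSQLite and hflights packages installed; the file name is made up for illustration):
library(dplyr)
library(hflights)

# Create a local SQLite database (hypothetical file name) and copy the
# hflights data frame into it as a table
my_db <- src_sqlite("hflights.sqlite3", create = TRUE)
flights_db <- copy_to(my_db, hflights, temporary = FALSE)

# The same verbs work on the database table: dplyr translates the chain
# to SQL and only pulls results into R when you collect() them
flights_db %>%
  group_by(Year, Month, DayofMonth) %>%
  summarise(arr = mean(ArrDelay), n = n()) %>%   # SQL's AVG ignores NULLs
  collect()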
As a side note, I’ve linked to it several times here, but you should really check out Hadley’s Tidy Data paper and the tidyr package, vignette, and blog post.
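For a flavor of tidyr, here’s a minimal sketch of its gather() function, which reshapes a wide table into the long, tidy form (the data and column names are made up for illustration):
library(tidyr)

# A made-up "wide" dataset: one column per treatment group
wide <- data.frame(id = 1:2, ctrl = c(4.2, 5.1), trt = c(6.3, 7.0))

# gather() stacks the ctrl and trt columns into key/value pairs,
# producing one row per (id, group) combination
gather(wide, group, measurement, ctrl:trt)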
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.