Data “janitor-work”
The New York Times recently ran a piece on wrangling and cleaning data:
Whether you call it “janitor-work,” wrangling/munging, cleaning/cleansing/scrubbing, tidying, or something else, the article above is worth a read (even though it implicitly denigrates the important work that your housekeeping staff does). It’s one of the few “Big Data” pieces that truly appreciates what we spend so much time doing every day as data scientists/janitors/sherpas/cowboys/etc. The article was chock-full of quotable tidbits on the nature of data munging as part of the analytical workflow:
“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up… Data scientists … spend 50-80% of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”

“But if the value comes from combining different data sets, so does the headache… Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.”

“But practically, because of the diversity of data, you spend a lot of your time being a data janitor, before you can get to the cool, sexy things that got you into the field in the first place.”
As data analysis experts, we justify our existence by (accurately) evangelizing that the bottleneck in the discovery process is usually not data generation but data analysis. I clarify that point further with my collaborators: data analysis is usually the easy part. If I give you properly formatted, tidy, and rigorously quality-controlled data, hitting the analysis “button” is usually much easier than the work that went into cleaning, QC’ing, and preparing the data in the first place.
To that effect, I’d like to introduce you to a tool that recently made its way into my data analysis toolbox.
dplyr
Unless you’ve had your head buried in the sand of the data analysis desert for the last few years, you’ve definitely encountered a number of tools in the “Hadleyverse.” These are R packages created by Hadley Wickham and friends that make things like data visualization (ggplot2), data management and split-apply-combine analysis (plyr, reshape2), and R package creation and documentation (devtools, roxygen2) much easier.
The dplyr package introduces a few simple functions, and integrates functionality from the magrittr package that will fundamentally change the way you write R code.
I’m not going to give you a full tutorial on dplyr because the vignette should take you about 15 minutes to go through, and it will do a much better job of introducing this “grammar of data manipulation” than I could do here. But I’ll try to hit a few highlights.
dplyr verbs
First, dplyr works on data frames, and introduces a few “verbs” that allow you to do some basic manipulation (a quick sketch of each follows the list):

- filter() filters rows from the data frame by some criterion
- arrange() arranges rows ascending or descending based on the value(s) of one or more columns
- select() allows you to select one or more columns
- mutate() allows you to add new columns to a data frame that are transformations of other columns
- group_by() and summarize() are usually used together, and allow you to compute values grouped by some other variable, e.g., the mean, SD, and count of all the values of $y separately for each level of the factor variable $group
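Here’s a minimal sketch of each verb in action on the built-in mtcars dataset (the dataset choice and the km-per-liter conversion are just illustrative):

library(dplyr)

filter(mtcars, cyl == 4)            # keep only rows for 4-cylinder cars
arrange(mtcars, desc(mpg))          # sort rows by mpg, highest first
select(mtcars, mpg, cyl, wt)        # keep only these three columns
mutate(mtcars, kml = mpg * 0.425)   # new column: miles/gallon converted to km/liter

# mean, SD, and count of mpg, separately for each level of cyl
summarize(group_by(mtcars, cyl),
  avg = mean(mpg), sd = sd(mpg), n = n())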
Individually, none of these add much to your R arsenal that wasn’t already baked into the language in some form or another. But the real power comes from chaining these commands together with the %>% operator.
The %>% operator: “then”
This dataset and example were taken directly from, and are described more verbosely in, the vignette. The hflights dataset has information about more than 200,000 flights that departed Houston in 2011. Let’s say we want to do the following:

- Use the hflights dataset
- group_by the Year, Month, and Day
- select out only the Day, the arrival delay, and the departure delay variables
- Use summarize to calculate the mean of the arrival and departure delays
- filter the resulting dataset where the arrival delay or the departure delay is more than 30 minutes
Here’s an example of how you might have done this previously:
# the nested, inside-out version
library(dplyr)
library(hflights)  # provides the hflights dataset

filter(
  summarise(
    select(
      group_by(hflights, Year, Month, DayofMonth),
      Year:DayofMonth, ArrDelay, DepDelay
    ),
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)
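Another way you might have done it previously (the vignette shows this variant too) is assigning each intermediate step to a throwaway variable, which avoids the nesting but clutters your workspace:

a1 <- group_by(hflights, Year, Month, DayofMonth)
a2 <- select(a1, Year:DayofMonth, ArrDelay, DepDelay)
a3 <- summarise(a2,
  arr = mean(ArrDelay, na.rm = TRUE),
  dep = mean(DepDelay, na.rm = TRUE))
a4 <- filter(a3, arr > 30 | dep > 30)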
Notice that the order we write the code in the nested example is inside-out. We describe our problem as: use hflights, then group_by, then select, then summarize, then filter. But traditionally in R we write the code inside-out by nesting functions four deep:

filter(summarize(select(group_by(hflights, ...), ...), ...), ...)
To fix this, dplyr provides the %>% operator (pronounced “then”). x %>% f(y) turns into f(x, y), so you can use it to rewrite multiple operations in a way that reads left-to-right, top-to-bottom:

# same result as the nested version above, written in the order you think about it
hflights %>%
  group_by(Year, Month, DayofMonth) %>%
  select(Year:DayofMonth, ArrDelay, DepDelay) %>%
  summarise(
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Writing the code this way actually follows the order in which we think about the problem (use hflights, then group_by, then select, then summarize, then filter it).
You aren’t limited to using %>% with dplyr functions only; you can use it with anything. E.g., instead of head(iris, 10), we could write iris %>% head(10) to get the first ten lines of the built-in iris dataset. Furthermore, since the input to ggplot is always a data.frame, we can munge around a dataset and then pipe the whole thing into a plotting function. Here’s a simple example where we take the iris dataset, then group it by Species, then summarize it by calculating the mean of the Sepal.Length, then use ggplot2 to make a simple bar plot:

library(dplyr)
library(ggplot2)

# mean sepal length by species, piped straight into a bar plot
iris %>%
  group_by(Species) %>%
  summarize(meanSepLength = mean(Sepal.Length)) %>%
  ggplot(aes(Species, meanSepLength)) +
    geom_bar(stat = "identity")
Once you start using %>% you’ll wonder why it isn’t a core part of the R language itself rather than add-on functionality provided by a package. It will fundamentally change the way you write R code, making it feel more natural and making your code more readable. There’s a lot more that dplyr can do with databases that I didn’t even mention; the sketch below gives a taste, and if you’re interested, you should see the other vignettes on the CRAN package page.
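Here’s a minimal sketch that runs the earlier iris summary against an SQLite database instead of an in-memory data frame. This uses the DBI-connection interface and assumes the DBI, RSQLite, and dbplyr packages are installed; the details differ from the older src_sqlite()-style interface the package’s own vignettes describe:

library(dplyr)

# an in-memory SQLite database stands in for a real database server
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, iris)            # copy the iris data frame into the database
iris_db <- tbl(con, "iris")   # a remote table; no data is pulled into R yet

# the same verbs work; dplyr translates them to SQL behind the scenes
iris_db %>%
  group_by(Species) %>%
  summarize(meanSepLength = mean(Sepal.Length, na.rm = TRUE)) %>%
  collect()                   # collect() brings the result back into R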
As a side note, I’ve linked to it several times here, but you should really check out Hadley’s Tidy Data paper and the tidyr package, vignette, and blog post.
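For a quick feel for what tidyr does, here’s a minimal sketch using tidyr’s gather() on a made-up “messy” dataset (the column names are invented for illustration):

library(tidyr)

# messy: one column per treatment, i.e., values spread across the header
messy <- data.frame(id = 1:2, trt_a = c(5, 7), trt_b = c(2, 9))

# tidy: one row per (id, treatment) observation
gather(messy, treatment, response, trt_a:trt_b)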
dplyr package: http://cran.r-project.org/web/packages/dplyr/index.html
dplyr on SO: http://stackoverflow.com/questions/tagged/dplyr