## Tuesday, September 16, 2014

### R package to convert statistical analysis objects to tidy data frames

I talked a little bit about tidy data in my recent post about dplyr, but you should really go check out Hadley’s paper on the subject.
R expects inputs to data analysis procedures to be in a tidy format, but the model output objects that you get back aren’t always tidy. The reshape2, tidyr, and dplyr packages are meant to take data frames, munge them around, and return a data frame. David Robinson's broom package bridges this gap by taking un-tidy output from model objects, which are not data frames, and returning it as a tidy data frame.
(From the documentation): if you fit a linear model on the built-in `mtcars` dataset and view the object directly, this is what you’d see:
``````
lmfit = lm(mpg ~ wt, mtcars)
lmfit
``````
``````
Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt
     37.285       -5.344
``````
``````
summary(lmfit)
``````
``````
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max
-4.543 -2.365 -0.125  1.410  6.873

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   37.285      1.878   19.86  < 2e-16 ***
wt            -5.344      0.559   -9.56  1.3e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared:  0.753,  Adjusted R-squared:  0.745
F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10
``````
If you’re just trying to read it, this is good enough, but if you’re doing follow-up analysis or visualization, you end up hacking around with `str()` and pulling out coefficients by index, and everything gets ugly quickly.
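To make that pain concrete, here is a rough sketch of the by-hand approach using only base R. The variable names (`coefs`, `slope`, `manual`) are made up for illustration, and this is just one of several ways people do it:

```r
# Fit the same model as above
lmfit <- lm(mpg ~ wt, data = mtcars)

# coef(summary()) returns a numeric matrix; the term names are hidden in the row names
coefs <- coef(summary(lmfit))

# Pulling out single values means remembering names or positions
slope   <- coefs["wt", "Estimate"]
slope_p <- coefs[2, 4]  # row 2 = wt, column 4 = Pr(>|t|)

# Converting to a data frame by hand: copy the row names into a real column
manual <- as.data.frame(coefs)
manual$term <- rownames(coefs)
rownames(manual) <- NULL
```

It works, but the column positions and names differ from one model type to the next, so this kind of code has to be rewritten for each new kind of fit.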
But running the `tidy()` function from the broom package on the fit object gives you what you were probably looking for, in a tidy data frame:
``````
library(broom)
tidy(lmfit)
``````
``````
         term estimate stderror statistic   p.value
1 (Intercept)   37.285   1.8776    19.858 8.242e-19
2          wt   -5.344   0.5591    -9.559 1.294e-10
``````
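Because the result is an ordinary data frame, follow-up steps become plain data frame operations. Here is a minimal sketch, assuming broom is installed and that the `term`, `estimate`, and `p.value` columns shown above are present. Other column names, such as the one for the standard error, have varied between broom versions, so the sketch avoids them. The names `td`, `slopes`, and `wt_p` are just for illustration:

```r
library(broom)

lmfit <- lm(mpg ~ wt, data = mtcars)
td <- tidy(lmfit)

# Keep only the non-intercept terms with ordinary subsetting
slopes <- td[td$term != "(Intercept)", ]

# Look up a value by term and column name instead of by matrix position
wt_p <- td$p.value[td$term == "wt"]
```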
The `tidy()` function also works on other types of model objects, like those produced by `glm()` and `nls()`, as well as popular built-in hypothesis testing tools like `t.test()`, `cor.test()`, or `wilcox.test()`.
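As a quick sketch of what that looks like, assuming broom is installed, here is the same call on a logistic regression and a two-sample t-test. The choice of `am` (transmission type) as the outcome and grouping variable is arbitrary and only for illustration, and the exact columns returned depend on the model type and the broom version:

```r
library(broom)

# Logistic regression: transmission type (am) as a function of weight
glmfit <- glm(am ~ wt, data = mtcars, family = binomial)
glm_td <- tidy(glmfit)   # one row per model term

# Welch two-sample t-test of mpg between the two transmission types
tt <- t.test(mpg ~ am, data = mtcars)
tt_td <- tidy(tt)        # a single row describing the test

glm_td
tt_td
```

In both cases you get a data frame back, so the same downstream code can handle results from quite different procedures.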
View the README on the GitHub page, or install the package and run the vignette to see more examples and conventions.