I talked a little bit about tidy data in my recent post about dplyr, but you should really go check out Hadley’s paper on the subject.
R expects inputs to data analysis procedures to be in a tidy format, but the model output objects that you get back aren’t always tidy. The reshape2, tidyr, and dplyr packages are meant to take data frames, munge them around, and return a data frame. David Robinson's broom package bridges this gap by taking un-tidy output from model objects, which are not data frames, and returning it in a tidy data frame format.
(From the documentation): if you fit a linear model on the built-in mtcars dataset and view the object directly, this is what you’d see:
lmfit = lm(mpg ~ wt, mtcars)
lmfit
Call:
lm(formula = mpg ~ wt, data = mtcars)
Coefficients:
(Intercept)           wt
     37.285       -5.344
summary(lmfit)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
   Min     1Q Median     3Q    Max
-4.543 -2.365 -0.125  1.410  6.873
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   37.285      1.878   19.86  < 2e-16 ***
wt            -5.344      0.559   -9.56  1.3e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared: 0.753, Adjusted R-squared: 0.745
F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10
If you’re just trying to read it, this is good enough, but if you’re doing other follow-up analysis or visualization, you end up hacking around with str() and pulling out coefficients using indices, and everything gets ugly quick.
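To make that concrete, here is a rough sketch of the by-hand route using only base R. The variable names (coef_mat, wt_slope, coef_df) are just illustrative, and this is one of several ways people do it, not a recommended recipe.

```r
# Fit the same model as above
lmfit <- lm(mpg ~ wt, data = mtcars)

# The coefficient table lives inside the summary object as a matrix,
# with the term names stored as row names rather than as a column
coef_mat <- coef(summary(lmfit))

# Pull out a single value by position: row 2 is "wt", column 1 is "Estimate"
wt_slope <- coef_mat[2, 1]

# Turning the matrix into a data frame by hand means copying the
# row names into a real column and then dropping them
coef_df <- as.data.frame(coef_mat)
coef_df$term <- rownames(coef_df)
rownames(coef_df) <- NULL

coef_df
```

Positional indexing like coef_mat[2, 1] breaks silently as soon as the model formula changes, which is a big part of why this gets ugly.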
But the tidy function in the broom package, run on the fit object, probably gives you what you were looking for in a tidy data frame:
tidy(lmfit)
         term estimate stderror statistic   p.value
1 (Intercept)   37.285   1.8776    19.858 8.242e-19
2          wt   -5.344   0.5591    -9.559 1.294e-10
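Because the result is an ordinary data frame, follow-up work becomes plain subsetting by name. The sketch below assumes a reasonably recent broom release is installed; as far as I know, current versions name the standard error column std.error rather than the stderror shown in the output above, and the conf.int argument is an option of the lm tidier that adds confidence bounds.

```r
library(broom)

lmfit <- lm(mpg ~ wt, data = mtcars)

# One row per model term, with the term name as a regular column
td <- tidy(lmfit)

# Look up the slope by name instead of by position
wt_estimate <- td$estimate[td$term == "wt"]

# Ask for confidence intervals as extra columns (conf.low, conf.high)
td_ci <- tidy(lmfit, conf.int = TRUE)

td_ci
```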
The tidy() function also works on other types of model objects, like those produced by glm() and nls(), as well as popular built-in hypothesis testing tools like t.test(), cor.test(), or wilcox.test().
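As a quick illustration of that, here is a sketch that runs tidy() on a t-test and on a logistic regression, both on mtcars. The particular formulas (mpg ~ am and am ~ wt) are my own picks for the example, not taken from the broom documentation, and the exact columns returned can differ between broom versions.

```r
library(broom)

# Welch two-sample t-test of mpg by transmission type
tt <- t.test(mpg ~ am, data = mtcars)
tt_tidy <- tidy(tt)    # a single row describing the test

# Logistic regression of transmission type on weight
glmfit <- glm(am ~ wt, data = mtcars, family = binomial)
glm_tidy <- tidy(glmfit)    # one row per coefficient, as with lm()

tt_tidy
glm_tidy
```

Since both results are data frames, tidied output from many models or tests can be stacked and compared without special-casing each object type.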
View the README on the GitHub page, or install the package and run the vignette to see more examples and conventions.
Hadley Wickham's official tidy data paper is already out: https://twitter.com/hadleywickham/status/510515769793200128
Almost perfect, I'd also want the r^2's from the model... Cool tips!
You can get R^2 using a different tidying method from broom: glance(lmfit). While "tidy" is for computing values that are one-row-per-coefficient, glance returns values that are one-per-model, like R^2, adjusted R^2, sigma, AIC, BIC, etc. You can read more about the distinction in the broom manuscript: http://arxiv.org/abs/1412.3565
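For anyone following along, a minimal sketch of that glance() call on the same model is below. It assumes a recent broom release, where the R^2 columns are named r.squared and adj.r.squared; older versions may label things differently.

```r
library(broom)

lmfit <- lm(mpg ~ wt, data = mtcars)

# One row per model, with summary statistics as columns
gl <- glance(lmfit)

# Pull out R^2 and adjusted R^2 by name
gl$r.squared
gl$adj.r.squared
```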