Getting Genetics Done: R package to convert statistical analysis objects to tidy data frames

Tuesday, September 16, 2014


I talked a little bit about tidy data in my recent post about dplyr, but you should really go check out Hadley’s paper on the subject.

R expects inputs to data analysis procedures to be in a tidy format, but the model output objects that you get back aren’t always tidy. The reshape2, tidyr, and dplyr packages are meant to take data frames, munge them around, and return a data frame. David Robinson's broom package bridges this gap by taking un-tidy output from model objects, which are not data frames, and returning it in a tidy data frame format.

(From the documentation): if you fit a linear model to the built-in mtcars dataset and viewed the object directly, this is what you’d see:

lmfit = lm(mpg ~ wt, mtcars)
lmfit

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344

summary(lmfit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-4.543 -2.365 -0.125  1.410  6.873 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.285      1.878   19.86  < 2e-16 ***
wt            -5.344      0.559   -9.56  1.3e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10

If you’re just trying to read it, this is good enough, but if you’re doing follow-up analysis or visualization, you end up hacking around with str() and pulling out coefficients using indices, and everything gets ugly quick.
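To make that concrete, here is a minimal sketch of the manual route using only base R. The variable names (coefs, wt_pvalue, coef_df) are illustrative and are not taken from the broom documentation.

```r
# Fit the same model as above
lmfit <- lm(mpg ~ wt, data = mtcars)

# The coefficient table lives inside the summary object as a numeric matrix,
# with the term names stored as row names rather than as a column
coefs <- summary(lmfit)$coefficients
class(coefs)

# Pulling out a single value means indexing by row and column name (or position)
wt_pvalue <- coefs["wt", "Pr(>|t|)"]

# Converting to a data frame by hand: move the row names into a real column
coef_df <- as.data.frame(coefs)
coef_df$term <- rownames(coefs)
rownames(coef_df) <- NULL
coef_df
```

This works, but every model type stores its pieces in a different place, so the indexing code has to be rewritten for each one.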

But running the tidy() function from the broom package on the fit object gives you what you were probably looking for, in a tidy data frame:

library(broom)
tidy(lmfit)

         term estimate stderror statistic   p.value
1 (Intercept)   37.285   1.8776    19.858 8.242e-19
2          wt   -5.344   0.5591    -9.559 1.294e-10
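Because the result is an ordinary data frame, follow-up steps become plain subsetting. The sketch below assumes broom is installed. The names td, wt_estimate, and significant are illustrative, and the 0.05 cutoff is only an example. Exact column names can depend on the broom version (the output above shows stderror, while more recent releases use std.error), so the sketch sticks to term, estimate, and p.value.

```r
library(broom)

lmfit <- lm(mpg ~ wt, data = mtcars)
td <- tidy(lmfit)

# Terms are a regular column, so looking one up is a simple logical filter
wt_estimate <- td$estimate[td$term == "wt"]

# Keep only the terms that pass a (purely illustrative) 0.05 cutoff
significant <- td[td$p.value < 0.05, ]
significant
```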

The tidy() function also works on other types of model objects, like those produced by glm() and nls(), as well as popular built-in hypothesis testing tools like t.test(), cor.test(), or wilcox.test().
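As a rough illustration, assuming broom is installed, the same call works across a few of these object types. The specific models and tests here (a logistic regression of am on wt, a t-test of mpg by am, a correlation test of mpg and wt) were picked only because they run on mtcars. They are not from the original documentation.

```r
library(broom)

# Logistic regression: one row per model term, like the lm example
glmfit <- glm(am ~ wt, data = mtcars, family = binomial)
glm_tidy <- tidy(glmfit)
glm_tidy

# Two-sample t-test: a single row summarizing the test
tt_tidy <- tidy(t.test(mpg ~ am, data = mtcars))
tt_tidy

# Correlation test: also a single row
ct_tidy <- tidy(cor.test(mtcars$mpg, mtcars$wt))
ct_tidy
```

Since each result is a data frame with a p.value column, outputs from many tests can be stacked or compared without any per-object indexing code.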

View the README on the GitHub page, or install the package and run the vignette to see more examples and conventions.

broom: Convert statistical analysis objects from R into tidy format

3 comments:

Unknown, September 17, 2014 at 1:02 AM
Hadley Wickham's official tidy data paper is already out: https://twitter.com/hadleywickham/status/510515769793200128
swvanderlaan, December 9, 2014 at 1:04 AM
Almost perfect, I'd also want the r^2's from the model... Cool tips!