While flirting around with previously mentioned ggplot2 I came across an incredibly useful set of functions in the plyr package, made by Hadley Wickham, the same guy behind ggplot2. If you've ever used MySQL before, think of "GROUP BY", but here you can arbitrarily apply any R function to splits of the data, or write one yourself.
Imagine you have two SNPs, and 2 different variables you want info about across all three genotypes at each locus. You can generate some fake data using this command:
mydata=data.frame(X1=rnorm(30), X2=rnorm(30,5,2), SNP1=c(rep("AA",10), rep("Aa",10), rep("aa",10)), SNP2=c(rep("BB",10), rep("Bb",10), rep("bb",10)))
The data should look like this:
X1 X2 SNP1 SNP2 1 -0.73671602 2.76733658 AA BB 2 -1.83947190 5.70194282 AA BB 3 -0.46963370 4.89393264 AA BB 4 0.85105460 6.37610477 AA BB 5 -1.69043368 8.11399122 AA BB 6 -1.47995127 4.32866212 AA BB 7 0.21026063 4.32073489 AA BB 8 0.19038088 4.26631298 AA BB 9 -0.29759084 4.07349852 AA BB 10 -0.46843672 7.01048905 AA BB 11 0.93804369 3.78855823 Aa Bb 12 -1.46223800 5.08756968 Aa Bb 13 -0.01338604 -0.01494702 Aa Bb 14 0.13704332 4.53625220 Aa Bb 15 0.59214151 3.75293124 Aa Bb 16 -0.27658821 5.78701933 Aa Bb 17 0.05527139 5.99460464 Aa Bb 18 1.00077756 5.26838121 Aa Bb 19 0.62871877 6.98314827 Aa Bb 20 1.18925876 8.61457227 Aa Bb 21 -0.40389888 3.36446128 aa bb 22 -1.43420595 6.50801614 aa bb 23 1.79086285 5.39616828 aa bb 24 0.15886243 3.98010396 aa bb 25 0.57746024 4.93605364 aa bb 26 -1.11158987 6.40760662 aa bb 27 1.93484117 3.33698750 aa bb 28 0.10803302 4.56442931 aa bb 29 0.56648591 4.48502267 aa bb 30 0.03616814 5.77160936 aa bb
Now let's say you want the mean and standard deviation of X1 and X2 across each of the multilocus genotypes at SNP1 and SNP2. First, install and load the plyr package:
install.packages("plyr")
library(plyr)
Now run the following command:
ddply(mydata, c("SNP1","SNP2"), function(df) data.frame(meanX1=mean(df$X1), sdX1=sd(df$X1), meanX2=mean(df$X2), sdX2=sd(df$X2)))
And you get exactly what you asked for:
SNP1 SNP2 meanX1 sdX1 meanX2 sdX2 1 aa bb 0.2223019 1.0855063 4.875046 1.144623 2 Aa Bb 0.2789043 0.7817727 4.979809 2.286914 3 AA BB -0.5730538 0.8834038 5.185301 1.601626
That command above may appear complicated at first, but it isn't really. Cerebral mastication gives a very easy to follow, concise explanation of what that command is doing. Also, check out the slides he presented at a recent R meetup discussion:
There's more to plyr than ddply, so check it out for yourself!
thanks for bringing this to light - as someone familiar with SQL and now being "forced" to solely work in R, i've been frustrated with the lack of group-by type functions. your blog overall has been very helpful to me. i'm so glad i came by your poster at ASHG this fall - otherwise i would have never found out about it!
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteHi there ! Struggling with the same problem here. Coming from SQL and "group by"-thinking I had huge problems finding a similarly simple and elegant solution in R, e.g. tapply() works, but is cumbersome. I could just use the sqldf package but that would be cheating :)
ReplyDeleteIf u want the mean and sd of one variable only, a more "readable" version of the code can be made combining plyr with the incredibly useful smean.sd() function from the Hmisc package.
ddply(mydata, .(SNP1), function(df)smean.sd(df$SNP1))
I love the fact that R has a million ways of solving a problem and that each user can find one that suits him or her.
BTW the "data.frame" statement in your code is superfluous , a c() will suffice (as ddply always returns a dataframe)
I'm sure there are other ways, anyhow plyr is definitely in my toolbox to stay!
I enjoy your blog. It is helpful. Plyr is definitely cool and provides some much needed consistency when applying functions across data frames, arrays, and lists. I plan to use it more. On the other hand - I just wanted to point out that this exercise is also very doable with a "standard" R function like aggregate - in case there were those who thought that one *had* to use plyr to solve a problem like this (mostly new comers to R). This is off the top of my head so there may even be more "short cuts" though I think this is readable. I used your data as it appears on this page.
ReplyDelete> f = function(x) {mx=mean(x);sdx=sd(x);mz=c(mean=mx,sd=sdx)}
> aggregate(mydata[c('X1','X2')],by=list(SNP1=SNP1,SNP2=SNP2),f)
SNP1 SNP2 X1.mean X1.sd X2.mean X2.sd
1 AA BB 0.4007287 0.8677048 6.013512 2.594847
2 Aa Bb 0.1908830 0.8702564 5.810701 1.806888
3 aa bb 0.4249260 1.4266786 5.496132 1.824754