Wednesday, November 4, 2009

Split, apply, and combine in R using PLYR

While flirting around with previously mentioned ggplot2 I came across an incredibly useful set of functions in the plyr package, made by Hadley Wickham, the same guy behind ggplot2.  If you've ever used MySQL before, think of "GROUP BY", but here you can arbitrarily apply any R function to splits of the data, or write one yourself.

Imagine you have two SNPs, and 2 different variables you want info about across all three genotypes at each locus.  You can generate some fake data using this command:

mydata=data.frame(X1=rnorm(30), X2=rnorm(30,5,2), SNP1=c(rep("AA",10), rep("Aa",10), rep("aa",10)), SNP2=c(rep("BB",10), rep("Bb",10), rep("bb",10)))

The data should look like this:

            X1          X2 SNP1 SNP2
1  -0.73671602  2.76733658   AA   BB
2  -1.83947190  5.70194282   AA   BB
3  -0.46963370  4.89393264   AA   BB
4   0.85105460  6.37610477   AA   BB
5  -1.69043368  8.11399122   AA   BB
6  -1.47995127  4.32866212   AA   BB
7   0.21026063  4.32073489   AA   BB
8   0.19038088  4.26631298   AA   BB
9  -0.29759084  4.07349852   AA   BB
10 -0.46843672  7.01048905   AA   BB
11  0.93804369  3.78855823   Aa   Bb
12 -1.46223800  5.08756968   Aa   Bb
13 -0.01338604 -0.01494702   Aa   Bb
14  0.13704332  4.53625220   Aa   Bb
15  0.59214151  3.75293124   Aa   Bb
16 -0.27658821  5.78701933   Aa   Bb
17  0.05527139  5.99460464   Aa   Bb
18  1.00077756  5.26838121   Aa   Bb
19  0.62871877  6.98314827   Aa   Bb
20  1.18925876  8.61457227   Aa   Bb
21 -0.40389888  3.36446128   aa   bb
22 -1.43420595  6.50801614   aa   bb
23  1.79086285  5.39616828   aa   bb
24  0.15886243  3.98010396   aa   bb
25  0.57746024  4.93605364   aa   bb
26 -1.11158987  6.40760662   aa   bb
27  1.93484117  3.33698750   aa   bb
28  0.10803302  4.56442931   aa   bb
29  0.56648591  4.48502267   aa   bb
30  0.03616814  5.77160936   aa   bb

Now let's say you want the mean and standard deviation of X1 and X2 across each of the multilocus genotypes at SNP1 and SNP2.  First, install and load the plyr package:


Now run the following command:

ddply(mydata, c("SNP1","SNP2"), function(df) data.frame(meanX1=mean(df$X1), sdX1=sd(df$X1), meanX2=mean(df$X2), sdX2=sd(df$X2)))

And you get exactly what you asked for:

  SNP1 SNP2     meanX1      sdX1   meanX2     sdX2
1   aa   bb  0.2223019 1.0855063 4.875046 1.144623
2   Aa   Bb  0.2789043 0.7817727 4.979809 2.286914
3   AA   BB -0.5730538 0.8834038 5.185301 1.601626

That command above may appear complicated at first, but it isn't really.  Cerebral mastication gives a very easy to follow, concise explanation of what that command is doing.  Also, check out the slides he presented at a recent R meetup discussion:

There's more to plyr than ddply, so check it out for yourself!


  1. thanks for bringing this to light - as someone familiar with SQL and now being "forced" to solely work in R, i've been frustrated with the lack of group-by type functions. your blog overall has been very helpful to me. i'm so glad i came by your poster at ASHG this fall - otherwise i would have never found out about it!

  2. This comment has been removed by the author.

  3. This comment has been removed by the author.

  4. Hi there ! Struggling with the same problem here. Coming from SQL and "group by"-thinking I had huge problems finding a similarly simple and elegant solution in R, e.g. tapply() works, but is cumbersome. I could just use the sqldf package but that would be cheating :)

    If u want the mean and sd of one variable only, a more "readable" version of the code can be made combining plyr with the incredibly useful function from the Hmisc package.

    ddply(mydata, .(SNP1), function(df)$SNP1))

    I love the fact that R has a million ways of solving a problem and that each user can find one that suits him or her.

    BTW the "data.frame" statement in your code is superfluous , a c() will suffice (as ddply always returns a dataframe)

    I'm sure there are other ways, anyhow plyr is definitely in my toolbox to stay!

  5. I enjoy your blog. It is helpful. Plyr is definitely cool and provides some much needed consistency when applying functions across data frames, arrays, and lists. I plan to use it more. On the other hand - I just wanted to point out that this exercise is also very doable with a "standard" R function like aggregate - in case there were those who thought that one *had* to use plyr to solve a problem like this (mostly new comers to R). This is off the top of my head so there may even be more "short cuts" though I think this is readable. I used your data as it appears on this page.

    > f = function(x) {mx=mean(x);sdx=sd(x);mz=c(mean=mx,sd=sdx)}
    > aggregate(mydata[c('X1','X2')],by=list(SNP1=SNP1,SNP2=SNP2),f)

    SNP1 SNP2 X1.mean X2.mean
    1 AA BB 0.4007287 0.8677048 6.013512 2.594847
    2 Aa Bb 0.1908830 0.8702564 5.810701 1.806888
    3 aa bb 0.4249260 1.4266786 5.496132 1.824754


Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.