Tuesday, May 25, 2010

Use SQL queries to manipulate data frames in R with sqldf package

I've covered a few topics in the past including the plyr package, which is kind of like "GROUP BY" for R, and the merge function for merging datasets. I only recently found the sqldf package for R, and it's already one of the most useful packages I've ever installed. The main function in the package is sqldf(), which takes a quoted string as an argument. You can treat data frames as tables as if they were in a relational database. You can use some of the finer aspects of SQL like the INNER JOIN or the subquery, which are extremely difficult operations to mimic using standard R programming. While this isn't an SQL tutorial, try out some of these commands to see what sqldf can do for you. Read more about the sqldf package here.


> # install the package
> install.packages("sqldf")
> 
> #load it
> library(sqldf)
> 
> # set the random seed
> set.seed(42)
> 
> #generate some data
> df1 = data.frame(id=1:10,class=rep(c("case","ctrl"),5))
> df2 = data.frame(id=1:10,cov=round(runif(10)*10,1))
> 
> #look at the data
> df1
   id class
1   1  case
2   2  ctrl
3   3  case
4   4  ctrl
5   5  case
6   6  ctrl
7   7  case
8   8  ctrl
9   9  case
10 10  ctrl
> df2
   id cov
1   1 9.1
2   2 9.4
3   3 2.9
4   4 8.3
5   5 6.4
6   6 5.2
7   7 7.4
8   8 1.3
9   9 6.6
10 10 7.1
> 
> # do an inner join
> sqldf("select * from df1 join df2 on df1.id=df2.id")
   id class id cov
1   1  case  1 9.1
2   2  ctrl  2 9.4
3   3  case  3 2.9
4   4  ctrl  4 8.3
5   5  case  5 6.4
6   6  ctrl  6 5.2
7   7  case  7 7.4
8   8  ctrl  8 1.3
9   9  case  9 6.6
10 10  ctrl 10 7.1
> 
> # where clauses
> sqldf("select * from df1 join df2 on df1.id=df2.id where class='case'")
  id class id cov
1  1  case  1 9.1
2  3  case  3 2.9
3  5  case  5 6.4
4  7  case  7 7.4
5  9  case  9 6.6
> 
> # lots of sql fun
> sqldf("select df1.id, df2.cov as covariate from df1 join df2 on df1.id=df2.id where class='case' and cov>3 order by cov")
  id covariate
1  5       6.4
2  9       6.6
3  7       7.4
4  1       9.1

11 comments:

  1. Another huge advantage of sqldf is that it is fast! If you need to compute group means for a large dataset, sqldf is great. plyr is great for its generality, but if you have a task that fits into the sql framework and performance matters, I can't recommend sqldf enough.

    ReplyDelete
  2. You should also check out data.table (project is on r-forge). It's super fast over large datasets, too.

    ReplyDelete
  3. data.table is great, but unfortunately you cannot use it completely interchangeably as a data.frame. This means you have to change your usual syntax for a number of operations. I found that to be too much of a hassle to make the switch-over.

    ReplyDelete
  4. On a personal note... What is a patent agent?
    Just curious.

    ReplyDelete
  5. Good day and thanks for the post. This package seems very useful but unfortunately I can't seem to get it to work? This is the error message I get when I try your example:

    > sqldf("select * from df1 join df2 on df1.id=df2.id")
    Error in sqldf("select * from df1 join df2 on df1.id=df2.id") :
    argument "dbname" is missing, with no default
    >

    any ideas?

    Thanks in advance.

    ReplyDelete
  6. Hmm, not sure why the query isn't working for you. Try running it again but put a semicolon after ....df2.id. I should have done that myself. Also, I've run into some problems when both sqldf and the RMySQL package are both loaded. Perhaps try running the same query without loading the RMySQL package.

    ReplyDelete
  7. Regarding the query on May 31st by Anonymous, it does work for me.

    > df1 <- data.frame(id = 1:3, nos = 4:6)
    > df2 <- data.frame(id = 2:4, letters = letters[1:3])
    > sqldf("select * from df1 join df2 on df1.id=df2.id")
    id nos id letters
    1 2 5 2 a
    2 3 6 3 b

    > # You may prefer this
    > sqldf("select * from df1 join df2 using(id)")
    id nos letters
    1 2 5 a
    2 3 6 b

    ReplyDelete
  8. Just saw another blog with some examples of sqldf here. http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/

    ReplyDelete
  9. this is very useful for me.
    Thanks you

    ReplyDelete
  10. how to use sqldf for making joins on 2 or more columns. concat(coulmn_1,column_2) does not work.

    city_pincode_air <- sqldf("select t1.*, t2.Increase_by*24 as increase from CPA_S2D as t1 inner join City_City_pincode_rawdata as t2 on
    concat(upper(t1.SELLERCITY) ,t1.PINCODE)= concat(upper(t2.From_City), t2.Dest_Pincode")

    ReplyDelete

Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.