Monday, July 27, 2009

R Snippet for Sampling from a Dataframe

It took me a while to figure this out, so I thought I'd share.

I have a dataframe with millions of observations in it, and I want to estimate a density distribution, which is a memory-intensive process. Running my kde2d function on the full dataframe throws an error: R tries to allocate a vector that is gigabytes in size. A reasonable alternative is to run the function on a smaller subset of the data. R has a nifty sample function, but it is designed to randomly sample from a vector, not a 2D dataframe. The sample function CAN work for this though, if you use it to sample row numbers and then index the dataframe with them, like so:
sub <- Dataset[sample(nrow(Dataset), size=100000, replace=FALSE),]


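Now sub contains a random subset of 100000 observations (rows) from my 2D dataframe. This example was a sample without replacement, so size cannot be larger than the number of rows in the dataframe. If you set replace=TRUE, you get a sample with replacement instead, and the same row can be drawn more than once.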
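Here is a minimal, self-contained sketch of the same idea that you can paste into R. The toy Dataset, its x and y columns, and the n_keep variable are stand-ins I made up for illustration; swap in your own dataframe and sample size.

```r
set.seed(42)  # fix the random seed so the draw is reproducible

# Hypothetical stand-in for a large dataframe with two numeric columns
Dataset <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

# Never ask for more rows than exist when sampling without replacement
n_keep <- min(100000, nrow(Dataset))

# Sample row numbers, then use them to index the dataframe
idx <- sample(nrow(Dataset), size = n_keep, replace = FALSE)

# drop = FALSE keeps the result a dataframe even if there is only one column
sub <- Dataset[idx, , drop = FALSE]

dim(sub)
```

The subset can then be passed to the density estimate in place of the full dataframe, for example the x and y columns of sub instead of those of Dataset.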

5 comments:

  1. nice! or (taking the defaults for sample and wanting to sample at 10%):

    sub=Dataset[sample(nrow(Dataset), round(nrow(Dataset)*0.10)),]

  2. Cheers for the handy tip. This saved me some time.

  3. Thank you for sharing!

  4. Thank you for sharing. If there is a weight associated with each observation in the data frame, how can I sample from this data frame? Thanks.

    Replies
    1. A weight in the sense that you want each observation to be drawn with probability proportional to its weight? If that's the case, the simplest solution I can think of is to create a new data frame with each observation repeated in proportion to its weight and use the code above. Simple but inefficient (could be problematic for very large data). If you need efficiency without creating a new data frame, you might try drawing a uniform random number for each observation, comparing it to the weight (scaled to lie between 0 and 1), and selecting the observation if the number is below the weight, or something like that.

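For the weighted-sampling question in the comments, another option is the prob argument of sample, which takes a vector of weights (they do not have to sum to 1). The sketch below assumes the weights live in a column called w; that column name and the toy data are hypothetical, not something from the original post.

```r
set.seed(1)  # fix the random seed so the draw is reproducible

# Hypothetical dataframe: nine rows with weight 1 and one row with weight 91
Dataset <- data.frame(id = 1:10, w = c(rep(1, 9), 91))

# Draw row numbers with probability proportional to the weight column
idx <- sample(nrow(Dataset), size = 1000, replace = TRUE, prob = Dataset$w)
sub <- Dataset[idx, ]

# Share of the sample taken by the heavily weighted row (expected near 0.91)
mean(sub$id == 10)
```

With replace=TRUE, each draw picks a row with probability proportional to its weight. With replace=FALSE the weights are applied sequentially to the rows that remain, so the chance that a row ends up in the sample is no longer exactly proportional to its weight.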

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.