What if you wanted to split a dataset into training/testing sets but ensure that there are no significant differences in a variable of interest across the two splits?
For example, if we use the splitdf() function from last time to split up the iris dataset, setting the random seed to 44, it turns out the outcome variable of interest, Sepal.Length, differs significantly between the two splits.
splitdf <- function(dataframe, seed=NULL) {
    if (!is.null(seed)) set.seed(seed)
    index <- 1:nrow(dataframe)
    trainindex <- sample(index, trunc(length(index)/2))
    trainset <- dataframe[trainindex, ]
    testset <- dataframe[-trainindex, ]
    list(trainset=trainset, testset=testset)
}

data(iris)
s44 <- splitdf(iris, seed=44)
train <- s44$trainset
test <- s44$testset
t.test(train$Sepal.Length, test$Sepal.Length)
What if we wanted to ensure that the means of Sepal.Length, as well as the other continuous variables in the dataset, do not differ between the two splits?
Again, this is probably something that's already available in an existing package, but I quickly wrote another function to do this. It's called splitdf.randomize(), and it depends on splitdf() from before. You give splitdf.randomize() the data frame you want to split and a character vector containing all the columns you want to keep balanced between the two splits. The function is a wrapper for splitdf(): it randomly makes a split and runs a t-test on each column you specify. If the p-value on any of those t-tests is less than 0.5 (yes, 0.5, not 0.05), the loop restarts and tries splitting the dataset again. (Currently this only works with continuous variables, but if you wanted to extend it to categorical variables, it wouldn't be hard to throw a Fisher's exact test into the while loop; a rough sketch of what that might look like appears after the example below.)
For each iteration, the function prints out the p-value for the t-test on each of the variable names you supply. As you can see in this example, it took four iterations to ensure that all of the continuous variables were evenly distributed between the training and testing sets. Here it is in action:
# splitdf splits a data frame into a training and testing set.
# returns a list of two data frames: trainset and testset.
# you can optionally apply a random seed.
splitdf <- function(dataframe, seed=NULL, trainfrac=0.5) {
    if (trainfrac<=0 || trainfrac>=1) stop("Training fraction must be between 0 and 1, not inclusive")
    if (!is.null(seed)) set.seed(seed)
    index <- 1:nrow(dataframe)
    trainindex <- sample(index, trunc(length(index)*trainfrac))
    trainset <- dataframe[trainindex, ]
    testset <- dataframe[-trainindex, ]
    list(trainset=trainset, testset=testset)
}

# this function utilizes the function above.
# you give it a data frame you want to randomize,
# and a character vector with column names you want to be sure are
# equally distributed among the two different sets.
# these columns must be continuous variables. chi2 not yet implemented.
splitdf.randomize <- function(dataframe, ttestcolnames=c("cols","to","test"), trainfrac=0.5) {
    d <- dataframe
    if (!all(ttestcolnames %in% names(d))) stop(paste(ttestcolnames, "not in dataframe"))
    ps <- NULL
    # no seed is passed to splitdf here, so each pass through the loop makes a new split
    while (is.null(ps) || any(ps < 0.5)) {
        sets <- splitdf(d, trainfrac=trainfrac)
        trainset <- sets$trainset
        testset <- sets$testset
        ttestcols <- which(names(d) %in% ttestcolnames)
        ps <- NULL
        for (col in ttestcols) {
            p <- t.test(trainset[ ,col], testset[ ,col])$p.value
            ps <- c(ps, p)
        }
        print(paste(ttestcolnames, " t-test p-value =", ps))
        cat("\n")
    }
    list(trainset=trainset, testset=testset)
}
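Note that this version of splitdf() also takes a trainfrac argument, so you're not limited to a 50/50 split. A quick sketch of how that looks (the seed and fraction here are arbitrary):

# a 75/25 split: trunc(150*0.75) = 112 rows go to the training set
s75 <- splitdf(iris, seed=42, trainfrac=0.75)
nrow(s75$trainset)   # 112
nrow(s75$testset)    # 38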
# sometimes you might have significant differences in variables of interest
# between training and testing sets.
data(iris)
s44 <- splitdf(iris, seed=44)
train <- s44$trainset
test <- s44$testset
t.test(train$Sepal.Length, test$Sepal.Length)

# first, specify which columns you want to ensure are "even" between the sets
cols <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")

# now, split up the dataset again, keeping even distribution of those variables.
set.seed(80842)
evensplit <- splitdf.randomize(iris, cols)
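As for the categorical extension mentioned above, I haven't implemented it, but here's a rough, untested sketch of a helper it might use (fisher.balance.ps is a made-up name, and it assumes the columns are factors):

# hypothetical helper: for each categorical column, build a contingency table of
# its levels in the training vs. testing set and get Fisher's exact test p-value.
# these p-values could be appended to ps inside splitdf.randomize's while loop.
fisher.balance.ps <- function(trainset, testset, fishercolnames) {
    sapply(fishercolnames, function(colname) {
        counts <- rbind(train=table(trainset[[colname]]),
                        test=table(testset[[colname]]))
        fisher.test(counts)$p.value
    })
}

# example: check how balanced Species is between the splits above
fisher.balance.ps(evensplit$trainset, evensplit$testset, "Species")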
P-values to judge balance? When the null hypothesis holds (the means of the training and testing populations are equal), which it does here because of randomization, the p-value is uniformly distributed over [0,1]. Thus, for a single covariate, the probability you'll have to re-randomize is 1/2; for k independent covariates, it's 1-(1/2)^k. I'm not surprised it took you four tries to get balance based on your criteria.
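A quick simulation bears this arithmetic out (a sketch assuming k independent covariates; the iris measurements are correlated, so it's only approximate there):

# with k independent covariates, how often does at least one t-test between
# two random halves come out with p < 0.5 (forcing a re-split)?
set.seed(1)
k <- 4
resplit <- replicate(2000, {
    x <- matrix(rnorm(100 * k), ncol=k)   # 100 rows, k independent covariates
    idx <- sample(1:100, 50)              # random 50/50 split
    ps <- apply(x, 2, function(v) t.test(v[idx], v[-idx])$p.value)
    any(ps < 0.5)
})
mean(resplit)   # close to the predicted 1 - (1/2)^4 = 0.9375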
I think we should appreciate the randomness that occurs, because that is what is allowing us to estimate the out-of-sample error from the testing set.
Stephen, could you elaborate on the rationale for choosing p = 0.5 rather than 0.05?
My thinking was to make sure there was no remotely significant difference between the columns to randomize on.