Comments on Getting Genetics Done: Split a Data Frame into Testing and Training Sets in R

Hello everyone. Great job for this function, congr...

2014-09-03T02:34:27.397-05:00

Hello everyone. Great job on this function, congratulations and thank you all. However, the test subset keeps the same row order as the full (i.e. original) data frame. Below is a small modification of the function that also shuffles the test set.

splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index_train <- seq_len(nrow(dataframe))
  trainindex <- sample(index_train, trunc(length(index_train) / 2), replace = FALSE)
  trainset <- dataframe[trainindex, ]
  testset_0 <- dataframe[-trainindex, ] # This subset keeps the same row order as the original dataframe

  # Shuffle the rows of the test set
  index_test <- seq_len(nrow(testset_0))
  testindex <- sample(index_test, length(index_test), replace = FALSE)
  testset <- testset_0[testindex, ]

  # View(testset_0); View(testset) # for debugging

  list(trainset = trainset, testset = testset)
}

hello everyone, please i would like to split the d...

2014-07-16T11:25:57.693-05:00

Hello everyone, I would like to randomly split the data I am working on into a 70% training set and a 30% test set, 150 times, with a different combination of samples each time. Kindly help with this.
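One way to approach the repeated 70/30 split is to draw a fresh random index on each repetition and keep all the splits in a list. This is only a minimal sketch: `repeated_split` and its arguments `times` and `p` are hypothetical names made up for illustration, and `iris` stands in for your own data. Independent random draws are very unlikely to repeat, but nothing here strictly guarantees that all 150 combinations differ.

```r
# Hypothetical helper (not part of the original post):
# returns a list of `times` independent random train/test splits.
repeated_split <- function(dataframe, times = 150, p = 0.7, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # seed once, so the whole sequence is reproducible
  n <- nrow(dataframe)
  n_train <- round(n * p)             # number of rows going to the training set
  lapply(seq_len(times), function(i) {
    trainindex <- sample(seq_len(n), n_train)
    list(trainset = dataframe[trainindex, , drop = FALSE],
         testset  = dataframe[-trainindex, , drop = FALSE])
  })
}

# iris (150 rows) is used purely as stand-in data
splits <- repeated_split(iris, times = 150, p = 0.7, seed = 42)
length(splits)               # 150 splits
nrow(splits[[1]]$trainset)   # 105 rows (70%)
nrow(splits[[1]]$testset)    # 45 rows (30%)
```

Each element of `splits` has the same `trainset`/`testset` shape as the list returned by the function in the post, so any model-fitting code can loop over it.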

hello everyone, how can i get the coefficients for...

2014-03-26T02:35:26.692-05:00

Hello everyone, how can I get the coefficients for the model that I generated with this function? The usual summary(model) or coefficients(model) are not giving me the desired results. I want to get the intercept and the predictors' coefficients.
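The question does not say which kind of model was fitted, so the following is only a sketch under the assumption that it is a linear model fitted with `lm()` on the training half. The data (`mtcars`) and the formula are stand-ins chosen for illustration. Other model types store their parameters differently, and `coef()` may not apply to them.

```r
# Assumption: a linear model fitted with lm() on a training set.
set.seed(1)
trainindex <- sample(seq_len(nrow(mtcars)), trunc(nrow(mtcars) / 2))
trainset <- mtcars[trainindex, ]

fit <- lm(mpg ~ wt + hp, data = trainset)

coefs <- coef(fit)           # named numeric vector
coefs["(Intercept)"]         # the intercept
coefs[c("wt", "hp")]         # the slopes of the predictors

# Full table with estimates, standard errors, t values and p values
summary(fit)$coefficients
```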

works well thanks

2014-02-10T09:41:17.464-06:00

works well thanks

This post was helpful! I used your function and mo...

2014-02-01T20:43:04.965-06:00

This post was helpful! I used your function and modified it to control the size of the respective train and test sets. I didn't know about createDataPartition. I'll try to incorporate it into my workflow for some cross-validation I need to do.

The modification I made was to add the desired size of the train, with the complement still going into the test set.

# n is the number of rows in the training set; it defaults to half of the rows
splitDataFrame <- function(dataframe, seed = NULL, n = trunc(nrow(dataframe) / 2)) {
  if (!is.null(seed)) set.seed(seed)
  index <- seq_len(nrow(dataframe))
  trainindex <- sample(index, n)
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  list(trainset = trainset, testset = testset)
}

Here is an example of calling the function with the desired size for the train set, sending 72% of the records to the train set and 28% to the test set:

dataList <- splitDataFrame(workingData, NULL, round(nrow(workingData) * 0.72))
train <- dataList$trainset
test <- dataList$testset

Thanks for Sharing! It's very helpful

2013-11-19T03:20:21.276-06:00

Thanks for Sharing! It's very helpful

Thanks for sharing! that was quite useful! But con...

2013-09-21T14:41:57.044-05:00

Thanks for sharing! That was quite useful! But consider a dataset in which the subjects are divided into 2 groups (patients and control subjects), and you wish to split the main dataset into training and test sets (as you did) WHILE keeping the proportions of group 1 and group 2 in each set the same as in the complete dataset. How would you do that?
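One way to keep the group proportions is to sample separately within each group and then combine the indices. The sketch below uses base R only. `stratified_split`, its `group` and `p` arguments, and the toy `subjects` data frame are all hypothetical names invented for illustration. The caret function createDataPartition, mentioned elsewhere in these comments, is designed for this kind of stratified sampling as well.

```r
# Hypothetical helper (not from the original post): sample a fraction p
# of the rows within each level of `group`, so proportions are preserved.
stratified_split <- function(dataframe, group, p = 0.5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Row indices of the data frame, separated by group level
  by_group <- split(seq_len(nrow(dataframe)), dataframe[[group]])
  trainindex <- unlist(lapply(by_group, function(idx) {
    # Indexing with sample.int() avoids the surprise of sample(idx)
    # when a group contains a single row
    idx[sample.int(length(idx), trunc(length(idx) * p))]
  }), use.names = FALSE)
  testindex <- setdiff(seq_len(nrow(dataframe)), trainindex)
  list(trainset = dataframe[trainindex, , drop = FALSE],
       testset  = dataframe[testindex, , drop = FALSE])
}

# Toy data: 30 patients and 70 controls
subjects <- data.frame(id = 1:100,
                       status = rep(c("patient", "control"), times = c(30, 70)))
parts <- stratified_split(subjects, "status", p = 0.5, seed = 1)
table(parts$trainset$status)   # 35 controls, 15 patients
table(parts$testset$status)    # 35 controls, 15 patients
```

When a group size times `p` is not a whole number, `trunc()` rounds down, so the proportions are only approximately preserved.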

Hello everyone, could anyone help me how to calcul...

2013-06-26T10:21:37.676-05:00

Hello everyone, could anyone show me how to calculate the confusion matrix?
Thank you in advance.
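For a classifier, a confusion matrix is a cross-tabulation of predicted against actual classes, which base R's `table()` can produce. The sketch below uses made-up `actual` and `predicted` vectors. In practice `predicted` would come from calling `predict()` on the test set.

```r
# Made-up labels for illustration; in practice `predicted` would be
# the output of predict(model, newdata = testset)
actual    <- factor(c("yes", "no", "yes", "yes", "no", "no", "yes", "no"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no", "yes", "no", "yes", "yes", "no"),
                    levels = c("no", "yes"))

# Rows are predictions, columns are the true classes
confusion <- table(Predicted = predicted, Actual = actual)
confusion

# Overall accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy   # 0.75
```

Setting the same `levels` on both factors keeps the table square even if one class is never predicted.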

Thank you for this function, helped me a lot on my...

2013-03-25T21:09:16.221-05:00

Thank you for this function, helped me a lot on my assignment. (Also, duly cited :) )

Thank you very much! I'm modifying your functi...

2013-02-25T08:33:32.666-06:00

Thank you very much! I'm modifying your function so I can input whatever size of training data set I want. Don't worry, I'll cite you in my assignment :-)

Modify this line: trainindex <- sample(index, ...

2013-01-09T07:00:10.887-06:00

Modify this line:

trainindex <- sample(index, trunc(length(index)/2))
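For 2/3 training and 1/3 test, the fraction in that line can be changed from 1/2 to 2/3. A minimal standalone sketch, with `iris` standing in for your own data frame and an arbitrary seed:

```r
dataframe <- iris                     # stand-in data, 150 rows
index <- seq_len(nrow(dataframe))

set.seed(123)                         # arbitrary seed, for reproducibility only
# The modified line: take 2/3 of the rows instead of 1/2
trainindex <- sample(index, trunc(length(index) * 2 / 3))

trainset <- dataframe[trainindex, ]
testset  <- dataframe[-trainindex, ]
nrow(trainset)   # 100
nrow(testset)    # 50
```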

how did you split it to 1/3 test and 2/3 training?...

2013-01-09T04:27:59.281-06:00

how did you split it to 1/3 test and 2/3 training?

Yep, that's exactly what I'm doing with th...

2012-12-09T16:47:44.831-06:00

Yep, that's exactly what I'm doing with the set seed.

Thanks for this, I found it very useful. If I'...

2012-12-09T15:51:02.026-06:00

Thanks for this, I found it very useful. If I'm not mistaken, the seed setting in your function offers something that neither split nor createDataPartition has: the ability to reconstruct your sets later... is this correct?
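The reproducibility that the seed gives can be checked directly: two calls with the same seed should return identical sets. The `splitdf` below is a sketch reconstructed from the versions quoted in these comments, not necessarily the exact function from the post, and the seed value is arbitrary.

```r
# Sketch of a splitdf-style function, reconstructed from the comments
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- seq_len(nrow(dataframe))
  trainindex <- sample(index, trunc(length(index) / 2))
  list(trainset = dataframe[trainindex, ],
       testset  = dataframe[-trainindex, ])
}

first  <- splitdf(iris, seed = 808)
second <- splitdf(iris, seed = 808)
same_split <- identical(first, second)
same_split   # TRUE: the same seed reproduces the same split
```

One caveat: the default algorithm behind `sample()` changed in R 3.6.0, so a seed recorded under an older version may give a different split on a newer one.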

Thanks for sharing, I will bookmark and be back ag...

2012-08-11T04:18:52.653-05:00

Thanks for sharing, I will bookmark and be back again

Thanks a lot! It also works if I want 1/3 in test ...

2012-01-12T18:38:15.436-06:00

Thanks a lot! It also works if I want 1/3 in test and 2/3 in training.

You can use your test set with two different purpo...

2011-02-25T10:35:54.286-06:00

You can use your test set with two different purposes:

1) You use it as a completely independent test set in order to get an estimate of how well your model will perform on completely unseen data.

In this case you should never use your test set to choose your model. That means that you shouldn't repeat an experiment because you are not satisfied with the result in the test set.

2) You use it as a validation set to choose your best model.

In this case you might have more confidence that you are choosing the right model, but the result you get at the end is biased and you can't use it as an estimate of the error of your model in unseen data.

Finally, I also prefer k-fold cross-validation to choose the best algorithms and adjust parameters. With this approach, the result cannot be used as an estimate of the error in unseen data, but the methodology is very good to compare performances of the different algorithms and parameters.
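To make the k-fold idea concrete, here is a minimal base R sketch. The choice of `k = 5`, the `mtcars` data and the linear model are assumptions made purely for illustration. Packages such as caret automate this, but the mechanics are short enough to write by hand.

```r
set.seed(2011)             # arbitrary seed
k <- 5
data <- mtcars             # stand-in data

# Assign each row to one of k folds, in random order
folds <- sample(rep(seq_len(k), length.out = nrow(data)))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  train <- data[folds != i, ]
  held  <- data[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((held$mpg - predict(fit, newdata = held))^2))
})

fold_rmse          # one error value per fold
mean(fold_rmse)    # cross-validated error, useful for comparing models
```

Running the same loop with a different formula or algorithm and comparing the mean errors is the kind of model comparison described above.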

You may also want to take a look at Friedman's...

2011-02-25T08:40:52.696-06:00

You may also want to take a look at Friedman's stochastic gradient boosting (package is available in R). It's been shown to outperform Random Forest.

I also agree with the other comment that a separate test set is not necessarily needed. If you do cross-validation, it gives quite an accurate estimate of the prediction error.

I agree with Jan. Max Kuhn's caret package ha...

2011-02-25T08:01:03.200-06:00

I agree with Jan.

Max Kuhn's caret package has many more desirable features to compose training and test sets.

+ he replies promptly if you have questions :-)

Hi I don't know why you need to split data int...

2011-02-25T05:46:52.719-06:00

Hi
I don't know why you need to split the data into training and test sets. Random forest produces its own 'out of bag' error estimate by growing trees on two thirds and testing on one third of the data (or thereabouts) whilst running. So you should use the whole dataset.

For the other methods there are R packages that handle cross-validation and feature selection automatically, like 'ipred' and 'MLInterfaces'. Cross-validation is far superior to a single train and test set. Indeed, you should only choose a single train and test set if they were drawn separately, e.g. a model based on patients from a clinical trial is tested on patients from a new hospital.

Have a look at package caret. It will do the job (...

2011-02-25T01:51:26.408-06:00

Have a look at package caret. It will do the job (including the selection of training/test sets).

Take a look at the caret package. It does this and...

2011-02-25T01:49:44.372-06:00

Take a look at the caret package. It does this and a lot more.