Comments on Getting Genetics Done: Split a Data Frame into Testing and Training Sets in R
Stephen Turner (http://www.blogger.com/profile/06656711316726116187)
2023-09-25T09:01:44.323-05:00

2014-09-03T02:34:27.397-05:00
Hello everyone. Great job on this function; congratulations and thank you all. However, the test subset follows the same order as the full (i.e. original) data set (the data frame). Below I show a small modification of the function that also shuffles the test set.

splitdf <- function(dataframe, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)
  index_train <- 1:nrow(dataframe)
  trainindex <- sample(index_train, trunc(length(index_train)/2), replace = FALSE)
  trainset <- dataframe[trainindex, ]
  testset_0 <- dataframe[-trainindex, ] # This subset follows the same order as the original dataframe

  index_test <- 1:nrow(testset_0)
  testindex <- sample(index_test, trunc(length(index_test)), replace = FALSE)
  testset <- testset_0[testindex, ]

  # View(testset_0);View(testset); # for debugging

  list(trainset=trainset,testset=testset)
}
German (https://www.blogger.com/profile/02061541084665117242)

2014-07-16T11:25:57.693-05:00
Hello everyone, I would like to split the data I am working on into a 70% training set and a 30% test set randomly, 150 times, each time with a different combination of samples.
Kindly help with this.
Anonymous (https://www.blogger.com/profile/09991990830316556251)

2014-03-26T02:35:26.692-05:00
Hello everyone, how can I get the coefficients for the model that I have generated with this function? The usual summary(model) or coefficients(model) are not giving me the desired results. I want to get the intercept and the predictors' intercepts.
Chisomo (https://www.blogger.com/profile/13564977585650752795)

2014-02-10T09:41:17.464-06:00
works well thanks
Anonymous (https://www.blogger.com/profile/03580260144068423705)

2014-02-01T20:43:04.965-06:00
This post was helpful! I used your function and modified it to control the size of the respective train and test sets. I didn't know about createDataPartition. I'll try to incorporate this into my workflow for some cross validation I need to do.
The modification I made was to add the desired size of the training set, with the complement still going into the test set.

# n is the number of rows to draw for the training set; pass it explicitly,
# because the default only works if a variable named trainSize already exists.
splitDataFrame <- function(dataframe, seed = NULL, n = trainSize) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, n)
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  list(trainset = trainset, testset = testset)
}

Here is an example of calling the function with the desired size for the training set, with 72% of the records going to the training set and 28% to the test set:

dataList <- splitDataFrame(workingData, NULL, round(nrow(workingData) * 0.72))
train <- dataList$trainset
test <- dataList$testset
www.phillipburger.net/wordpress (https://www.blogger.com/profile/03119393541761375560)

2013-11-19T03:20:21.276-06:00
Thanks for Sharing! It's very helpful
vijay varadi (https://www.blogger.com/profile/15457692595297561765)

2013-09-21T14:41:57.044-05:00
Thanks for sharing! That was quite useful! But consider a dataset in which you have subjects divided into 2 groups (patients and control subjects) and you wish to split the main dataset into training and test sets (as you did) WHILE maintaining the proportions of subjects in group1 and group2 in each set at the same ratio as in the complete data set. How would you do that?
Siavash Delfani (https://www.blogger.com/profile/18162195907715668935)

2013-06-26T10:21:37.676-05:00
Hello everyone, could anyone help me with how to calculate the confusion matrix?
Thank you in advance
Anonymous (https://www.blogger.com/profile/06005701183819245739)

2013-03-25T21:09:16.221-05:00
Thank you for this function, it helped me a lot on my assignment. (Also, duly cited :) )
Anonymous (https://www.blogger.com/profile/00003840559730893084)

2013-02-25T08:33:32.666-06:00
Thank you very much! I'm modifying your function so I can input whatever size of training data set I want. Don't worry, I'll cite you in my assignment :-)
Anonymous

2013-01-09T07:00:10.887-06:00
Modify this line:
trainindex <- sample(index, trunc(length(index)/2))
Stephen Turner (https://www.blogger.com/profile/06656711316726116187)

2013-01-09T04:27:59.281-06:00
How did you split it into 1/3 test and 2/3 training?
Anonymous

2012-12-09T16:47:44.831-06:00
Yep, that's exactly what I'm doing with the set seed.
Stephen Turner (https://www.blogger.com/profile/06656711316726116187)

2012-12-09T15:51:02.026-06:00
Thanks for this, I found it very useful. If I'm not mistaken, the seed setting in your function offers something that neither split nor createDataPartition has, the ability to reconstruct your sets later... is this correct?
Orville Jackson (https://www.blogger.com/profile/01185387323964061794)

2012-08-11T04:18:52.653-05:00
Thanks for sharing, I will bookmark and be back again

Testing Training with Live Project (http://www.amitysoft.com/careertraining.html)
ranjini (https://www.blogger.com/profile/05500353013507318574)

2012-01-12T18:38:15.436-06:00
Thanks a lot!
It also works if I want 1/3 in test and 2/3 in training.
Anonymous

2011-02-25T10:35:54.286-06:00
You can use your test set for two different purposes:

1) You use it as a completely independent test set in order to have an estimate of how well your model will behave on completely unseen data.

In this case you should never use your test set to choose your model. That means that you shouldn't repeat an experiment because you are not satisfied with the result on the test set.

2) You use it as a validation set to choose your best model.

In this case you might have more confidence that you are choosing the right model, but the result you get at the end is biased and you can't use it as an estimate of the error of your model on unseen data.

Finally, I also prefer k-fold cross-validation to choose the best algorithms and adjust parameters. With this approach, the result cannot be used as an estimate of the error on unseen data, but the methodology is very good for comparing the performance of the different algorithms and parameters.
Orlando Anunciação (http://kdbio.inesc-id.pt/~orlando)

2011-02-25T08:40:52.696-06:00
You may also want to take a look at Friedman's stochastic gradient boosting (a package is available in R). It's been shown to outperform Random Forest.

I will also agree with another comment that a test set is not necessarily needed.
If you do the cross-validation, it gives quite an accurate prediction of the error.
ZC (https://www.blogger.com/profile/13200285398180829713)

2011-02-25T08:01:03.200-06:00
I agree with Jan.
Max Kuhn's caret package has many more desirable features to compose training and test sets.

+ he replies promptly if you have questions :-)
Unknown (https://www.blogger.com/profile/01210015562361794495)

2011-02-25T05:46:52.719-06:00
Hi
I don't know why you need to split data into training and test sets. random Forest produces its own 'out of bag' error estimate by growing trees on two thirds and testing on one third of the data (or thereabouts) whilst running. So you should use the whole dataset.

For the other methods there are R packages that handle cross validation and feature selection automatically, like 'ipred' and 'MLInterfaces'. Cross validation is far superior to a single train and test set. Indeed you should only choose a single train and test set if they were drawn separately, i.e. a model based on patients from a clinical trial is tested on patients from a new hospital.
Stephen (http://rforcancer.drupalgardens.com/)

2011-02-25T01:51:26.408-06:00
Have a look at package caret. It will do the job (including the selection of training/test sets).
Anonymous

2011-02-25T01:49:44.372-06:00
Take a look at the caret package. It does this and a lot more.
Unknown (https://www.blogger.com/profile/06253936398902361227)
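
Several questions in this thread ask for the same few things: a training fraction other than one half, a split that keeps the patient/control proportions, and many repeated random splits. The following is only a minimal base R sketch of one way to do that, not the original post's function and not caret's createDataPartition. The names split_stratified, group_col, train_frac and the subjects example data are hypothetical and chosen for illustration.

```r
# Hypothetical helper (not from the original post): split a data frame into
# training and test sets while keeping the proportions of the groups found
# in the column named by `group_col`.
split_stratified <- function(dataframe, group_col, train_frac = 0.7, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Row indices, grouped by the values of the stratification column
  rows_by_group <- split(seq_len(nrow(dataframe)), dataframe[[group_col]])
  # Within each group, draw round(train_frac * group size) rows for training
  trainindex <- unlist(lapply(rows_by_group, function(rows) {
    rows[sample.int(length(rows), round(length(rows) * train_frac))]
  }), use.names = FALSE)
  # Shuffle the remaining rows so the test set does not keep the original order
  testindex <- setdiff(seq_len(nrow(dataframe)), trainindex)
  testindex <- testindex[sample.int(length(testindex))]
  list(trainset = dataframe[trainindex, , drop = FALSE],
       testset  = dataframe[testindex, , drop = FALSE])
}

# Made-up example data: 100 subjects, 40 patients and 60 controls
subjects <- data.frame(id = 1:100,
                       group = rep(c("patient", "control"), times = c(40, 60)))

# 150 independent random 70/30 splits; setting the seed once up front
# makes the whole sequence of splits reproducible
set.seed(42)
splits <- replicate(150, split_stratified(subjects, "group", train_frac = 0.7),
                    simplify = FALSE)

# Group counts in the first training set
table(splits[[1]]$trainset$group)
```

Passing train_frac = 2/3 gives the 2/3 training and 1/3 test split asked about above. The repeated splits are drawn independently, so they are different with overwhelming probability but are not formally guaranteed to be distinct.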