Sample Rows from a Data Frame that Exclude the ID Values from Another Sample in R

In order to do some modeling, I needed to make a training sample and a test sample from a larger data frame.  Making the training sample was easy enough (see my earlier post), but I was going crazy trying to figure out how to make a second sample that excluded the rows I had already sampled in the first sample.

After trying out some options myself, looking extensively on the net, and asking for help on the r-help forum, I came up with the following function that finally does what I need it to do:

 

To summarize the function: you enter the big data frame first (here termed “main.df”), then your first sample data frame, which holds the ID values that you want to exclude (here termed “sample1.df”), then your sample size, then the names of the ID variables in both data frames, enclosed in quotes.
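A minimal sketch of a function matching that description is shown here. It is not the original code: the name sample.exclude, the argument names n, main.ID and sample1.ID, and the toy data are all assumptions made for illustration.

```r
# Hypothetical sketch: draw n rows from main.df whose ID value does not
# appear in sample1.df. All names here are illustrative.
sample.exclude <- function(main.df, sample1.df, n, main.ID, sample1.ID) {
  # Keep only the rows whose ID is absent from the first sample
  keep <- !(main.df[[main.ID]] %in% sample1.df[[sample1.ID]])
  remaining <- main.df[keep, , drop = FALSE]
  if (n > nrow(remaining)) {
    stop("sample size exceeds the number of rows left after exclusion")
  }
  # sample.int() draws row positions without replacement
  remaining[sample.int(nrow(remaining), n), , drop = FALSE]
}

# Toy data: 100 rows, a 60-row training sample, then a 30-row test sample
set.seed(1)
main.df  <- data.frame(id = 1:100, x = rnorm(100))
train.df <- main.df[sample.int(nrow(main.df), 60), ]
test.df  <- sample.exclude(main.df, train.df, 30, "id", "id")
```

Because the exclusion is done by ID value rather than by row position, this also works when the first sample has been reordered or carries a differently named ID column.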

Functions like this certainly make my working life with R easier, since I no longer have to retype that syntax every time I need this kind of task done.


8 thoughts on “Sample Rows from a Data Frame that Exclude the ID Values from Another Sample in R”

  1. This method seems like overkill. Typically, one would just create a new vector with a random number associated with each row and then use it to divide up the data frame as follows:
         rand <- runif(nrow(df))
         test <- df[which(rand < .5), ]
         train <- df[which(rand > .5), ]
     I suppose you’d have to use your version if your id variable was repeated. But in most cases, when that’s not the case, this should work fine.

  2. Brad, I like your solution. Syntactically, it’s a LOT more elegant looking than the one I was able to find, which is really nice. I suppose the only possible difference for me is in scalability: when working with a dataset with >20,000 or 100,000 rows, is it going to make no significant difference in processing time, or make me wait longer once I implement it? Bear in mind that the main data frame and the first sample both already have an ID variable in them, so my method, as syntactically ugly/yucky as it is, doesn’t involve adding new vectors. Thanks very much for commenting! I really appreciate the input.

  3. I suppose the question of which method to use is really determined by whether you’re limited by processor speed or memory. The %in% part of your code should be fairly processor intensive, but I suppose it’s true that it won’t take up more memory. I work with fairly wide data sets, so removing another column really doesn’t make much difference to me in terms of memory usage. The one point I’d add though is that if you are worried about memory usage, you don’t really need to create a new data frame at all. Any function in R will accept a dataframe with a which() argument inside it (like I posted above), so if you’re memory-limited, you can just specify your training and test sets with a which() function on the fly.

  4. OOPS, posted too early, this is what I meant to post: I suppose the question of which method to use is really determined by whether you’re limited by processor speed or memory. The %in% part of your code should be fairly processor intensive, but I suppose it’s true that not adding another vector will save a bit of memory. I work with fairly wide data sets, so adding another column for a random number really doesn’t make much difference to me in terms of memory usage. The one point I’d add though is that if you are worried about memory usage, you don’t really need to create a new data frame at all, as we’ve both done in our examples. Any function in R will accept a dataframe with a which() argument inside it (like I posted above), so if you’re memory-limited, you can just specify your training and test sets with a which() function on the fly, i.e. rather than doing lm(y~.,dat=train.df) you could do lm(y~.,dat=df[which(rand<.5),])

  5. How about (sorry if the formatting is bad… I don’t know how to pretty up in this editor):
         get.sample <- function(data, size=NULL, exclude=NULL) {
           inds <- if(is.null(exclude)) 1:nrow(data) else setdiff(1:nrow(data), exclude)
           if(!is.null(size)) inds <- sample(inds, size)
           structure(data[inds,], rows=inds)
         }
     which can be used for both your training and test samples. It does not depend on the variables on your data frame.
         # Use like (to get training of 100 rows and testing of 200 rows):
         d.n <- 1000
         data <- data.frame(x=rnorm(d.n), y=rnorm(d.n))
         training <- get.sample(data, 100)
         testing <- get.sample(data, 200, attr(training, 'rows'))
         # if you wanted the rest of the rows
         get.sample(data, , c(attr(training, 'rows'), attr(testing, 'rows')))

  6. … Now I know I apologised for the formatting… but that is ridiculous. You need to get a better editor. Using = instead of the assignment arrow:
         get.sample = function(data, size=NULL, exclude=NULL) {
           inds = if(is.null(exclude)) 1:nrow(data) else setdiff(1:nrow(data), exclude)
           if(!is.null(size)) inds = sample(inds, size)
           structure(data[inds,], rows=inds)
         }
         d.n = 1000
         data = data.frame(x=rnorm(d.n), y=rnorm(d.n))
         training = get.sample(data, 100)
         testing = get.sample(data, 200, attr(training, 'rows'))
         rest.of.rows = get.sample(data, , c(attr(training, 'rows'), attr(testing, 'rows')))

  7. I’m not sure if it would work the same in comments, but try putting your code in Gist (https://gist.github.com/) and then embedding it in the comment if you want the formatting to look nicer… Posterous’s default formatting isn’t the greatest…
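For comparison, the random-number split and the on-the-fly subsetting discussed in the comments can be sketched as follows. The toy data frame, the seed and the 0.5 cut-off are assumptions for illustration, and lm()'s own subset argument is used here in place of indexing the data frame with which().

```r
# Toy data, made up for illustration
set.seed(42)
df <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))

# One uniform draw per row decides which side of the split the row lands on
rand  <- runif(nrow(df))
train <- df[rand >= 0.5, ]
test  <- df[rand <  0.5, ]

# Fit on the training rows without storing a separate training data frame
fit <- lm(y ~ ., data = df, subset = rand >= 0.5)
```

Note that this approach gives two non-overlapping samples whose sizes are only approximately equal, whereas sampling with an explicit size returns exactly the requested number of rows.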
