Using ddply to select the first record of every group

I had a very long file of monetary transactions (about 207,000 rows) with about two handfuls of columns describing each transaction (including date).  The task I needed to perform on this file was to select the value from one of the categorical descriptor columns (called “appeal”) associated with the first transaction found for every ID in the file.  Luckily for me, the file was organized so that the older transactions came before the newer transactions.  The coding solution I used was pretty simple, but took maybe 5 – 10 minutes to complete.

Assuming the data frame is called ‘trans.file’, the numerical ID is called, ‘Num.ID’, and the categorical descriptor column of interest is called ‘appeal’, here’s the simple code:

first.appeal.by.id = ddply(trans.file, “Num.ID”, function (x) as.character(x$appeal[1]))

I do like simple, but I also like fast.  Is there a less computationally expensive way of doing this?  Please tell me if so :)

About these ads

11 thoughts on “Using ddply to select the first record of every group

  1. data.table package, something like

    dt <- data.table(trans.file, key=c('Num.ID'))
    dt[,.SD[1,], by=keys(dt)].

    On my iPad or I'd double check to get it right. Data.table is much faster than plyr when the number of groups get really large (like going by ID).

  2. I’ll second Jason in looking at data.table.

    Assuming (perhaps wrongly) that Num.ID are going to be increasing:

    DT <- data.table(trans.file, key=c('Num.ID'))
    DT[,min(Num.ID),by=appeal]

    Have a look at datatable-intro section 2. Fast Grouping

  3. have you tried something like:
    dt.sorted <- dt[order(dt$Num.ID,date),]
    dt.first <- dt.sorted[c(TRUE,dt.sorted$Num.ID[1:(nrow(df)-1)] != df.sorted[2:nrow(df),1]),]
    essentially, it sorts over Num.ID and date and takes the first one. sort and vector comparison can occasionally be fast and are relatively optimized.

    • ah, since the data is already date sorted, we can sort only on Num.ID and use the stability of radix sort to make sure we get the first date for each ID.
      This method depends on ID being a small integer.

      The R code would be:
      dt.sorted <- dt[sort.list(as.integer(dt$Num.ID),method="radix"),]
      instead of:
      dt.sorted <- dt[order(dt$Num.ID,date),]
      which is possibly quicker.

  4. If Im understanding you correctly…I would have thought something like this would be direct to underlying C code. And shouldn’t over trouble a small PC or average laptop.

    # first occurrences
    first_occs= with(trans.file, match(unique(Num.ID), Num.ID))
    trans.file[first_occs,appeal.col]

  5. Did you try something like this:

    #Assuming the trans.file is already sorted, as you indicated above.
    first <- trans.file[!duplicated(trans.file$Num.ID), 'appeal']

    I use the duplicated function all the time, but never compared its speed with anything else.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s