sapply is my new friend!

I’ve written previously about how the apply function is a major workhorse in many of my work projects. What I didn’t know is how handy the sapply function can be!

There are a couple of cases so far where I’ve found that sapply really comes in handy for me:

1) If I want to quickly see some descriptive stats for multiple columns in my dataframe. For example,

sapply(mydf[,10:20], median, na.rm=TRUE)

would show me the medians of columns 10 through 20, displaying the column names above each median value.
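Here is a minimal sketch of that call on a toy data frame. The data and the column names a, b and c are made up for illustration, and the column range is 1:3 rather than 10:20 because the toy frame only has three columns.

```r
# Toy data frame standing in for mydf (hypothetical data)
mydf <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(10, NA, 30, 50),
  c = c(5, 5, 5, NA)
)

# Median of each column, ignoring missing values;
# the result is a numeric vector named after the columns
meds <- sapply(mydf[, 1:3], median, na.rm = TRUE)
print(meds)
```

Extra arguments such as na.rm are passed through sapply to median, so the same pattern should work with mean, sd, or any other function that takes a vector.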

2) If I want to apply the same function to multiple vectors in my dataframe, modifying them in place. I oftentimes have count variables that have NA values in place of zeros. I made a “zerofy” function that replaces the NA values in a vector with zeros. So, if I want to use my function to modify these count columns, I can do the following:

mydf[,30:40] = sapply(mydf[,30:40], zerofy)

Which then replaces the original data in columns 30 through 40 with the modified data! Handy!
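The zerofy function itself isn't shown here, so the version below is only a guess at what such a function might look like, with a made-up data frame called counts to try it on.

```r
# Hypothetical zerofy: one plausible definition, not necessarily the original
zerofy <- function(x) {
  x[is.na(x)] <- 0  # swap every NA for a zero
  x
}

# Made-up count data with NAs where zeros belong
counts <- data.frame(
  visits = c(3, NA, 1),
  calls  = c(NA, NA, 2)
)

# Overwrite both columns with their zerofied versions
counts[, 1:2] <- sapply(counts[, 1:2], zerofy)
print(counts)
```

One thing to watch: sapply simplifies its result to a matrix, so if the selected columns have different types they get coerced to a common one. Using lapply instead, as in counts[1:2] <- lapply(counts[1:2], zerofy), keeps each column's own type.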


9 thoughts on “sapply is my new friend!”

  1. Number 2) can also be rewritten to
    mydf[,30:40][is.na(mydf[,30:40])] <- 0 to change the values in place (note is.na() rather than == NA, since comparing anything to NA just gives NA). Should be a bit faster if you have thousands of columns (haven't tested it though…).


  2. I use sapply frequently (and mapply, still don’t know the difference other than the order of the first two arguments) but I haven’t found any more improvement in computing speed over regular ‘for’ loops that achieve the same purpose. Have you tried timing sapply and a loop that does the same thing?

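      • “for-next” sometimes is a bit faster when you preallocate an empty result vector which you then fill:

        > x <- rnorm(1000000)  # assumed: the original definition of x was lost
        > system.time(res <- sapply(x, function(v) sqrt(abs(v))))  # assumed form of the sapply call
        > res <- numeric(1000000)  # preallocate the result vector
        > system.time(for (i in 1:1000000) res[i] <- sqrt(abs(x[i])))
        > system.time(res <- sqrt(abs(x)))
        user system elapsed
        0.05 0.02 0.05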

      • Okay, I’ll give you that one. Code simplicity is a valid point. To be honest though, I find for loops more intuitive to read than sapply, although they are longer.

  3. If you like sapply you’re gonna LOVE plyr. If you’ll indulge a bit of shameless self-promotion, here are some slides from a tutorial I gave last summer…

    Also, fair warning, plyr can be substantially slower than for loops or base-apply functions, but my experience has been that it doesn’t make a big difference unless you’re working on seriously big data sets, and that the improvements in code clarity and flexibility are worth it.
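Two questions from the comments, how mapply differs from sapply and whether sapply beats a loop, can be checked with a small sketch like the one below. The test function sqrt(abs(v)) and the vector size are assumptions chosen for illustration, and the timings will vary by machine, so none are shown.

```r
# sapply walks over ONE vector or list; the data comes first
squares <- sapply(1:3, function(i) i^2)

# mapply walks over SEVERAL vectors in parallel; the function comes first
sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# Rough timing comparison on an arbitrary numeric vector
x <- rnorm(1e5)

t_sapply <- system.time(r1 <- sapply(x, function(v) sqrt(abs(v))))

r2 <- numeric(length(x))  # preallocate before the loop
t_loop <- system.time(for (i in seq_along(x)) r2[i] <- sqrt(abs(x[i])))

t_vec <- system.time(r3 <- sqrt(abs(x)))  # fully vectorized

# Elapsed seconds for each approach
print(c(sapply = t_sapply[["elapsed"]],
        loop = t_loop[["elapsed"]],
        vectorized = t_vec[["elapsed"]]))
```

All three approaches produce the same values, so the choice between them comes down to speed and readability.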
