The apply function in R is a huge work-horse for me across many projects. My usage of it is pretty stereotypical. Usually, I use it to make aggregations of a targeted group of columns for every row in a dataframe. Those aggregations could be counts, sum totals, or possibly a binary column that flags some condition based on a count or sum total from the entire group of columns.
In my latest project, I found a different usage for the apply function. I had a dataframe detailing the donations made each year, by individual people, for the last 7 years. I also had a column specifying the year when they converted from being a ‘regular’ donor to donating through a specific program. The idea was to create a column specifying the last yearly amount donated before the year they had converted. In other words, for each row, information had to be extracted on the year they converted, when they had donated before that year, and how much was that amount.
Here’s the R code I used with some sample data to help recreate what I did:
test = matrix(c(31, NA, 33, 37, NA, 2011, NA, 33, NA, 31, 37, 2012, 37, 33, 31, NA, NA, 2009), nrow=3, ncol=6, byrow=TRUE, dimnames=list(c(1,2,3),c(seq(2008,2012),'year'))) value.b4.year = apply(test[,1:6], 1, function (x) ifelse(any(!is.na(x[1:5])), x[max(which(!is.na(x[1:which(seq(2008,2012) == x) -1])))], NA))
First, I set up the sample data. It’s obviously pretty simple, and the dataframe I used in reality had some rows where there were all NA values throughout the first 5 columns.
It’s in the apply function where the real magic begins. First, we load up all relevant columns into the apply functions for each row (test[,1:6]). You can specify/insert whichever columns you need from your dataframe, so long as you use c() in the indexing brackets when you’re referencing your dataframe. Once you load them up, no matter what the column indices/numbers, they become columns 1 through the total number of columns you loaded up as far as the rest of the apply function is concerned. (Say you load up 15 different columns indices, all spread out in terms of their relative position numbers. Regardless of the position each index represents, those columns become 1:15 when referring to them in the subsequent function call).
After the function (x) part of the function, I put a condition stating the the following value assignment should only happen if there is any data in the first 5 elements of the row, otherwise an NA gets assigned. Notice that you don’t have to operate on, or select, every single column that you’ve loaded into the apply statement; very important!
The next line is dizzifyingly complex, so let me break it down for you:
x[1:which(seq(2008,2012) == x) -1]
In this part of the line, I’m calling up all elements of the row, from the first element, until the element before the one corresponding to the year listed in the sixth element (i.e. 1:end year). We start with seq(2008,2012) == x, which marks the specific year that corresponds with the value of the sixth element. Feeding that statement into the which function gives you the position in the row represented by x of the year listed in the sixth element. Subtracting by 1 sets the end index as that column just before the one corresponding with the year.
To show you what the above gets you on line 1 of the sample data:
test[1,1:which(seq(2008,2012) == test[1,6]) -1] 2008 2009 2010 31 NA 33
Next, by nesting the above statement in the which and !is.na function calls, we eliminate those elements in the row that have NA values in them and find the positions where there is some value.
which(!is.na(test[1,1:which(seq(2008,2012) == test[1,6]) -1])) 2008 2010 1 3
Now that we have the positions of all values in the row that occurred before the year listed, we just need to select the highest value, by nesting the above function call inside the a call to the max function:
max(which(!is.na(test[1,1:which(seq(2008,2012) == test[1,6]) -1]))) 3
Right, so the the last value before the year in column 6 for row 1 is in position 3! That finally leads us to the syntax I used in the apply statement (somewhat different here, because we’re looking at one row of the sample data):
test[1,max(which(!is.na(test[1,1:which(seq(2008,2012) == test[1,6]) -1])))] 33
So, for row 1, the last donation amount before conversion was 33. Let’s look at the results of the apply statement from above:
value.b4.year 1 2 3 33 31 37
Yup, for each of these rows, the values saved in value.b4.year are the last values donated before the year specified in the sixth column.
So, the lesson here is that when you are using apply to perform an operation on each row of a dataframe, the value of any one column from that row is easily accessible by referring to the position of that value relative the entire group of columns that you have passed to the apply function. Hopefully this will help you in the future as I know it will help me :)