KDD Cup 2015: The story of how I built hundreds of predictive models… and got so close, yet so far away from 1st place!

The challenge from the KDD Cup this year was to use their data relating to student enrollment in online MOOCs to predict who would drop out vs who would stay.

The short story is that using H2O and a lot of my free time, I trained several hundred GBM models looking for the final one which eventually got me an AUC score of 0.88127 on the KDD Cup leaderboard and at the time of this writing landed me in 120th place. My score is 2.6% away from 1st place, but there are 119 people above me!

Here are the main characters of this story:

mariadb
MySQL Workbench
R
H2O

It started with my obsessive drive to find an analytics project to work on. I happened upon the KDD Cup 2015 competition and decided to give it a go. It had the characteristics of a project that I wanted to get into:

1) I could use it to practice my SQL skills
2) The data set was of a moderate size (training table was 120,542 records, log info table was 8,151,053 records!)
3) It looked like it would require some feature engineering
4) I like predictive modeling competitions 🙂

Once I had loaded the data into a mariadb database, I had to decide how I would use the info in each table. Here was my thought process for each table:

enrollment_train / enrollment_test
Columns: enrollment_id, username, course_id

Simply put, from this table I extracted the number of courses each student (username) was enrolled in, and also the number of students enrolled in each course (course_id).
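
The real work here happened in SQL, but to make the idea concrete, here is a rough R equivalent of those two counts (the file name is just a placeholder):

enroll = read.csv("enrollment_train.csv")   # columns: enrollment_id, username, course_id

# number of courses each student is enrolled in
courses.per.student = as.data.frame(table(enroll$username))
names(courses.per.student) = c("username", "num_courses")

# number of students enrolled in each course
students.per.course = as.data.frame(table(enroll$course_id))
names(students.per.course) = c("course_id", "num_students")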

log_train / log_test
Columns: enrollment_id, tstamp, source, logged_event, object

There were a few items of information that I decided to extract from this table:

1) Number of times each particular event was logged for every enrollment_id
2) Average timestamp for each event for each enrollment_id
3) Min and Max timestamp for each event for each enrollment_id
4) Total time elapsed from the first to the last instance of each event for each enrollment_id
5) Overall average timestamp for each enrollment_id
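
Again, the actual extraction happened in SQL views, but a rough R sketch of these features would look something like the following (file and column names are assumed from the description above):

logdat = read.csv("log_train.csv")                     # enrollment_id, tstamp, source, logged_event, object
logdat$tstamp = as.numeric(as.POSIXct(logdat$tstamp))  # convert to unix seconds

# 1) count of each event type per enrollment
event.counts = as.data.frame(table(logdat$enrollment_id, logdat$logged_event))
names(event.counts) = c("enrollment_id", "logged_event", "n")

# 2) to 4) avg / min / max timestamp and elapsed time per event per enrollment
# (aggregate returns these as a matrix column named tstamp)
event.times = aggregate(tstamp ~ enrollment_id + logged_event, data = logdat,
                        FUN = function(t) c(avg = mean(t), min = min(t),
                                            max = max(t), etime = max(t) - min(t)))

# 5) overall average timestamp per enrollment
overall.avg = aggregate(tstamp ~ enrollment_id, data = logdat, FUN = mean)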

Contrary to what you might think, the object field does not seem to link up with the object table.

object
Columns: course_id, module_id, category, children, tstart

From this table I extracted a count of course components by course_id and also the number of ‘children’ per course_id. I assume these are relational references, but I’m not sure what these child IDs refer to elsewhere in the data set.

truth_train
Columns: enrollment_id, dropped_out

I didn’t extract anything special out of this table, but used it as the table to which all other SQL views that I had created were linked.

If you’d like to see the SQL code I used to prepare the tables, views, and the final output table I used to train the model, see my github repo for this project.

Import into R and Feature Engineering

Once I had imported the data into R through RODBC, my feature engineering was, as you’ll see in the code, essentially a desperate fishing expedition where I tried a whole lot of stuff. I didn’t even end up using everything that I had engineered through my R code, but as my final model included 35 variables, I wasn’t suffering any severe lack! If you download the KDD Cup 2015 data and have a look around, feel free to let me know if I’ve missed any important variables!
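
For context, the import step amounts to something like the following (the DSN and table names here are placeholders, not the ones from my repo):

library(RODBC)
ch = odbcConnect("kddcup_db")                    # DSN pointing at the mariadb database
mooc = sqlQuery(ch, "SELECT * FROM model_input") # the final output table built from the SQL views
odbcClose(ch)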

H2O, Model Tuning, and Training of The Final Model

This is the part where I managed to train hundreds of models! I don’t think this would have been feasible using plain R on my computer alone (I have 8GB of RAM and an 8-core AMD processor). For these tasks I turned to H2O. For those who don’t know, H2O is a Java-based analytics platform that you can drive from R, and it is frankly very easy and beneficial to set up when all you have at your disposal is one computer. I say beneficial for one reason: my computer chokes when trying to train ensemble models on even moderately sized data sets. Through H2O, I’m able to get it done without watching the RAM meter on my system monitor shoot all the way up to full capacity!! What you’ll notice in my R code is that R interfaces with H2O in such a way that once I passed the dataframe with the training data over to H2O, H2O handled the modeling from there and sent info back to R when available or requested (e.g. while you’re training a model, it gives you a cute text-based progress bar automatically!). More on this soon.
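
If you haven’t used it before, getting H2O going from R is roughly this simple (the memory setting is just an example value; localH2O is the cluster handle that gets passed to as.h2o further down):

library(h2o)
localH2O = h2o.init(nthreads = -1, max_mem_size = "6g")  # start (or connect to) a local H2O cluster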

Before I show some results, I want to talk about my model tuning algorithm. Let’s look at the relevant code, then I’ll break it down verbally.

# Hyperparameter values to try
ntree = seq(100, 500, 100)
balance_class = c(TRUE, FALSE)
learn_rate = seq(.05, .4, .05)

# Growing list of tuning results, one entry per hyperparameter combination
parameters = list(ntree = c(), balance_class = c(), learn_rate = c(),
                  r2 = c(), min.r2 = c(), max.r2 = c(),
                  acc = c(), min.acc = c(), max.acc = c(),
                  AUC = c(), min.AUC = c(), max.AUC = c())
n = 1

mooc.hex = as.h2o(localH2O, mooc[, c("enrollment_id", "dropped_out_factor", x.names)])
for (trees in ntree) {
  for (c in balance_class) {
    for (rate in learn_rate) {
      r2.temp = c(NA, NA, NA)
      acc.temp = c(NA, NA, NA)
      auc.temp = c(NA, NA, NA)
      # train and validate each combination on 3 random 80/20 splits
      for (i in 1:3) {
        mooc.hex.split = h2o.splitFrame(mooc.hex, ratios = .8)
        train.gbm = h2o.gbm(x = x.names, y = "dropped_out_factor",
                            training_frame = mooc.hex.split[[1]],
                            validation_frame = mooc.hex.split[[2]],
                            ntrees = trees, balance_classes = c, learn_rate = rate)
        r2.temp[i] = train.gbm@model$validation_metrics@metrics$r2
        # [4,3] is the "max accuracy" value in the max_criteria_and_metric_scores table
        acc.temp[i] = train.gbm@model$validation_metrics@metrics$max_criteria_and_metric_scores[4,3]
        auc.temp[i] = train.gbm@model$validation_metrics@metrics$AUC
      }
      # record the mean / min / max of the validation stats across the 3 splits
      parameters$ntree[n] = trees
      parameters$balance_class[n] = c
      parameters$learn_rate[n] = rate
      parameters$r2[n] = mean(r2.temp)
      parameters$min.r2[n] = min(r2.temp)
      parameters$max.r2[n] = max(r2.temp)
      parameters$acc[n] = mean(acc.temp)
      parameters$min.acc[n] = min(acc.temp)
      parameters$max.acc[n] = max(acc.temp)
      parameters$AUC[n] = mean(auc.temp)
      parameters$min.AUC[n] = min(auc.temp)
      parameters$max.AUC[n] = max(auc.temp)
      n = n + 1
    }
  }
}

parameters.df = data.frame(parameters)
parameters.df[which.max(parameters.df$AUC), ]  # best combination by mean validation AUC

The model type I decided to use is my usual favourite, gradient boosting machines (h2o.gbm is the function you use to train a GBM through H2O). As such, the three hyperparameters I chose to vary and evaluate in the model tuning process were the number of trees, whether or not to balance the outcome classes through over/undersampling, and the learning rate. As you can see above, I wanted to try out numerous values for each hyperparameter: 5 values for number of trees, 2 values for balance classes, and 8 values for learning rate, totalling 80 possible combinations of the 3 hyperparameter values. Furthermore, I wanted to try out each combination of hyperparameter values on 3 random samples of the training data. So, 3 samples of each of the 80 combinations comes to 240 models trained and validated, with the aim of selecting the one with the best area under the curve (AUC). As you can see, each time I trained a model, I saved and summarised the validation stats in a growing list, which I ultimately converted to a data.frame called parameters.df.

The best hyperparameters, according to these validation stats which I collected, are:

– ntree = 500
– balance_class = FALSE
– learn_rate = .05

You can see a very nice summary of how validation set performance changed depending on the values of all of these parameters in the image below (the FALSE and TRUE over the two facets refer to the balance_class values).

AUC by Tuning Parameters
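
If you want to reproduce a figure along these lines from parameters.df, a quick ggplot2 sketch (not my exact plotting code) would be:

library(ggplot2)
ggplot(parameters.df, aes(x = learn_rate, y = AUC, colour = factor(ntree))) +
  geom_line() + geom_point() +
  facet_wrap(~ balance_class) +
  labs(colour = "ntree", y = "mean validation AUC", title = "AUC by Tuning Parameters")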

Have a look at my validation data model summary output from the H2O package below:

H2OBinomialMetrics: gbm
** Reported on validation data. **

MSE:  0.06046745
R^2:  0.102748
LogLoss:  0.2263847
AUC:  0.7542866
Gini:  0.5085732

Confusion Matrix for F1-optimal threshold:
            dropped out stayed    Error         Rate
dropped out       21051   1306 0.058416  =1306/22357
stayed             1176    576 0.671233   =1176/1752
Totals            22227   1882 0.102949  =2482/24109

Maximum Metrics:
                      metric threshold    value        idx
1                     max f1  0.170555 0.317006 198.000000
2                     max f2  0.079938 0.399238 282.000000
3               max f0point5  0.302693 0.343008 134.000000
4               max accuracy  0.612984 0.929321  48.000000
5              max precision  0.982246 1.000000   0.000000
6           max absolute_MCC  0.170555 0.261609 198.000000
7 max min_per_class_accuracy  0.061056 0.683410 308.000000

The first statistic my eyes were drawn to in this output was R^2. It looks quite low, and I’m not even sure why. That being said, standing in the KDD Cup 2015 competition is measured by AUC, and here you can see that it is .75 on my validation data. Next, have a look at the confusion matrix. You can see in the Error column that the model did quite well at predicting who would drop out (naturally, in my opinion), but did not do so well at figuring out who would stay. The overall error rate on the validation data is 10%, but I’m still not so happy about the high error rate as it pertains to those who stayed in the MOOC.

So this was all well and good (and was what got me my highest score yet according to the KDD Cup leaderboard), but what if I could get better performance with fewer variables? I took a look at my variable importances and decided to see what would happen if I eliminated the variables with the lowest importance scores one by one, until I reached the variable with the 16th lowest importance score. Here’s the code I used:

varimps = data.frame(h2o.varimp(train.gbm))  # variable importances from the tuned model
variable.set = list(nvars = c(), AUC = c(), min.AUC = c(), max.AUC = c())

mooc.hex = as.h2o(localH2O, mooc[, c("enrollment_id", "dropped_out_factor", x.names)])
n = 1
# work up the importance list from the bottom: keep the top i-1 variables (34 down to 19)
for (i in seq(35, 20)) {
  auc.temp = c(NA, NA, NA)
  x.names.new = setdiff(x.names, varimps$variable[i:dim(varimps)[1]])
  # 3 random 80/20 splits per variable subset, using the winning hyperparameters
  for (j in 1:3) {
    mooc.hex.split = h2o.splitFrame(mooc.hex, ratios = .8)
    train.gbm.smaller = h2o.gbm(x = x.names.new, y = "dropped_out_factor",
                                training_frame = mooc.hex.split[[1]],
                                validation_frame = mooc.hex.split[[2]],
                                ntrees = 500, balance_classes = FALSE, learn_rate = .05)
    auc.temp[j] = train.gbm.smaller@model$validation_metrics@metrics$AUC
  }
  variable.set$AUC[n] = mean(auc.temp)
  variable.set$min.AUC[n] = min(auc.temp)
  variable.set$max.AUC[n] = max(auc.temp)
  variable.set$nvars[n] = i - 1
  n = n + 1
}

variable.set.df = data.frame(variable.set)

You can see that it’s a similar algorithm to the one I used for the model tuning. I moved up the variable importance list from the bottom, one variable at a time, progressively eliminating more variables. I trained 3 models for each new number of variables, each on a random sample of the data, and averaged the AUCs from those models (48 models in total). See the following graph for the result:

AUC by num vars

As you can see, even though the variables I eliminated were of the lowest importance, they were still contributing something positive to the model. This goes to show how well GBM performs with variables that could be noisy.

Now let’s look at the more important variables according to H2O:

                           variable relative_importance scaled_importance   percentage
1                 num_logged_events        48481.160156      1.000000e+00 5.552562e-01
2     DAYS_problem_total_etime_unix        11651.416992      2.403288e-01 1.334440e-01
3                      days.in.mooc         6495.756348      1.339852e-01 7.439610e-02
4      DAYS_access_total_etime_unix         3499.054443      7.217349e-02 4.007478e-02
5                         avg_month         3019.399414      6.227985e-02 3.458127e-02
6                           avg_day         1862.299316      3.841285e-02 2.132897e-02
7                    Pct_sequential         1441.578247      2.973481e-02 1.651044e-02
8    DAYS_navigate_total_etime_unix          969.427734      1.999597e-02 1.110289e-02
9                       num_courses          906.499451      1.869797e-02 1.038217e-02
10                      Pct_problem          858.774353      1.771357e-02 9.835569e-03
11                     num_students          615.350403      1.269257e-02 7.047627e-03

Firstly, we see that the number of logged events was the most important variable for predicting drop-out. I guess the more active they are, the less likely they are to drop out. Let’s see a graph:

MOOC dropout by num logged events

Although it’s a little bit messy because I did not bin the num_logged_events variable, we see that this is exactly the case: students who were more active online were less likely to drop out.
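
If you wanted a cleaner version of this plot, one option is to bin the variable first. A quick sketch (assuming a 0/1 dropped_out column in the mooc data frame) might be:

mooc$events.bin = cut(mooc$num_logged_events, breaks = 20)
dropout.by.events = aggregate(dropped_out ~ events.bin, data = mooc, FUN = mean)
barplot(dropout.by.events$dropped_out, names.arg = dropout.by.events$events.bin,
        las = 2, ylab = "Proportion dropped out")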

Next, we see a few variables regarding the days spent doing something. They seem to follow similar patterns, so the image I’ll show you below involves the days.in.mooc variable. This is simply how many days passed from the logging of the first event to the last.

MOOC dropouts by days in mooc

Here we see a very steady decrease in the probability of dropping out: those who spent very little time between their first and last interaction with the MOOC were the most likely to drop out, whereas those who spent more time with it were far less likely to.

Next, let’s look at the avg_month and avg_day variables. These were calculated by taking the average timestamp of all events for each person enrolled in each course and then extracting the month and the day from that timestamp. Essentially, they capture when, on average, a student tended to engage with that course.
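
In R terms, the calculation amounts to something like this (assuming a unix-seconds tstamp column, as in the log sketch earlier):

avg.time = aggregate(tstamp ~ enrollment_id, data = logdat, FUN = mean)
avg.time$avg_stamp = as.POSIXct(avg.time$tstamp, origin = "1970-01-01")
avg.time$avg_month = as.numeric(format(avg.time$avg_stamp, "%m"))
avg.time$avg_day   = as.numeric(format(avg.time$avg_stamp, "%d"))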

MOOC dropout by avg month and day

Interestingly, most months seem to exhibit a downward pattern: if a person tended to have their interactions with the MOOC near the end of the month, they were less likely to drop out, but if their interactions fell near the beginning of the month, they were more likely to drop out. This applied to February, May, June, November, and December. The reverse seems to be true for July and perhaps October, and January might belong with that latter group as well.

The last two plots I’ll show you relate to num_courses and num_students, in other words, how many courses each student is taking and how many students are in each course.

MOOC dropouts by # courses per student

MOOC dropout by course popularity

The interesting result here is that it’s only those students who were super committed (taking more than 20 courses in the period captured by the data) who appeared significantly less likely to drop out than those who were taking fewer courses.

Finally, you can see that as the number of students enrolled in a course went up, the overall drop-out rate decreased. Popular courses retain students!

Conclusion

This was fun! I was amazed by how obsessed I became with this competition. I’m disappointed that I couldn’t think of something to bridge the 2.6% gap between me and first place, but the point of this was to practice, to learn something new, and to have fun. I hope you enjoyed it too!

save.ffdf and load.ffdf: Save and load your big data – quickly and neatly!

I’m very indebted to the ff and ffbase packages in R. Without them, I probably would have had to use some less savoury stats program for the bigger data analysis projects that I do at work.

Since I started using ff and ffbase, I have resorted to saving and loading my ff dataframes using ffsave and ffload. The syntax isn’t so bad, but the process they put your computer through to save and load an ff dataframe is a bit cumbersome. It takes a while to save and load, and ffsave creates (by default) a bunch of randomly named ff files in a temporary directory.

For that reason, I was happy to come across a link to a pdf presentation (sorry, I’ve lost it now) summarizing some cool features of ffbase. I learned that instead of using ffsave and ffload, you can use save.ffdf and load.ffdf, which have very simple syntax:

save.ffdf(ffdfname, dir="/PATH/TO/STORE/FF/FILES")

Use that, and it creates a directory wherein it stores ff files that bear the same names as the columns of your ff dataframe! It also stores an .RData file and an .Rprofile file. Then there is:

load.ffdf(dir="/PATH/TO/STORE/FF/FILES")

As simple as that, you load your files, and you’re done! I think what I like about these functions is that they allow you to easily choose where the ff files are stored, removing the worry about important files being in your temporary directory.
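
A tiny end-to-end example, with a made-up ffdf and a placeholder directory:

library(ffbase)   # loads ff as well
tiny = as.ffdf(data.frame(x = rnorm(100), y = rnorm(100)))
save.ffdf(tiny, dir = "~/ffstore/tiny")   # one ff file per column, plus .RData and .Rprofile
rm(tiny)
load.ffdf(dir = "~/ffstore/tiny")         # 'tiny' is back, still backed by files on disk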

Store your big data!!

Know Your Dataset: Specifying colClasses to load up an ffdf

When I finally figured out how to successfully use the ff package to load data into R, I was apparently working with relatively pain-free data to load up through read.csv.ffdf (see my previous post). Just this past Sunday, I naively followed my own post to load a completely new dataset (over 400,000 rows and about 180 columns) for analysis. Unfortunately for me, the data file was a bit messier, so read.csv.ffdf wasn’t able to finalize the column classes by itself. It would chug along until certain columns in my dataset, which it had at first taken to be one data type, turned out to be a different data type, at which point it would throw an error, basically telling me it didn’t want to adapt to changing assumptions about which data type each column held.

So, I set out to learn how I could use the colClasses argument in the read.csv.ffdf command to manually set the data types for each column.  I adapted the following solution from a stackoverflow thread about specifying colClasses in the regular read.csv function.

First, load up a sample of the big dataset using the read.csv command (The following is obviously non-random. If you can figure out how to read the sample in randomly, I think it would work much better):

headset = read.csv(fname, header = TRUE, nrows = 5000)

The next command generates a list of all the variable names in your dataset, and the classes R was able to derive based on the number of rows you imported:

headclasses = sapply(headset, class)

Now comes the fairly manual part. Look at the list of variables and classes (data types) that you generated, and look for obvious mismatches. Examples could be a numeric variable that got coded as a factor or logical, or a factor that got coded as a numeric. When you find such a mismatch, the following syntax suffices for changing a class one at a time:

headclasses["variable.name"] = "numeric"

Obviously, the “variable.name” should be replaced by the actual variable name you’re reclassifying, and the “numeric” string can also be “factor”, “ordered”, “Date”, “POSIXct” (the last two being date/time data types). Finally, let’s say you want to change every variable that got coded as “logical” into “numeric”. Here’s some syntax you can use:

headclasses[grep("logical", headclasses)] = "numeric"

Once you are certain that all the classes represented in the list you just generated and modified are accurate to the dataset, you can load up the data with confidence, using the headclasses list:

bigdataset = read.csv.ffdf(file="C:/big/data/loc.csv", first.rows=5000, colClasses=headclasses)

This was certainly not easy, but I must say that I seem to be willing to jump through many hoops for R!!

Big data analysis, for free, in R (or “How I learned to load, manipulate, and save data using the ff package”)

Before choosing to support the purchase of Statistica at my workplace, I came across the ff package as an option for working with really big datasets (with special attention paid to ff dataframes, or ffdf). It looked like a good option to use, allowing dataframes with multiple data types and way more rows than if I were loading such a dataset into RAM, as is normal with R. The one big problem I had was that every time I tried to use the ffsave function to save my work from one R session to the next, it told me that it could not find an external zip utility on my Windows machine. I guess because I just had so much else going on, I didn’t have the patience to do the research to find a solution to this problem.

This weekend I finally found some time to revisit this problem, and managed to find a solution! From what I can tell, R expects, in cases like the ffsave function, that you have command-line utilities like a zip utility at the ready and recognizable by R. Although I haven’t tested the ff package on either of my Linux laptops at home, I suspect that R recognizes the utilities that come pre-installed on them. In the Windows case, however, the solution seems to be to install a supplementary group of command-line programs called Rtools. When you visit the page, be sure to download the version of Rtools that corresponds with your R version.

When you go through the installation process, you will see a screen like the one below. Be sure to check the same boxes as in the screenshot so that R knows where the zip utility lives.

Once you have it installed, that’s when the fun finally begins. Like in the smaller data case, I like reading in CSV files. So, ff provides read.csv.ffdf for importing external data into R. Let’s say that you have a data file named bigdata.csv, here would be a command for loading it up:

bigdata = read.csv.ffdf(file="c:/fileloc/bigdata.csv", first.rows=5000, colClasses=NA)

The first part of the command, directing R to your file, should look straightforward. The first.rows argument tells it how big you want the first chunk of data it reads in to be (ff reads parts of your data at a time to save RAM; correct me if I’m wrong). Finally, and importantly, the colClasses=NA argument tells R not to assume the data types of each of your columns from the first chunk alone.

Now that you’ve loaded your big dataset, you can manipulate it at will. If you look at the ff and ffbase documentation, a lot of the standard R functions for working with and summarizing data have been optimized for use with ff dataframes and vectors. The upshot of this is that working with data stored in ffdf format seems to be a pretty similar experience compared to working with normal data frames. Importantly, when you want to subset your data frame to create a test sample, the ffbase package replaces the subset command so that the resultant subset is also an ffdf, and doesn’t take up more of your RAM.

I noticed that you can use the glm() and lm() functions on an ffdf, but I think you have to be careful because they are not optimized for use with ffdfs and therefore will take up the usual amount of memory if you save them to your workspace. So if you build models using these functions, be sure to select a sample from your ffdf that isn’t overly big!
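
Something like the following is what I have in mind: pull a modest random sample into a regular data frame first, then fit the model on that (the column names here are placeholders):

idx = sample(nrow(bigdata), 50000)         # a 50k-row random sample
samp = as.data.frame(bigdata[idx, ])       # only the sample comes into RAM
fit = glm(outcome ~ predictor1 + predictor2, data = samp, family = binomial)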

Next comes the step of saving your work. The syntax is simple enough:

ffsave(bigdata, file="C:/fileloc/Rwork/bigdata")

This saves a .ffData file and a .RData file to the directory of your choice with “bigdata” as the filenames.

Then, when you want to load up your data in a new R session during some later time, you use the simple ffload command:

ffload(file="C:/fileloc/Rwork/bigdata")

It gives you some warning messages, but as far as I can tell they do not get in the way of accessing your data. That covers the basics of working with big data using the ff package. Have fun analyzing your data using less RAM! 🙂

Memory Management in R, and SOAR

The more I’ve worked with my really large data set, the more cumbersome the work has become for my work computer. Keep in mind I’ve got a quad core with 8 gigs of RAM. With growing irritation at how slow my computer becomes at times while working with these data, I took to finding better ways of managing my memory in R.

The best/easiest solution I’ve found so far is in a package called SOAR. To put it simply, it allows you to store specific objects in R (data frames being the most important, for me) as RData files on your hard drive, and gives you the ability to analyze them in R without having them loaded into your RAM. I emphasized the term “analyze” because every time I try to add variables to the data frames that I store, the data frame comes back into RAM and once again slows me down.

An example might suffice:

> r = data.frame(a=rnorm(10,2,.5),b=rnorm(10,3,.5))
> r
          a        b
1  1.914092 3.074571
2  2.694049 3.479486
3  1.684653 3.491395
4  1.318480 3.816738
5  2.025016 3.107468
6  1.851811 3.708318
7  2.767788 2.636712
8  1.952930 3.164896
9  2.658366 3.973425
10 1.809752 2.599830
> library(SOAR)
> Sys.setenv(R_LOCAL_CACHE="testsession")
> ls()
[1] "r"
> Store(r)
> ls()
character(0)
> mean(r[,1])
[1] 2.067694
> r$c = rnorm(10,4,.5)
> ls()
[1] "r"

So, the first thing I did was to make a data frame with some columns, which got stored in my workspace, and thus loaded into RAM. Then, I initialized the SOAR library and set my local cache to “testsession”. The practical implication of that is that a directory gets created within the directory R is currently working out of (in my case, “/home/inkhorn/testsession”), and any objects passed to the Store command get saved as RData files in that directory.

Sure enough, you see my workspace before and after I store the r object. Now you see the object, now you don’t! But then, as I show, even though the object is not in the workspace, you can still analyze it (in my case, calculate a mean from one of the columns). However, as soon as I try to make a new column in the data frame… voila… it’s back in my workspace, and thus in RAM!

So, unless I’m missing something about how the package is used, it doesn’t function exactly as I would like, but it’s still an improvement. Every time I’m done making new columns in the data frame, I just have to pass the object to the Store command, and away to the hard disk it goes, out of my RAM. It’s quite liberating not having a stupendously heavy workspace, as when I’m trying to leave or enter R, it takes forever to save/load the workspace. With the heavy stuff sitting on the hard disk, leaving and entering R go by a lot faster.

Another thing I noticed is that if I keep the GLMs that I’ve generated in my workspace, that seems to take up a lot of RAM as well and slow things down. So, with the main dataframe written to disk and the GLMs kept out of memory, R is flying again!

Guess who wins: apply() versus for loops in R

Yesterday I tried to do some data processing on my really big data set in MS Excel. Wow, did it not like handling all those data!! Every time I tried to click on a different ribbon, the screen didn’t even register that I had clicked on that ribbon. So, I took the hint, and decided to do my data processing in R.

One of the tasks that I needed to do was calculate, for each row of the dataset, a maximum value from monetary values held in 5 different fields. The first thing I noticed was that the regular max() function in R doesn’t quite like it when you try to calculate a maximum from a series of values that are all NA (with na.rm = TRUE it returns -Inf rather than NA). So, I decided to create a “safe” max function.
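
A minimal sketch of what such a function can look like (the idea being simply to return NA, instead of -Inf, when every value is missing):

safe.max = function(x) {
  if (all(is.na(x))) return(NA)   # nothing to compare, so report NA
  max(x, na.rm = TRUE)            # otherwise ignore the NAs and take the max
}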

Finding that it was working, I then constructed a simple for loop to iterate through my ~395,000 rows. As you could imagine, this was taking forever! After much looking around, I realized that the best solution was actually a base function, apply()!!

I constructed my “max” variable with one simple line of code:

big.dataset$max_money = apply(as.matrix(big.dataset[,214:218]), 1, function (x) safe.max(x))

Compared to the for loop, which was taking forever, this method was a breeze! It took less than a minute to get through the whole data set. Moral of the story? When you’re dealing with lots of data, code as efficiently as possible!