Predictive modelling fun with the caret package

I’m back!  6 months after my second child was born, I’ve finally made it back to my blog with something fun to write about.  I recently read through the excellent Machine Learning with R ebook and was impressed by the caret package and how easy it made it seem to do predictive modelling that was a little more than just the basics.

With that in mind, I went searching through the UCI machine learning repository and found a dataset about leaves that looked promising for a classification problem.  The dataset comprises of leaves from almost 40 different plant species, and has 14 numerical attributes describing each leaf.  It comes with a pdf file that shows pretty pictures of each leaf for the botanists out there, and some very mathematics heavy descriptions of each of the attributes which I couldn’t even hope to understand with my lack of education on the matter!

Seeing that it didn’t look overly complex to process, I decided to load it in and set up the overall training parameters:

leaf = read.csv("leaf.csv", colClasses = c(Class = "factor"))
ctrl = trainControl(method="repeatedcv", number=10, repeats=5, selectionFunction = "oneSE")
in_train = createDataPartition(leaf$Class, p=.75, list=FALSE)

First, I made sure that the Class variable remained a factor, even though it’s coded with integers in the incoming data.  This way once I split the data into a test set, I won’t get any complaints about missing outcome values if the sampling doesn’t pick up one of those values!

You’ll notice I’ve tried repeated cross validation here, with 5 repeats, and have used the ‘oneSE’ selection function.  This ensures that for whichever model I choose, the model gets tested on 10 different parts of my data, repeated 5 times over, and then I’ve chosen the ‘oneSE’ function to hopefully select a model that is not the most complex.  Finally, I use createDataPartition to create a a training sample of 75% of the data.

trf = train(Class ~ Eccentricity + Aspect_Ratio + Elongation +
              Solidity + Stoch_Convexity + Isoperimetric + 
              Max_Ind_Depth + Lobedness + Avg_Intensity + 
              Avg_Contrast + Smoothness + Third_Moment + 
              Uniformity + Entropy, data=leaf, method="rf", metric="Kappa",
            trControl=ctrl, subset = in_train)

tgbm = train(Class ~ Eccentricity + Aspect_Ratio + Elongation +
              Solidity + Stoch_Convexity + Isoperimetric + 
              Max_Ind_Depth + Lobedness + Avg_Intensity + 
              Avg_Contrast + Smoothness + Third_Moment + 
              Uniformity + Entropy, data=leaf, method="gbm", metric="Kappa",
            trControl=ctrl, subset = in_train, verbose=FALSE)

I’ve chosen to use a random forest and a generalized boosted model to try to model leaf class.  Notice how I’ve referred to the training parameters in the trControl argument, and have selected the training subset by referring to in_train.  Also, the ‘verbose=FALSE’ argument in the gbm model is important!!  Let’s look at results:

For the trf model:

Random Forest
340 samples
15 predictors
30 classes: '1', '10', '11', '12', '13', '14', '15', '2', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '4', '5', '6', '7', '8', '9'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)

Summary of sample sizes: 228, 231, 233, 233, 232, 229, ...

Resampling results across tuning parameters:

mtry Accuracy Kappa Accuracy SD Kappa SD
2 0.7341953 0.7230754 0.07930583 0.08252806
8 0.7513803 0.7409347 0.08873493 0.09237854
14 0.7481404 0.7375215 0.08438226 0.08786254

Kappa was used to select the optimal model using the one
SE rule.
The final value used for the model was mtry = 8.

So as you can see it’s selected a random forest model that tries 8 random predictors at each split, and it seems to be doing pretty well with a Kappa of .74. Now let’s move on to the next results:

For the tgbm model:

Stochastic Gradient Boosting 

340 samples
 15 predictors
 30 classes: '1', '10', '11', '12', '13', '14', '15', '2', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 226, 231, 229, 231, 228, 231, ... 

Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD  
  1                   50      0.6550713  0.6406862  0.07735511   0.08017461
  1                  100      0.6779153  0.6646128  0.07461615   0.07739666
  1                  150      0.6799633  0.6667613  0.08291638   0.08592416
  2                   50      0.7000791  0.6876577  0.08467911   0.08771728
  2                  100      0.6984858  0.6860858  0.08711523   0.09041647
  2                  150      0.6886874  0.6759011  0.09157694   0.09494201
  3                   50      0.6838721  0.6708396  0.08850382   0.09166051
  3                  100      0.6992044  0.6868055  0.08423577   0.08714577
  3                  150      0.6976292  0.6851841  0.08414035   0.08693979

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Kappa was used to select the optimal model using  the one SE rule.
The final values used for the model were n.trees = 50, interaction.depth = 2 and shrinkage = 0.1.

Here we see that it has chosen a gbm model with an interaction depth of 2 and 50 trees. This has a kappa of .69, which appears somewhat worse than the random forest model. Let’s do a direct comparison:

resampls = resamples(list(RF = trf,
                          GBM = tgbm))

difValues = diff(resampls)

summary.diff.resamples(object = difValues)

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

    RF        GBM    
RF            0.05989
GBM 0.0003241        

    RF        GBM    
RF            0.06229
GBM 0.0003208  

Sure enough, the difference is statistically significant. The GBM value ends up being less accurate than the random forest model. Now let’s go to the testing stage! You’ll notice I’ve now stuck with the random forest model.

test = leaf[-in_train,]
test$pred.leaf.rf = predict(trf, test, "raw")
confusionMatrix(test$pred.leaf.rf, test$Class)

Overall Statistics
               Accuracy : 0.7381         
                 95% CI : (0.6307, 0.828)
    No Information Rate : 0.0833         
    P-Value [Acc > NIR] : < 2.2e-16      
                  Kappa : 0.7277         
 Mcnemar's Test P-Value : NA      

Please excuse the ellipses above as the confusionMatrix command generates voluminous output! Anyway, sure enough the Kappa statistic was not that far off in the test sample as it was from the training sample (recall it was .74). Also of interest to me (perhaps it’s boring to you!) is the No Information Rate. Allow me to explain: If I take all of the known classes in the testing sample, and just randomly guess which records to which they belong, I will probably get some right. And this is exactly what the No Information Rate is; the proportion of classes that you would guess right if you randomly allocated them. Obviously an accuracy of .74 and a Kappa of .73 are way higher than the No Information Rate, and so I’m happy that the model is doing more than just making lucky guesses!

Finally, caret has a function to calculate variable importance so that you can see which variables were the most informative in making distinctions between classes.  The results for the random forest model follow:

varImp(trf, scale=FALSE)
rf variable importance

Solidity         31.818
Aspect_Ratio     26.497
Eccentricity     23.300
Elongation       23.231
Isoperimetric    20.001
Entropy          18.064
Lobedness        15.608
Max_Ind_Depth    14.828
Uniformity       14.092
Third_Moment     13.148
Stoch_Convexity  12.810
Avg_Intensity    12.438
Smoothness       10.576
Avg_Contrast      9.481

As I have very little clue what these variables mean from their descriptions, someone much wiser than me in all things botanical would have to chime in and educate me.

Well, that was good fun! If you have any ideas to keep the good times rolling and get even better results, please chime in by commenting 🙂


11 thoughts on “Predictive modelling fun with the caret package

  1. There’s a bunch of cool plots you can make too! Try: `xyplot(resampls); dotplot(resampls); densityplot(resampls); bwplot(resampls); splom(resampls); parallelplot(resampls);`

    Read more at `?xyplot.resamples`

  2. Hi there, thanks for sharing this with us! I am trying to play with the code.

    There’s a typo here and I am wondering what’s actual code?
    leaf = read.csv(“leaf.csvlClasses = c(Class = “factor”))

      • my browser had the same problem. When the page first loaded, the line said
        leaf = read.csv(“leaf.csv”, colClasses = c(Class = “factor”))
        Then within a few seconds, that line collapsed to:
        leaf = read.csv(“leaf.csvlClasses = c(Class = “factor”))
        Strange stuff.

      • I get error in
        leaf = read.csv(“leaf.csv”,colClasses = c(Class = “factor”))

        Warning message:
        In read.table(file = file, header = header, sep = sep, quote = quote, :
        not all columns named in ‘colClasses’ exist

      • My apologies! I forgot that the csv file didn’t have a header row!! I named each of the columns using the description on the UCI website, but they are also in the PDF file provided.

  3. Pingback: News Fusion | Data Analytics & R

  4. The varImp provides the importance by the variable from 0-100. It does so by randomly swapping data in between records in the training set. The more errors this causes, the more important the variable is. If all tests introduce incorrect predictions, then you reach 100. If no test introduce an error in the predicted value, then you get 0. It is a kind of principal component analysis based on random selection.

    • Hi Jonas,

      Thanks for the glimps of varImp. But what does the varImp function in the caret package actually compute for a glm object

      Some more explanation would be appreciated.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s