I’m back! 6 months after my second child was born, I’ve finally made it back to my blog with something fun to write about. I recently read through the excellent Machine Learning with R ebook and was impressed by the caret package and how easy it made it seem to do predictive modelling that was a little more than just the basics.
With that in mind, I went searching through the UCI machine learning repository and found a dataset about leaves that looked promising for a classification problem. The dataset comprises of leaves from almost 40 different plant species, and has 14 numerical attributes describing each leaf. It comes with a pdf file that shows pretty pictures of each leaf for the botanists out there, and some very mathematics heavy descriptions of each of the attributes which I couldn’t even hope to understand with my lack of education on the matter!
Seeing that it didn’t look overly complex to process, I decided to load it in and set up the overall training parameters:
library(caret) leaf = read.csv("leaf.csv", colClasses = c(Class = "factor")) ctrl = trainControl(method="repeatedcv", number=10, repeats=5, selectionFunction = "oneSE") in_train = createDataPartition(leaf$Class, p=.75, list=FALSE)
First, I made sure that the Class variable remained a factor, even though it’s coded with integers in the incoming data. This way once I split the data into a test set, I won’t get any complaints about missing outcome values if the sampling doesn’t pick up one of those values!
You’ll notice I’ve tried repeated cross validation here, with 5 repeats, and have used the ‘oneSE’ selection function. This ensures that for whichever model I choose, the model gets tested on 10 different parts of my data, repeated 5 times over, and then I’ve chosen the ‘oneSE’ function to hopefully select a model that is not the most complex. Finally, I use createDataPartition to create a a training sample of 75% of the data.
trf = train(Class ~ Eccentricity + Aspect_Ratio + Elongation + Solidity + Stoch_Convexity + Isoperimetric + Max_Ind_Depth + Lobedness + Avg_Intensity + Avg_Contrast + Smoothness + Third_Moment + Uniformity + Entropy, data=leaf, method="rf", metric="Kappa", trControl=ctrl, subset = in_train) tgbm = train(Class ~ Eccentricity + Aspect_Ratio + Elongation + Solidity + Stoch_Convexity + Isoperimetric + Max_Ind_Depth + Lobedness + Avg_Intensity + Avg_Contrast + Smoothness + Third_Moment + Uniformity + Entropy, data=leaf, method="gbm", metric="Kappa", trControl=ctrl, subset = in_train, verbose=FALSE)
I’ve chosen to use a random forest and a generalized boosted model to try to model leaf class. Notice how I’ve referred to the training parameters in the trControl argument, and have selected the training subset by referring to in_train. Also, the ‘verbose=FALSE’ argument in the gbm model is important!! Let’s look at results:
For the trf model:
Random Forest 340 samples 15 predictors 30 classes: '1', '10', '11', '12', '13', '14', '15', '2', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '4', '5', '6', '7', '8', '9' No pre-processing Resampling: Cross-Validated (10 fold, repeated 5 times) Summary of sample sizes: 228, 231, 233, 233, 232, 229, ... Resampling results across tuning parameters: mtry Accuracy Kappa Accuracy SD Kappa SD 2 0.7341953 0.7230754 0.07930583 0.08252806 8 0.7513803 0.7409347 0.08873493 0.09237854 14 0.7481404 0.7375215 0.08438226 0.08786254 Kappa was used to select the optimal model using the one SE rule. The final value used for the model was mtry = 8.
So as you can see it’s selected a random forest model that tries 8 random predictors at each split, and it seems to be doing pretty well with a Kappa of .74. Now let’s move on to the next results:
For the tgbm model:
Stochastic Gradient Boosting 340 samples 15 predictors 30 classes: '1', '10', '11', '12', '13', '14', '15', '2', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '4', '5', '6', '7', '8', '9' No pre-processing Resampling: Cross-Validated (10 fold, repeated 5 times) Summary of sample sizes: 226, 231, 229, 231, 228, 231, ... Resampling results across tuning parameters: interaction.depth n.trees Accuracy Kappa Accuracy SD Kappa SD 1 50 0.6550713 0.6406862 0.07735511 0.08017461 1 100 0.6779153 0.6646128 0.07461615 0.07739666 1 150 0.6799633 0.6667613 0.08291638 0.08592416 2 50 0.7000791 0.6876577 0.08467911 0.08771728 2 100 0.6984858 0.6860858 0.08711523 0.09041647 2 150 0.6886874 0.6759011 0.09157694 0.09494201 3 50 0.6838721 0.6708396 0.08850382 0.09166051 3 100 0.6992044 0.6868055 0.08423577 0.08714577 3 150 0.6976292 0.6851841 0.08414035 0.08693979 Tuning parameter 'shrinkage' was held constant at a value of 0.1 Kappa was used to select the optimal model using the one SE rule. The final values used for the model were n.trees = 50, interaction.depth = 2 and shrinkage = 0.1.
Here we see that it has chosen a gbm model with an interaction depth of 2 and 50 trees. This has a kappa of .69, which appears somewhat worse than the random forest model. Let’s do a direct comparison:
resampls = resamples(list(RF = trf, GBM = tgbm)) difValues = diff(resampls) summary(difValues) Call: summary.diff.resamples(object = difValues) p-value adjustment: bonferroni Upper diagonal: estimates of the difference Lower diagonal: p-value for H0: difference = 0 Accuracy RF GBM RF 0.05989 GBM 0.0003241 Kappa RF GBM RF 0.06229 GBM 0.0003208
Sure enough, the difference is statistically significant. The GBM value ends up being less accurate than the random forest model. Now let’s go to the testing stage! You’ll notice I’ve now stuck with the random forest model.
test = leaf[-in_train,] test$pred.leaf.rf = predict(trf, test, "raw") confusionMatrix(test$pred.leaf.rf, test$Class) ... Overall Statistics Accuracy : 0.7381 95% CI : (0.6307, 0.828) No Information Rate : 0.0833 P-Value [Acc > NIR] : < 2.2e-16 Kappa : 0.7277 Mcnemar's Test P-Value : NA ...
Please excuse the ellipses above as the confusionMatrix command generates voluminous output! Anyway, sure enough the Kappa statistic was not that far off in the test sample as it was from the training sample (recall it was .74). Also of interest to me (perhaps it’s boring to you!) is the No Information Rate. Allow me to explain: If I take all of the known classes in the testing sample, and just randomly guess which records to which they belong, I will probably get some right. And this is exactly what the No Information Rate is; the proportion of classes that you would guess right if you randomly allocated them. Obviously an accuracy of .74 and a Kappa of .73 are way higher than the No Information Rate, and so I’m happy that the model is doing more than just making lucky guesses!
Finally, caret has a function to calculate variable importance so that you can see which variables were the most informative in making distinctions between classes. The results for the random forest model follow:
varImp(trf, scale=FALSE) rf variable importance Overall Solidity 31.818 Aspect_Ratio 26.497 Eccentricity 23.300 Elongation 23.231 Isoperimetric 20.001 Entropy 18.064 Lobedness 15.608 Max_Ind_Depth 14.828 Uniformity 14.092 Third_Moment 13.148 Stoch_Convexity 12.810 Avg_Intensity 12.438 Smoothness 10.576 Avg_Contrast 9.481
As I have very little clue what these variables mean from their descriptions, someone much wiser than me in all things botanical would have to chime in and educate me.
Well, that was good fun! If you have any ideas to keep the good times rolling and get even better results, please chime in by commenting :)