Predicting Mobile Phone Prices

Recently a colleague of mine showed me a nauseating interactive scatterplot that plots mobile phones according to two dimensions of the user’s choice from a list of possible dimensions.  Although the interactive visualization was offensive to my tastes, the JSON data behind the visualization was intriguing.  It was easy enough to get the data behind it (see this link if you want an up to date copy and be sure to take out the “data=” from the start of the file! I pulled this data around noon on March 23rd.) so that I could start asking a simple question: Which of the available factors provided in the dataset were the most predictive of full mobile phone price?

I’ll present the graphs and then the predictive model first and then the code later on:

Price by OS and Brand:

Often when investigating a topic using data, we confirm things that we already knew to be true.  This is certainly the case here with price by OS and brand.  From the below boxplots we see that the bulk of iOS devices tend to be the most expensive, and that brand-wise Apple, Google, and Samsung seem to stick out.

Mobile Phone Price by Operating System

Mobile Phone Prices by Brand

Price by Storage Capacity, RAM, and SD Card Capacity:

Storage capacity is perhaps the least surprising to find as having such a sharply positive correlation with price. I think what is more surprising to me is that there aren’t more gradations of storage capacity in the higher range past 50 gigabytes.  I’m guessing this is because the bulk of these phones (bearing in mind roughly 90% of these phones are in fact smart phones) are catered towards lower income folks.  Can you guess which phones occupy the top-right-most position on the first graph?  If your answer involved the iPhone 6 then you’re right on two counts!

As you can see, the correlation between RAM and price is pretty linear (with phones costing $171.54 more for each additional gigabyte of RAM) and that between SD Card capacity and price is linear past the large group of phones with 0 SD Card capacity (with phones costing $3.64 more for each additional gigabyte of SD Card Capacity).

Price by Storage Capacity

Price by RAM

Price by SD Card

Price by Screen Size, Battery, and Weight:

The next factors that I think one would naturally think of when considering the price of a mobile phone are all related to how big the thing is. Smart phones these days have a lot of physical presence just by dint of their screen size alone. Add to the large screen size the batteries that are used to support such generous displays and you also get an impressive variety of weights to these phones.

In fact, for every additional inch of screen size to these phones, you can expect an additional .81504 ounces and 565.11 mAh of battery capacity. My own humble little smartphone (an HTC Desire 601) happens to be on the smaller and lighter side of the spectrum as far as screen size and weight goes (4.5 inches screen size, or 33rd percentile; 4.59 ounces or 26th percentile) but happens to have a pretty generous battery capacity all things considered (2100 mAh, or 56th percentile).

While positive correlations can be seen between Price and all these 3 factors, battery was the most correlated with Price, next to screen size and then weight.  There’s obviously a lot of variability in price when you look at the phones with the bigger screen sizes, as they probably tend to come packed with a variety of premium extra features that can be used to jack up the price.

Price by Screen Size

Price by Battery

Price by Weight

Putting it all together in a model:
Finally, let’s lump all of the factors provided in the data set into a model, and see how well it performs on a testing sample. I decided on an 80/20 training/testing split, and am of course using Max Kuhn’s fabulous caret package to do the dirty work. I ran a gbm model, shown below, and managed to get an R squared of 60.4% in the training sample, so not bad.

Stochastic Gradient Boosting 

257 samples
 23 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 173, 173, 171, 171, 172, 171, ... 

Resampling results across tuning parameters:

  interaction.depth  n.trees  RMSE      Rsquared   RMSE SD   Rsquared SD
  1                   50      150.1219  0.5441107  45.36781  0.1546993  
  1                  100      147.5400  0.5676971  46.03555  0.1528225  
  1                  150      146.3710  0.5803005  45.00296  0.1575795  
  2                   50      144.0657  0.5927624  45.46212  0.1736994  
  2                  100      143.7181  0.6036983  44.80662  0.1787351  
  2                  150      143.4850  0.6041207  45.57357  0.1760428  
  3                   50      148.4914  0.5729182  45.27579  0.1903465  
  3                  100      148.5363  0.5735842  43.41793  0.1746064  
  3                  150      148.8497  0.5785677  43.39338  0.1781990  

Tuning parameter 'shrinkage' was held constant at a value of 0.1
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were n.trees = 150, interaction.depth = 2 and shrinkage = 0.1.

Now let’s look at the terms that came out as the most significant in the chosen model.  Below we see some unsurprising findings! Storage, battery, weight, RAM, and whether or not the phone uses iOS as the top 5. I guess I’m surprised that screen size was not higher up in the priority list, but at least it got in 6th place!

gbm variable importance

  only 20 most important variables shown (out of 41)

att_storage      100.0000
att_battery_mah   59.7597
att_weight        46.5410
att_ram           27.5871
att_osiOS         26.9977
att_screen_size   21.1106
att_sd_card       20.1130
att_brandSamsung   9.1220

Finally, let’s look at how our model did in the testing sample. Below I’ve shown you a plot of actual versus predicted price values. The straight line is what we would expect to see if there were a perfect correlation between the two (obviously not!!) while the smoothed line is the trend that we actually do see in the scatter plot. Considering the high R squared in the testing sample of 57% (not too far off from the training sample) it’s of course a nice confirmation of the utility of this model to see the smooth line following that perfect prediction line, but I won’t call be calling up Rogers Wireless with the magical model just yet!

Price by Predicted Price

In fact, before I close off this post, it would be remiss of me not to investigate a couple of cases in this final graph that look like outliers. The one on the bottom right, and the one on the top left.

The one on the bottom right happens to be a Sony Xperia Z3v Black with 32GB of storage space. What I learned from checking into this is that since the pricing data on the source website is pulled from, sometimes instead of pulling the full regular price, it happens to pull the data on a day when a special sale or service agreement price is listed. When I pulled the data, the Xperia was listed at a price of $29.99. Today, on April 6th, the price that you would get if you looked it up through the source website is .99! Interestingly, my model had predicted a full price of $632.17, which was not very far off from the full price of $599.99 that you can see if you go on the listing on Not bad!

Now, how about the phone that cost so much but that the model said shouldn’t? This phone was none other than the Black LG 3960 Google Nexus 4 Unlocked GSM Phone with 16GB of Storage space. The price I pulled that day was a whopping $699.99 but the model only predicted a price of $241.86! Considering the specs on this phone, the only features that really seem to measure up are the storage (16GB is roughly in the 85th percentile for smart phones) and the RAM (2 GB is roughly in the 93rd percentile for smart phones). Overall though, the model can’t account for any other qualities that Google might have imbued into this phone that were not measured by the source website. Hence, this is a formidable model outlier!

If you take out the Sony Xperia that I mentioned first, the Adjusted R squared value goes up from 57% to 74%, and the Residual Standard Error decreases from $156 to $121. That’s a lot of influence for just one outlier that we found to be based on data quality alone. Wow!

Reflecting on this exercise, the one factor that I wished were collected is processor speed.  I’m curious how much that would factor into pricing decisions, but alas this information was unavailable.

Anyway, this was fun, and I hope not too boring for you, the readers. Thanks for reading!!

Contraceptive Choice in Indonesia

I wanted yet another opportunity to get to use the fabulous caret package, but also to finally give a try.  To scratch both itches, I dipped into the UCI machine learning library yet again and came up with a survey data set on the topic of contraceptive choice in Indonesia.  This was an interesting opportunity for me to learn about a far-off place while practicing some fun data skills.

According to recent estimates, Indonesia is home to some 250 million individuals and over the years, thanks to government intervention, has had its fertility rate slowed down from well over 5 births per woman, to a current value of under 2.4 births per woman.  Despite this slow down, Indonesia is not generating enough jobs to satisfy the population.  Almost a fifth of their youth labour force (aged 15-24) are unemployed (19.6%), a whole 6.3% more than a recent estimate of the youth unemployment rate in Canada (13.3%).  When you’re talking about a country with 250 million individuals (with approximately 43.4 million 15-24 year olds), that’s the difference between about 5.8 million unemployed and 8.5 million unemployed teenagers/young adults.  The very idea is frightening!  Hence the government’s focus (decades ago and now) on promoting contraceptive method use to its population.

That is the spirit behind the survey data we’ll look at today!  First, download the data from my account and load it into R.


cmc = read.csv("cmc.csv", colClasses = c(Wife.Edu = "factor", Husband.Edu = "factor", Wife.Religion = "factor", 
                                         Wife.Working = "factor", Husband.Occu = "factor", Std.of.Living = "factor",
                                         Media.Exposure = "factor", Contraceptive.Method.Used = "factor"))

levels(cmc$Contraceptive.Method.Used) = c("None","Short-Term","Long-Term")

      None Short-Term  Long-Term 
       629        333        511 


      None Short-Term  Long-Term 
 0.4270197  0.2260692  0.3469111 

Everything in this data set is stored as a number, so the first thing I do is to define as factors what the documentation suggests are factors.  Then, we see the numeric breakdown of how many women fell into each contraceptive method category.  It’s 1473 responses overall, and ‘None’ is the largest category, although it is not hugely different from the others (although a chi squared test does tell me that the proportions are significantly different from one another).

Next up, let’s set up the cross validation and training vs testing indices, then train a bunch of models, use a testing sample to compare performance, and then we’ll see some graphs

# It's training time!
control = trainControl(method = "cv")
in_train = createDataPartition(cmc$Contraceptive.Method.Used, p=.75, list=FALSE)

contraceptive.model = train(Contraceptive.Method.Used ~ ., data=cmc, method="rf", metric="Kappa",
                            trControl=control, subset=in_train)

contraceptive.model.gbm = train(Contraceptive.Method.Used ~ ., data=cmc, method="gbm", metric="Kappa",
                            trControl=control, subset=in_train, verbose=FALSE)

contraceptive.model.svm = train(Contraceptive.Method.Used ~ ., data=cmc, method="svmRadial", metric="Kappa",
                                preProc = c('center','scale'), trControl=control, subset=in_train)

contraceptive.model.c50 = train(Contraceptive.Method.Used ~ ., data=cmc, method="C5.0", metric="Kappa",
                                trControl=control, subset=in_train, verbose=FALSE)

dec = expand.grid(.decay=c(0,.0001,.01,.05,.10))
control.mreg = trainControl(method = "cv")
contraceptive.model.mreg = train(Contraceptive.Method.Used ~ ., data=cmc, method="multinom", metric="Kappa",
                                  trControl=control.mreg, tuneGrid = dec, subset=in_train, verbose=FALSE)

# And now it's testing time...
cmc.test = cmc[-in_train,]

cmc.test = cmc.test %>% mutate(
  rf.class = predict(contraceptive.model, cmc.test, type="raw"),
  gbm.class = predict(contraceptive.model.gbm, cmc.test, type="raw"),
  svm.class = predict(contraceptive.model.svm, cmc.test, type="raw"),
  mreg.class = predict(contraceptive.model.mreg, cmc.test, type="raw"),
  c50.class = predict(contraceptive.model.c50, cmc.test, type="raw"))

# Here I'm setting up a matrix to host some performance statistics from each of the models
cmatrix.metrics = data.frame(model = c("rf", "gbm", "svm", "mreg","c50"), kappa = rep(NA,5), sensitivity1 = rep(NA,5), sensitivity2 = rep(NA,5), sensitivity3 = rep(NA,5), specificity1 = rep(NA,5), specificity2 = rep(NA,5), specificity3 = rep(NA,5))

# For each of the models, I use confusionMatrix to give me the performance statistics that I want
for (i in 11:15) {
  cmatrix.metrics[i-10,"kappa"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$overall[2][[1]]
  cmatrix.metrics[i-10, "sensitivity1"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[1,1]
  cmatrix.metrics[i-10, "sensitivity2"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[2,1]
  cmatrix.metrics[i-10, "sensitivity3"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[3,1]
  cmatrix.metrics[i-10, "specificity1"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[1,2]
  cmatrix.metrics[i-10, "specificity2"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[2,2]
  cmatrix.metrics[i-10, "specificity3"] = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])$byClass[3,2]

# Now I transform my cmatrix.metrics matrix into a long format suitable for ggplot, graph it, and then post it to

cmatrix.long = melt(cmatrix.metrics, id.vars=1, measure.vars=c(2:8))

ggplot(cmatrix.long, aes(x=model, y=value)) + geom_point(stat="identity", colour="blue", size=3) + facet_grid(~variable) + theme(axis.text.x=element_text(angle=90,vjust=0.5, colour="black",size=12), axis.text.y=element_text(colour="black", size=12), strip.text=element_text(size=12), axis.title.x=element_text(size=14,face="bold"), axis.title.y=element_text(size=14, face="bold")) + ggtitle("Performance Stats for Contraceptive Choice Models") + scale_x_discrete("Model") + scale_y_continuous("Value")

py = plotly()

And behold the beautiful graph (after some tinkering in the interface, of course)

performance_stats_for_contraceptive_choice_modelsPlease note, the 1/2/3 next to sensitivity and specificity refer to the contraceptive use classes, ‘None’, ‘Short-term’, ‘Long-term’.

Firstly, the kappa stats are telling us that after you factor into accuracy the results you would expect by chance, the models don’t add a humongous level of predictive power.  So, we won’t collect our nobel prize on this one!  The classes are evidently more difficult to positively identify than they are to negatively identify, as evidenced by the lower sensitivity scores than specificity scores.  Interestingly, class 1 (users of NO contraceptive methods) was the most positively identifiable.

Secondly, it looks like we have a list of the top 3 performers in terms of kappa, sensitivity and specificity (in no particular order because I they don’t appear that different):

  • gbm (gradient boosting machine)
  • svm (support vector machine)
  • c50

Now that we have a subset of models, let’s see what they have to say about variable importance.  Once we see what they have to say, we’ll do some more fun graphing to see the variables in action.

gbm.imp = varImp(contraceptive.model.gbm)
svm.imp = varImp(contraceptive.model.svm)
c50.imp = varImp(contraceptive.model.c50)

I haven’t used ggplot here because I decided to try copy pasting the tables directly into for the heck of it. Let’s have a look at variable importance according to the three models now:



contraceptive_method_model_-_svm_variable_importanceOne commonality that you see throughout all of these graphs is that the wife’s age, her education, and the number of children born to her are among the most predictive in terms of classifying where she will fall on the contraceptive choice spectrum provided in the survey.  Given that result, let’s see some graphs that plot contraceptive choice against the aforementioned 3 predictive variables:

plot(Contraceptive.Method.Used ~ Wife.Age, data=cmc, main="Contraceptive Method Used According to Age of Wife") = melt(dcast(cmc, Num.Kids ~ Contraceptive.Method.Used, value.var="Contraceptive.Method.Used", fun.aggregate=length), id.vars=1,measure.vars=c(2:4),"Contraceptive.Method.Used","Num.Records")

ggplot(, aes(x=Num.Kids, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(position="fill", stat="identity") + scale_y_continuous(labels=percent) + ggtitle("Contraceptive Choice by Number of Kids")
ggplot(, aes(x=Num.Kids, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(stat="identity") + ggtitle("Contraceptive Choice by Number of Kids") = melt(dcast(cmc, Wife.Edu ~ Contraceptive.Method.Used, value.var="Contraceptive.Method.Used", fun.aggregate=length), id.vars=1,measure.vars=c(2:4),"Contraceptive.Method.Used","Num.Records")

ggplot(, aes(x=Wife.Edu, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(position="fill", stat="identity") + scale_y_continuous(labels=percent) + ggtitle("Contraceptive Choice by Education Level of Wife")
ggplot(, aes(x=Wife.Edu, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(stat="identity") + ggtitle("Contraceptive Choice by Education Level of Wife")

And now the graphs:

Contraceptive Choice by Age of Wife

Okay so this isn’t a plotly graph, but I do rather like the way this mosaic plot conveys the information.  The insight here is after a certain age (around age 36) the women seem to be less inclined to report the use of long-term contraceptive methods, and more inclined to report no contraceptive use at all.  Could it be that the older women feel that they are done with child bearing, and so do not feel the need for contraception anymore?  Perhaps someone more knowledgeable on the matter could enlighten me!


contraceptive_choice_by_education_level_of_wife prop

Here’s a neat one, showing us the perhaps not-so-revelatory result that as the education level of the wife increases, the likelihood that she will report no contraception diminishes.  Here it is in fact the short term contraceptive methods where we see the biggest increase in likelihood as the education level increases.  I don’t think you can cite education in and of itself that causes women to choose contraception, because perhaps it’s higher socio-economic status which leads them to pursue higher education which leads them to choose contraception.  I’m not sure this survey will answer that question, however!


contraceptive_choice_by_number_of_kids prop

Finally, we have number of kids, which exhibits an odd relationship with contraceptive choice. It appears as though at least half of the women with 1 child reported no contraception, but that proportion goes down when you look at women with 3 children.  After that, women are more and more likely to cite no contraception, likely reflecting their age.

Conclusion and observations:

As I mentioned earlier, I don’t expect to pick up any nobel prizes from this analysis.  Substantively, the most interesting thing that came out of this analysis for me is that it was stage of life factors (# kids and age) in addition to the wife’s education (and possibly income, but that wasn’t measured) which formed the most predictive variables in the classification of who uses which contraceptive methods.  Naively, I expected wife’s religion to be amongst the most predictive.

According to UNICEF, 6/10 drop-outs in primary school are girls.  That increases to 7/10 in secondary school.  Stop that trend from happening, and then who knows what improvements might result?  Perhaps continuing to invest in girls’ education will help lay the foundation for the later pursuit of higher education and the slowing down of their population expansion.

Lastly, I have some comments about plotly:

  1. It was a big buzz kill to discover that I couldn’t embed my plotly plots into my wordpress blog.  Bummer.
  2. The plots that did result were very nice looking!
  3. I found myself getting impatient figuring out where to click to customize my plots to my liking.  There’s something about going from a command line to a gui which is odd for me!
  4. I initially thought it was cool that I could save a theme for my plot once I customized it to my liking.  I was disappointed to learn that the theme was not saved as an absolute set of visual characteristics.  For some reason those characteristics changed depending on the initial graph I imported from R and I could not just apply it and be done with it.
  5. I found myself wondering if I there was a better way of getting my R graphs into plotly than the proscribed py$ggplotly() method.  It’s not my biggest complaint, but somehow I’d rather just have one command to batch upload all of my graphs.

I’ll be watching plotly as it evolves and hope to see it improve even more than it has already!  Good luck!

Predictive modelling fun with the caret package

I’m back!  6 months after my second child was born, I’ve finally made it back to my blog with something fun to write about.  I recently read through the excellent Machine Learning with R ebook and was impressed by the caret package and how easy it made it seem to do predictive modelling that was a little more than just the basics.

With that in mind, I went searching through the UCI machine learning repository and found a dataset about leaves that looked promising for a classification problem.  The dataset comprises of leaves from almost 40 different plant species, and has 14 numerical attributes describing each leaf.  It comes with a pdf file that shows pretty pictures of each leaf for the botanists out there, and some very mathematics heavy descriptions of each of the attributes which I couldn’t even hope to understand with my lack of education on the matter!

Seeing that it didn’t look overly complex to process, I decided to load it in and set up the overall training parameters:

leaf = read.csv("leaf.csv", colClasses = c(Class = "factor"))
ctrl = trainControl(method="repeatedcv", number=10, repeats=5, selectionFunction = "oneSE")
in_train = createDataPartition(leaf$Class, p=.75, list=FALSE)

First, I made sure that the Class variable remained a factor, even though it’s coded with integers in the incoming data.  This way once I split the data into a test set, I won’t get any complaints about missing outcome values if the sampling doesn’t pick up one of those values!

You’ll notice I’ve tried repeated cross validation here, with 5 repeats, and have used the ‘oneSE’ selection function.  This ensures that for whichever model I choose, the model gets tested on 10 different parts of my data, repeated 5 times over, and then I’ve chosen the ‘oneSE’ function to hopefully select a model that is not the most complex.  Finally, I use createDataPartition to create a a training sample of 75% of the data.

trf = train(Class ~ Eccentricity + Aspect_Ratio + Elongation +
              Solidity + Stoch_Convexity + Isoperimetric + 
              Max_Ind_Depth + Lobedness + Avg_Intensity + 
              Avg_Contrast + Smoothness + Third_Moment + 
              Uniformity + Entropy, data=leaf, method="rf", metric="Kappa",
            trControl=ctrl, subset = in_train)

tgbm = train(Class ~ Eccentricity + Aspect_Ratio + Elongation +
              Solidity + Stoch_Convexity + Isoperimetric + 
              Max_Ind_Depth + Lobedness + Avg_Intensity + 
              Avg_Contrast + Smoothness + Third_Moment + 
              Uniformity + Entropy, data=leaf, method="gbm", metric="Kappa",
            trControl=ctrl, subset = in_train, verbose=FALSE)

I’ve chosen to use a random forest and a generalized boosted model to try to model leaf class.  Notice how I’ve referred to the training parameters in the trControl argument, and have selected the training subset by referring to in_train.  Also, the ‘verbose=FALSE’ argument in the gbm model is important!!  Let’s look at results:

For the trf model:

Random Forest
340 samples
15 predictors
30 classes: '1', '10', '11', '12', '13', '14', '15', '2', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '4', '5', '6', '7', '8', '9'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)

Summary of sample sizes: 228, 231, 233, 233, 232, 229, ...

Resampling results across tuning parameters:

mtry Accuracy Kappa Accuracy SD Kappa SD
2 0.7341953 0.7230754 0.07930583 0.08252806
8 0.7513803 0.7409347 0.08873493 0.09237854
14 0.7481404 0.7375215 0.08438226 0.08786254

Kappa was used to select the optimal model using the one
SE rule.
The final value used for the model was mtry = 8.

So as you can see it’s selected a random forest model that tries 8 random predictors at each split, and it seems to be doing pretty well with a Kappa of .74. Now let’s move on to the next results:

For the tgbm model:

Stochastic Gradient Boosting 

340 samples
 15 predictors
 30 classes: '1', '10', '11', '12', '13', '14', '15', '2', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 226, 231, 229, 231, 228, 231, ... 

Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD  
  1                   50      0.6550713  0.6406862  0.07735511   0.08017461
  1                  100      0.6779153  0.6646128  0.07461615   0.07739666
  1                  150      0.6799633  0.6667613  0.08291638   0.08592416
  2                   50      0.7000791  0.6876577  0.08467911   0.08771728
  2                  100      0.6984858  0.6860858  0.08711523   0.09041647
  2                  150      0.6886874  0.6759011  0.09157694   0.09494201
  3                   50      0.6838721  0.6708396  0.08850382   0.09166051
  3                  100      0.6992044  0.6868055  0.08423577   0.08714577
  3                  150      0.6976292  0.6851841  0.08414035   0.08693979

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Kappa was used to select the optimal model using  the one SE rule.
The final values used for the model were n.trees = 50, interaction.depth = 2 and shrinkage = 0.1.

Here we see that it has chosen a gbm model with an interaction depth of 2 and 50 trees. This has a kappa of .69, which appears somewhat worse than the random forest model. Let’s do a direct comparison:

resampls = resamples(list(RF = trf,
                          GBM = tgbm))

difValues = diff(resampls)

summary.diff.resamples(object = difValues)

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

    RF        GBM    
RF            0.05989
GBM 0.0003241        

    RF        GBM    
RF            0.06229
GBM 0.0003208  

Sure enough, the difference is statistically significant. The GBM value ends up being less accurate than the random forest model. Now let’s go to the testing stage! You’ll notice I’ve now stuck with the random forest model.

test = leaf[-in_train,]
test$pred.leaf.rf = predict(trf, test, "raw")
confusionMatrix(test$pred.leaf.rf, test$Class)

Overall Statistics
               Accuracy : 0.7381         
                 95% CI : (0.6307, 0.828)
    No Information Rate : 0.0833         
    P-Value [Acc > NIR] : < 2.2e-16      
                  Kappa : 0.7277         
 Mcnemar's Test P-Value : NA      

Please excuse the ellipses above as the confusionMatrix command generates voluminous output! Anyway, sure enough the Kappa statistic was not that far off in the test sample as it was from the training sample (recall it was .74). Also of interest to me (perhaps it’s boring to you!) is the No Information Rate. Allow me to explain: If I take all of the known classes in the testing sample, and just randomly guess which records to which they belong, I will probably get some right. And this is exactly what the No Information Rate is; the proportion of classes that you would guess right if you randomly allocated them. Obviously an accuracy of .74 and a Kappa of .73 are way higher than the No Information Rate, and so I’m happy that the model is doing more than just making lucky guesses!

Finally, caret has a function to calculate variable importance so that you can see which variables were the most informative in making distinctions between classes.  The results for the random forest model follow:

varImp(trf, scale=FALSE)
rf variable importance

Solidity         31.818
Aspect_Ratio     26.497
Eccentricity     23.300
Elongation       23.231
Isoperimetric    20.001
Entropy          18.064
Lobedness        15.608
Max_Ind_Depth    14.828
Uniformity       14.092
Third_Moment     13.148
Stoch_Convexity  12.810
Avg_Intensity    12.438
Smoothness       10.576
Avg_Contrast      9.481

As I have very little clue what these variables mean from their descriptions, someone much wiser than me in all things botanical would have to chime in and educate me.

Well, that was good fun! If you have any ideas to keep the good times rolling and get even better results, please chime in by commenting 🙂