Predicting Mobile Phone Prices

Recently a colleague of mine showed me a nauseating interactive scatterplot that plots mobile phones along two dimensions of the user’s choice from a list of possible dimensions.  Although the interactive visualization was offensive to my tastes, the JSON data behind it was intriguing.  It was easy enough to get that data (see this link if you want an up-to-date copy, and be sure to take out the “data=” from the start of the file! I pulled this data around noon on March 23rd), so I could start asking a simple question: which of the factors provided in the dataset are the most predictive of a phone’s full price?
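For anyone who wants to follow along, reading the cleaned-up file into R is straightforward with jsonlite. This is just a sketch under a couple of assumptions: I’ve saved the file locally as phones.json (a name I made up), stripped the leading “data=”, and the JSON flattens into one row per phone.

library(jsonlite)

# Hypothetical file name; the "data=" prefix has already been stripped out
phones = fromJSON("phones.json")

# Flatten to a plain data frame and inspect the available columns
phones = as.data.frame(phones)
str(phones)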

I’ll present the graphs and the predictive model first, and the code later on:

Price by OS and Brand:

Often when investigating a topic using data, we confirm things that we already knew to be true.  This is certainly the case here with price by OS and brand.  From the below boxplots we see that the bulk of iOS devices tend to be the most expensive, and that brand-wise Apple, Google, and Samsung seem to stick out.

Mobile Phone Price by Operating System

Mobile Phone Prices by Brand
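
Boxplots like these are a one-liner in ggplot2. The sketch below assumes the parsed data frame is called phones, and that the price, OS, and brand columns are named att_price, att_os, and att_brand; the att_ prefix matches the variable names in the model output further down, but these exact names are my assumption.

library(ggplot2)

# Price by operating system (column names are assumed)
ggplot(phones, aes(x = att_os, y = att_price)) + geom_boxplot() +
  ggtitle("Mobile Phone Price by Operating System")

# Price by brand, with the brand labels rotated so they stay legible
ggplot(phones, aes(x = att_brand, y = att_price)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ggtitle("Mobile Phone Prices by Brand")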

Price by Storage Capacity, RAM, and SD Card Capacity:

Storage capacity is perhaps the least surprising factor to have such a sharply positive correlation with price. What is more surprising to me is that there aren’t more gradations of storage capacity in the higher range past 50 gigabytes.  I’m guessing this is because the bulk of these phones (bearing in mind roughly 90% of them are in fact smart phones) cater to lower-income folks.  Can you guess which phones occupy the top-right-most position on the first graph?  If your answer involved the iPhone 6 then you’re right on two counts!

As you can see, the correlation between RAM and price is pretty linear (with phones costing $171.54 more for each additional gigabyte of RAM) and that between SD Card capacity and price is linear past the large group of phones with 0 SD Card capacity (with phones costing $3.64 more for each additional gigabyte of SD Card Capacity).

Price by Storage Capacity

Price by RAM

Price by SD Card
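
The per-gigabyte dollar figures quoted above are exactly what you get from the slopes of simple linear regressions, along the lines of this sketch (same caveat about the assumed att_ column names):

# Slope = dollars of price per additional GB of RAM
summary(lm(att_price ~ att_ram, data = phones))

# Slope = dollars of price per additional GB of SD card capacity,
# restricted to phones that actually take an SD card
summary(lm(att_price ~ att_sd_card, data = subset(phones, att_sd_card > 0)))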

Price by Screen Size, Battery, and Weight:

The next factors that I think one would naturally think of when considering the price of a mobile phone are all related to how big the thing is. Smart phones these days have a lot of physical presence just by dint of their screen size alone. Add to the large screen size the batteries that are used to support such generous displays and you also get an impressive variety of weights to these phones.

In fact, for every additional inch of screen size, you can expect an additional .81504 ounces of weight and 565.11 mAh of battery capacity. My own humble little smartphone (an HTC Desire 601) happens to be on the smaller and lighter side of the spectrum as far as screen size and weight go (a 4.5 inch screen, or 33rd percentile; 4.59 ounces, or 26th percentile) but has a pretty generous battery capacity all things considered (2100 mAh, or 56th percentile).
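
Those per-inch figures come from regressing weight and battery capacity on screen size, roughly like this (again a sketch; att_weight, att_battery_mah, and att_screen_size do appear in the variable importance output below, but the rest is assumed):

# Additional ounces of weight per inch of screen size
coef(lm(att_weight ~ att_screen_size, data = phones))

# Additional mAh of battery capacity per inch of screen size
coef(lm(att_battery_mah ~ att_screen_size, data = phones))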

While positive correlations can be seen between price and all 3 of these factors, battery was the most correlated with price, followed by screen size and then weight.  There’s obviously a lot of variability in price when you look at the phones with the bigger screen sizes, as they probably tend to come packed with a variety of premium extra features that can be used to jack up the price.

Price by Screen Size

Price by Battery

Price by Weight

Putting it all together in a model:
Finally, let’s lump all of the factors provided in the data set into a model, and see how well it performs on a testing sample. I decided on an 80/20 training/testing split, and am of course using Max Kuhn’s fabulous caret package to do the dirty work. I ran a gbm model, shown below, and managed to get an R squared of 60.4% in the training sample, so not bad.
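
I haven’t pasted my exact call here, but the setup looked roughly like the following sketch: an 80/20 split via createDataPartition, 10-fold cross validation, and a gbm model on all available predictors (att_price as the outcome name is, again, an assumption):

library(caret)

set.seed(123)  # arbitrary seed, just so the sketch is reproducible
in_train = createDataPartition(phones$att_price, p = .8, list = FALSE)

control = trainControl(method = "cv", number = 10)

phone.model.gbm = train(att_price ~ ., data = phones, method = "gbm",
                        trControl = control, subset = in_train, verbose = FALSE)
phone.model.gbm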

Stochastic Gradient Boosting 

257 samples
 23 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 173, 173, 171, 171, 172, 171, ... 

Resampling results across tuning parameters:

  interaction.depth  n.trees  RMSE      Rsquared   RMSE SD   Rsquared SD
  1                   50      150.1219  0.5441107  45.36781  0.1546993  
  1                  100      147.5400  0.5676971  46.03555  0.1528225  
  1                  150      146.3710  0.5803005  45.00296  0.1575795  
  2                   50      144.0657  0.5927624  45.46212  0.1736994  
  2                  100      143.7181  0.6036983  44.80662  0.1787351  
  2                  150      143.4850  0.6041207  45.57357  0.1760428  
  3                   50      148.4914  0.5729182  45.27579  0.1903465  
  3                  100      148.5363  0.5735842  43.41793  0.1746064  
  3                  150      148.8497  0.5785677  43.39338  0.1781990  

Tuning parameter 'shrinkage' was held constant at a value of 0.1
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were n.trees = 150, interaction.depth = 2 and shrinkage = 0.1.

Now let’s look at the terms that came out as the most important in the chosen model.  Below we see some unsurprising findings: storage, battery, weight, RAM, and whether or not the phone uses iOS make up the top 5. I guess I’m surprised that screen size was not higher up the list, but at least it got 6th place!

gbm variable importance

  only 20 most important variables shown (out of 41)

                  Overall
att_storage      100.0000
att_battery_mah   59.7597
att_weight        46.5410
att_ram           27.5871
att_osiOS         26.9977
att_screen_size   21.1106
att_sd_card       20.1130
att_brandSamsung   9.1220

Finally, let’s look at how our model did in the testing sample. Below I’ve shown you a plot of actual versus predicted price values. The straight line is what we would expect to see if there were a perfect correlation between the two (obviously not!!) while the smoothed line is the trend that we actually do see in the scatter plot. Considering the testing-sample R squared of 57% (not too far off from the training sample), it’s a nice confirmation of the model’s utility to see the smoothed line following that perfect prediction line, but I won’t be calling up Rogers Wireless with this magical model just yet!

Price by Predicted Price
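
A plot like this takes only a few lines with ggplot2; here’s a sketch that continues the hypothetical phone.model.gbm example from above:

phones.test = phones[-in_train,]
phones.test$pred_price = predict(phone.model.gbm, phones.test)

ggplot(phones.test, aes(x = pred_price, y = att_price)) + geom_point() +
  geom_abline(intercept = 0, slope = 1) +  # the "perfect prediction" line
  geom_smooth(se = FALSE) +                # the trend we actually observe
  ggtitle("Price by Predicted Price")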

In fact, before I close off this post, it would be remiss of me not to investigate a couple of cases in this final graph that look like outliers: the one on the bottom right, and the one on the top left.

The one on the bottom right happens to be a Sony Xperia Z3v Black with 32GB of storage space. What I learned from checking into this is that since the pricing data on the source website is pulled from amazon.com, sometimes instead of pulling the full regular price, it happens to pull the data on a day when a special sale or service agreement price is listed. When I pulled the data, the Xperia was listed at a price of $29.99. Today, on April 6th, the price that you would get if you looked it up through the source website is .99! Interestingly, my model had predicted a full price of $632.17, which was not very far off from the full price of $599.99 that you can see if you go on the listing on amazon.com. Not bad!

Now, how about the phone that cost so much but that the model said shouldn’t? This phone was none other than the Black LG 3960 Google Nexus 4 Unlocked GSM Phone with 16GB of Storage space. The price I pulled that day was a whopping $699.99 but the model only predicted a price of $241.86! Considering the specs on this phone, the only features that really seem to measure up are the storage (16GB is roughly in the 85th percentile for smart phones) and the RAM (2 GB is roughly in the 93rd percentile for smart phones). Overall though, the model can’t account for any other qualities that Google might have imbued into this phone that were not measured by the source website. Hence, this is a formidable model outlier!

If you take out the Sony Xperia that I mentioned first, the Adjusted R squared value goes up from 57% to 74%, and the Residual Standard Error decreases from $156 to $121. That’s a lot of influence for a single outlier that, as we found, comes down to data quality alone. Wow!
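
The R squared and residual standard error figures above come from regressing actual on predicted price in the testing sample; refitting without the mispriced Xperia shows just how much leverage it has. Continuing the same sketch (I’m identifying the phone by its $29.99 sale price, which is a shortcut):

# Fit with all test-set phones
summary(lm(att_price ~ pred_price, data = phones.test))

# Refit with the $29.99 Sony Xperia Z3v taken out
summary(lm(att_price ~ pred_price, data = subset(phones.test, att_price != 29.99)))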

Reflecting on this exercise, the one factor that I wished were collected is processor speed.  I’m curious how much that would factor into pricing decisions, but alas this information was unavailable.

Anyway, this was fun, and I hope not too boring for you, the readers. Thanks for reading!!

Contraceptive Choice in Indonesia

I wanted yet another opportunity to get to use the fabulous caret package, but also to finally give plot.ly a try.  To scratch both itches, I dipped into the UCI machine learning library yet again and came up with a survey data set on the topic of contraceptive choice in Indonesia.  This was an interesting opportunity for me to learn about a far-off place while practicing some fun data skills.

According to recent estimates, Indonesia is home to some 250 million individuals and, over the years, thanks to government intervention, has seen its fertility rate slow from well over 5 births per woman to a current value of under 2.4 births per woman.  Despite this slowdown, Indonesia is not generating enough jobs to satisfy its population.  Almost a fifth of its youth labour force (aged 15-24) is unemployed (19.6%), a whole 6.3 percentage points more than a recent estimate of the youth unemployment rate in Canada (13.3%).  When you’re talking about a country with 250 million individuals (with approximately 43.4 million 15-24 year olds), that’s the difference between about 5.8 million and 8.5 million unemployed teenagers/young adults.  The very idea is frightening!  Hence the government’s focus (decades ago and now) on promoting contraceptive method use to its population.
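(A quick back-of-the-envelope check on those two figures:)

# Approximate number of unemployed 15-24 year olds in Indonesia...
43.4e6 * 0.196  # ...at its 19.6% youth unemployment rate: ~8.5 million
43.4e6 * 0.133  # ...if it instead had Canada's 13.3% rate: ~5.8 million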

That is the spirit behind the survey data we’ll look at today!  First, download the data from my plot.ly account and load it into R.

library(plotly)
library(caret)
library(dplyr)
library(reshape2)
library(scales)

cmc = read.csv("cmc.csv", colClasses = c(Wife.Edu = "factor", Husband.Edu = "factor", Wife.Religion = "factor", 
                                         Wife.Working = "factor", Husband.Occu = "factor", Std.of.Living = "factor",
                                         Media.Exposure = "factor", Contraceptive.Method.Used = "factor"))

levels(cmc$Contraceptive.Method.Used) = c("None","Short-Term","Long-Term")

table(cmc$Contraceptive.Method.Used)
      None Short-Term  Long-Term 
       629        333        511 

prop.table(table(cmc$Contraceptive.Method.Used))

      None Short-Term  Long-Term 
 0.4270197  0.2260692  0.3469111 

Everything in this data set is stored as a number, so the first thing I do is define as factors the columns that the documentation says are factors.  Then we see the numeric breakdown of how many women fell into each contraceptive method category.  There are 1473 responses overall, and ‘None’ is the largest category, though not hugely different from the others (a chi squared test does tell me that the proportions differ significantly from one another).
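(The chi squared test I mention is a one-liner against the frequency table; I’m assuming a goodness-of-fit test against equal proportions here:)

# Are the three contraceptive-method proportions plausibly equal?
chisq.test(table(cmc$Contraceptive.Method.Used))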

Next up, let’s set up the cross validation and the training vs. testing indices, train a bunch of models, compare their performance on a testing sample, and then look at some graphs.

# It's training time!
control = trainControl(method = "cv")
in_train = createDataPartition(cmc$Contraceptive.Method.Used, p=.75, list=FALSE)

contraceptive.model = train(Contraceptive.Method.Used ~ ., data=cmc, method="rf", metric="Kappa",
                            trControl=control, subset=in_train)

contraceptive.model.gbm = train(Contraceptive.Method.Used ~ ., data=cmc, method="gbm", metric="Kappa",
                            trControl=control, subset=in_train, verbose=FALSE)

contraceptive.model.svm = train(Contraceptive.Method.Used ~ ., data=cmc, method="svmRadial", metric="Kappa",
                                preProc = c('center','scale'), trControl=control, subset=in_train)

contraceptive.model.c50 = train(Contraceptive.Method.Used ~ ., data=cmc, method="C5.0", metric="Kappa",
                                trControl=control, subset=in_train, verbose=FALSE)


dec = expand.grid(.decay=c(0,.0001,.01,.05,.10))
control.mreg = trainControl(method = "cv")
contraceptive.model.mreg = train(Contraceptive.Method.Used ~ ., data=cmc, method="multinom", metric="Kappa",
                                  trControl=control.mreg, tuneGrid = dec, subset=in_train, verbose=FALSE)

# And now it's testing time...
cmc.test = cmc[-in_train,]

cmc.test = cmc.test %>% mutate(
  rf.class = predict(contraceptive.model, cmc.test, type="raw"),
  gbm.class = predict(contraceptive.model.gbm, cmc.test, type="raw"),
  svm.class = predict(contraceptive.model.svm, cmc.test, type="raw"),
  mreg.class = predict(contraceptive.model.mreg, cmc.test, type="raw"),
  c50.class = predict(contraceptive.model.c50, cmc.test, type="raw"))

# Here I'm setting up a matrix to host some performance statistics from each of the models
cmatrix.metrics = data.frame(model = c("rf", "gbm", "svm", "mreg", "c50"),
                             kappa = rep(NA,5),
                             sensitivity1 = rep(NA,5), sensitivity2 = rep(NA,5), sensitivity3 = rep(NA,5),
                             specificity1 = rep(NA,5), specificity2 = rep(NA,5), specificity3 = rep(NA,5))

# For each of the models, I use confusionMatrix to give me the performance statistics that I want
for (i in 11:15) {
  # Compute the confusion matrix once per model, then pull out the stats we want
  cm = confusionMatrix(cmc.test$Contraceptive.Method.Used, cmc.test[,i])
  cmatrix.metrics[i-10, "kappa"] = cm$overall[2][[1]]
  cmatrix.metrics[i-10, "sensitivity1"] = cm$byClass[1,1]
  cmatrix.metrics[i-10, "sensitivity2"] = cm$byClass[2,1]
  cmatrix.metrics[i-10, "sensitivity3"] = cm$byClass[3,1]
  cmatrix.metrics[i-10, "specificity1"] = cm$byClass[1,2]
  cmatrix.metrics[i-10, "specificity2"] = cm$byClass[2,2]
  cmatrix.metrics[i-10, "specificity3"] = cm$byClass[3,2]
}

# Now I transform my cmatrix.metrics matrix into a long format suitable for ggplot, graph it, and then post it to plot.ly

cmatrix.long = melt(cmatrix.metrics, id.vars=1, measure.vars=c(2:8))

ggplot(cmatrix.long, aes(x=model, y=value)) +
  geom_point(stat="identity", colour="blue", size=3) +
  facet_grid(~variable) +
  theme(axis.text.x=element_text(angle=90, vjust=0.5, colour="black", size=12),
        axis.text.y=element_text(colour="black", size=12),
        strip.text=element_text(size=12),
        axis.title.x=element_text(size=14, face="bold"),
        axis.title.y=element_text(size=14, face="bold")) +
  ggtitle("Performance Stats for Contraceptive Choice Models") +
  scale_x_discrete("Model") +
  scale_y_continuous("Value")

set_credentials_file("my.username","my.password")
py = plotly()
py$ggplotly()

And behold the beautiful graph (after some tinkering in the plot.ly interface, of course):

Performance Stats for Contraceptive Choice Models

Please note, the 1/2/3 next to sensitivity and specificity refer to the contraceptive use classes: ‘None’, ‘Short-Term’, and ‘Long-Term’.

Firstly, the kappa stats tell us that, once you factor in the accuracy you would expect by chance, the models don’t add a humongous level of predictive power.  So, we won’t be collecting a Nobel Prize for this one!  The classes are evidently more difficult to positively identify than they are to negatively identify, as evidenced by sensitivity scores that are lower than the specificity scores.  Interestingly, class 1 (users of NO contraceptive method) was the most positively identifiable.

Secondly, it looks like we have a list of the top 3 performers in terms of kappa, sensitivity and specificity (in no particular order, because they don’t appear that different):

  • gbm (gradient boosting machine)
  • svm (support vector machine)
  • c50

Now that we have a subset of models, let’s see what they have to say about variable importance.  Once we see what they have to say, we’ll do some more fun graphing to see the variables in action.

gbm.imp = varImp(contraceptive.model.gbm)
svm.imp = varImp(contraceptive.model.svm)
c50.imp = varImp(contraceptive.model.c50)

I haven’t used ggplot here because I decided to try copy-pasting the tables directly into plot.ly for the heck of it. Let’s have a look at variable importance according to the three models:

Contraceptive Choice Model - GBM Variable Importance

Contraceptive Choice Model - C5.0 Variable Importance

Contraceptive Method Model - SVM Variable Importance

One commonality that you see throughout all of these graphs is that the wife’s age, her education, and the number of children born to her are among the most predictive variables when it comes to classifying where she will fall on the contraceptive choice spectrum provided in the survey.  Given that result, let’s see some graphs that plot contraceptive choice against those 3 predictive variables:

plot(Contraceptive.Method.Used ~ Wife.Age, data=cmc, main="Contraceptive Method Used According to Age of Wife")

cmethod.by.kids = melt(dcast(cmc, Num.Kids ~ Contraceptive.Method.Used, value.var="Contraceptive.Method.Used", fun.aggregate=length), id.vars=1,measure.vars=c(2:4), variable.name="Contraceptive.Method.Used", value.name="Num.Records")

ggplot(cmethod.by.kids, aes(x=Num.Kids, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(position="fill", stat="identity") + scale_y_continuous(labels=percent) + ggtitle("Contraceptive Choice by Number of Kids")
ggplot(cmethod.by.kids, aes(x=Num.Kids, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(stat="identity") + ggtitle("Contraceptive Choice by Number of Kids")

cmethod.by.wife.edu = melt(dcast(cmc, Wife.Edu ~ Contraceptive.Method.Used, value.var="Contraceptive.Method.Used", fun.aggregate=length), id.vars=1,measure.vars=c(2:4), variable.name="Contraceptive.Method.Used", value.name="Num.Records")

ggplot(cmethod.by.wife.edu, aes(x=Wife.Edu, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(position="fill", stat="identity") + scale_y_continuous(labels=percent) + ggtitle("Contraceptive Choice by Education Level of Wife")
ggplot(cmethod.by.wife.edu, aes(x=Wife.Edu, y=Num.Records, fill=Contraceptive.Method.Used)) + geom_bar(stat="identity") + ggtitle("Contraceptive Choice by Education Level of Wife")

And now the graphs:

Contraceptive Choice by Age of Wife

Okay, so this isn’t a plotly graph, but I do rather like the way this mosaic plot conveys the information.  The insight here is that after a certain age (around 36), the women seem to be less inclined to report the use of long-term contraceptive methods, and more inclined to report no contraceptive use at all.  Could it be that the older women feel that they are done with child bearing, and so do not feel the need for contraception anymore?  Perhaps someone more knowledgeable on the matter could enlighten me!

Contraceptive Choice by Education Level of Wife

Contraceptive Choice by Education Level of Wife (proportions)

Here’s a neat one, showing us the perhaps not-so-revelatory result that as the education level of the wife increases, the likelihood that she will report no contraception diminishes.  In fact, it’s the short-term contraceptive methods that show the biggest increase in likelihood as the education level increases.  I don’t think you can cite education in and of itself as the cause of women choosing contraception; perhaps it’s higher socio-economic status that leads them both to pursue higher education and to choose contraception.  I’m not sure this survey will answer that question, however!

Contraceptive Choice by Number of Kids

Contraceptive Choice by Number of Kids (proportions)

Finally, we have number of kids, which exhibits an odd relationship with contraceptive choice. It appears as though at least half of the women with 1 child reported no contraception, but that proportion goes down when you look at women with 3 children.  After that, women are more and more likely to cite no contraception, likely reflecting their age.

Conclusion and observations:

As I mentioned earlier, I don’t expect to pick up any Nobel Prizes from this analysis.  Substantively, the most interesting thing that came out of it for me is that stage-of-life factors (number of kids and age), in addition to the wife’s education (and possibly income, though that wasn’t measured), formed the most predictive variables in classifying who uses which contraceptive method.  Naively, I expected the wife’s religion to be amongst the most predictive.

According to UNICEF, 6/10 drop-outs in primary school are girls.  That increases to 7/10 in secondary school.  Stop that trend from happening, and then who knows what improvements might result?  Perhaps continuing to invest in girls’ education will help lay the foundation for the later pursuit of higher education and the slowing down of their population expansion.

Lastly, I have some comments about plotly:

  1. It was a big buzz kill to discover that I couldn’t embed my plotly plots into my wordpress blog.  Bummer.
  2. The plots that did result were very nice looking!
  3. I found myself getting impatient figuring out where to click to customize my plots to my liking.  There’s something about going from a command line to a gui which is odd for me!
  4. I initially thought it was cool that I could save a theme for my plot once I customized it to my liking.  I was disappointed to learn that the theme was not saved as an absolute set of visual characteristics.  For some reason those characteristics changed depending on the initial graph I imported from R and I could not just apply it and be done with it.
  5. I found myself wondering if there was a better way of getting my R graphs into plotly than the prescribed py$ggplotly() method.  It’s not my biggest complaint, but somehow I’d rather just have one command to batch upload all of my graphs.

I’ll be watching plotly as it evolves and hope to see it improve even more than it has already!  Good luck!

Predictive modelling fun with the caret package

I’m back!  6 months after my second child was born, I’ve finally made it back to my blog with something fun to write about.  I recently read through the excellent Machine Learning with R ebook and was impressed by the caret package and how easy it makes it to do predictive modelling that goes a little beyond the basics.

With that in mind, I went searching through the UCI machine learning repository and found a dataset about leaves that looked promising for a classification problem.  The dataset comprises leaves from almost 40 different plant species, with 14 numerical attributes describing each leaf.  It comes with a PDF file that shows pretty pictures of each leaf for the botanists out there, and some very mathematics-heavy descriptions of each of the attributes, which I couldn’t even hope to understand with my lack of education on the matter!

Seeing that it didn’t look overly complex to process, I decided to load it in and set up the overall training parameters:

library(caret)
leaf = read.csv("leaf.csv", colClasses = c(Class = "factor"))
ctrl = trainControl(method="repeatedcv", number=10, repeats=5, selectionFunction = "oneSE")
in_train = createDataPartition(leaf$Class, p=.75, list=FALSE)

First, I made sure that the Class variable remained a factor, even though it’s coded with integers in the incoming data.  This way, once I split the data into a test set, I won’t get any complaints about missing outcome levels if the sampling doesn’t happen to pick up one of them!

You’ll notice I’ve used repeated cross validation here, with 5 repeats, and the ‘oneSE’ selection function.  This ensures that whichever model I choose gets tested on 10 different parts of my data, repeated 5 times over, and the ‘oneSE’ function should then select a model that is not the most complex.  Finally, I use createDataPartition to create a training sample of 75% of the data.

trf = train(Class ~ Eccentricity + Aspect_Ratio + Elongation +
              Solidity + Stoch_Convexity + Isoperimetric + 
              Max_Ind_Depth + Lobedness + Avg_Intensity + 
              Avg_Contrast + Smoothness + Third_Moment + 
              Uniformity + Entropy, data=leaf, method="rf", metric="Kappa",
            trControl=ctrl, subset = in_train)

tgbm = train(Class ~ Eccentricity + Aspect_Ratio + Elongation +
              Solidity + Stoch_Convexity + Isoperimetric + 
              Max_Ind_Depth + Lobedness + Avg_Intensity + 
              Avg_Contrast + Smoothness + Third_Moment + 
              Uniformity + Entropy, data=leaf, method="gbm", metric="Kappa",
            trControl=ctrl, subset = in_train, verbose=FALSE)

I’ve chosen to use a random forest and a generalized boosted model to try to model leaf class.  Notice how I’ve referred to the training parameters in the trControl argument, and have selected the training subset by referring to in_train.  Also, the ‘verbose=FALSE’ argument in the gbm model is important!!  Let’s look at results:

For the trf model:

Random Forest
340 samples
15 predictors
30 classes: '1', '10', '11', '12', '13', '14', '15', '2', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '4', '5', '6', '7', '8', '9'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)

Summary of sample sizes: 228, 231, 233, 233, 232, 229, ...

Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
   2    0.7341953  0.7230754  0.07930583   0.08252806
   8    0.7513803  0.7409347  0.08873493   0.09237854
  14    0.7481404  0.7375215  0.08438226   0.08786254

Kappa was used to select the optimal model using the one SE rule.
The final value used for the model was mtry = 8.

So as you can see it’s selected a random forest model that tries 8 random predictors at each split, and it seems to be doing pretty well with a Kappa of .74. Now let’s move on to the next results:

For the tgbm model:

Stochastic Gradient Boosting 

340 samples
 15 predictors
 30 classes: '1', '10', '11', '12', '13', '14', '15', '2', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 226, 231, 229, 231, 228, 231, ... 

Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD  
  1                   50      0.6550713  0.6406862  0.07735511   0.08017461
  1                  100      0.6779153  0.6646128  0.07461615   0.07739666
  1                  150      0.6799633  0.6667613  0.08291638   0.08592416
  2                   50      0.7000791  0.6876577  0.08467911   0.08771728
  2                  100      0.6984858  0.6860858  0.08711523   0.09041647
  2                  150      0.6886874  0.6759011  0.09157694   0.09494201
  3                   50      0.6838721  0.6708396  0.08850382   0.09166051
  3                  100      0.6992044  0.6868055  0.08423577   0.08714577
  3                  150      0.6976292  0.6851841  0.08414035   0.08693979

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Kappa was used to select the optimal model using  the one SE rule.
The final values used for the model were n.trees = 50, interaction.depth = 2 and shrinkage = 0.1.

Here we see that it has chosen a gbm model with an interaction depth of 2 and 50 trees. This has a kappa of .69, which appears somewhat worse than the random forest model. Let’s do a direct comparison:

resampls = resamples(list(RF = trf,
                          GBM = tgbm))

difValues = diff(resampls)
summary(difValues)

Call:
summary.diff.resamples(object = difValues)

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

Accuracy 
    RF        GBM    
RF            0.05989
GBM 0.0003241        

Kappa 
    RF        GBM    
RF            0.06229
GBM 0.0003208  

Sure enough, the difference is statistically significant: the GBM model ends up being less accurate than the random forest model. Now let’s go to the testing stage! You’ll notice I’ve stuck with the random forest model.

test = leaf[-in_train,]
test$pred.leaf.rf = predict(trf, test, "raw")
confusionMatrix(test$pred.leaf.rf, test$Class)

...
Overall Statistics
                                         
               Accuracy : 0.7381         
                 95% CI : (0.6307, 0.828)
    No Information Rate : 0.0833         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.7277         
 Mcnemar's Test P-Value : NA      
...

Please excuse the ellipses above, as the confusionMatrix command generates voluminous output! Anyway, sure enough the Kappa statistic in the test sample was not that far off from the training sample (recall it was .74). Also of interest to me (perhaps it’s boring to you!) is the No Information Rate. Allow me to explain: the No Information Rate is the accuracy you could achieve with no model at all, simply by guessing the most frequent class in the testing sample for every record. Obviously an accuracy of .74 and a Kappa of .73 are way higher than the No Information Rate, and so I’m happy that the model is doing more than just making lucky guesses!
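(For the curious, you can verify the No Information Rate reported by confusionMatrix directly, since it’s just the prevalence of the most common class in the test set:)

# The No Information Rate is the proportion of the most frequent class
max(prop.table(table(test$Class)))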

Finally, caret has a function to calculate variable importance so that you can see which variables were the most informative in making distinctions between classes.  The results for the random forest model follow:

varImp(trf, scale=FALSE)
rf variable importance

                Overall
Solidity         31.818
Aspect_Ratio     26.497
Eccentricity     23.300
Elongation       23.231
Isoperimetric    20.001
Entropy          18.064
Lobedness        15.608
Max_Ind_Depth    14.828
Uniformity       14.092
Third_Moment     13.148
Stoch_Convexity  12.810
Avg_Intensity    12.438
Smoothness       10.576
Avg_Contrast      9.481

As I have very little clue what these variables mean from their descriptions, someone much wiser than me in all things botanical would have to chime in and educate me.

Well, that was good fun! If you have any ideas to keep the good times rolling and get even better results, please chime in by commenting :)

Data Until I Die: My blog title and statement of values

When I started keeping this Blog, my intent was to write about and keep helpful snippets of R code that I used in the line of work.  It was the start of my second job after grad school and I was really excited about getting to use R on a regular basis outside of academia!  Well, time went on and so did the number of posts I put on here.  After a while the posts related to work started to dip considerably.  Then, I found my third post grad-school job, which shifted the things I needed R for at work.  I still use R at work, but not for exactly the same things.

After I got my third job, I noticed that all my blog posts were for fun, and not for work.  That’s when I fiddled a little bit with my blog title to incorporate the concept of ‘fun’ in there.  Now that I’ve carried on these ‘for fun’ analyses which I’ve been posting every 1-2 months, I realize that it’s an obsession of mine that’s not going away.  Data, I realize, is a need of mine (probably more than sex, to be honest).  I work with data in the office, I play with data when I can find fun data sets every once in a while at home.

Data and data analysis are things that I can comfortably see myself doing for the rest of my life.  That’s why I decided to call this blog “Data Until I Die!”.  Sometimes I’ll post boring analyses, and sometimes I’ll post really interesting analyses.  The main thing is, this Blog is a good excuse to fuel my Data drive :)

Thanks for reading!

Ontario First Nations Libraries Compared Using Ontario Open Data

I recently downloaded a very cool dataset on Ontario libraries from the Ontario Open Data Catalogue.  The dataset contains 142 columns of information describing 386 libraries in Ontario, representing a fantastically massive data collection effort for such important cultural institutions (although the most recent information available is as of 2010).  One column which particularly caught my interest was “Library Service Type”, which breaks the libraries down into:

  • Public or Union Library (247)
  • LSB Library (4)
  • First Nations Library (43)
  • County, County co-operative or Regional Municipality Library (13)
  • Contracting Municipality (49)
  • Contracting LSB (14)

I saw the First Nations Library type and thought it would be really educational to compare First Nations libraries against all the other library types combined on some interesting indicators.  To make these comparisons in this post, I use a few violin plots; where you see more bulk in the plot, the value on the y axis is more likely for a library of that type than in the thinner parts.
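
If you haven’t made one before, violin plots come straight out of ggplot2’s geom_violin. The sketch below assumes a data frame called libraries with the isFN.Library grouping column (that name shows up in the summary tables below) and a hypothetical Resident.Population column:

library(ggplot2)

# Violin plot of local population size by library type, on a log scale
# so the huge "Other" libraries don't squash everything else
ggplot(libraries, aes(x = isFN.Library, y = Resident.Population)) +
  geom_violin() + scale_y_log10() +
  ggtitle("Local Population Sizes by Library Type")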

Our first comparison, shown below, reveals that local population sizes are a LOT more variable amongst the “Other” library types compared to First Nations libraries.  From first to third quartile, First Nations libraries tend to have around 250 to 850 local residents, whereas Other libraries tend to have around 1,110 to 18,530 local residents!

Local Population Sizes by Library Type

             isFN.Library 0%    25%  50%   75%    100%
1         Other Libraries 28 1113.5 5079 18529 2773000
2 First Nations Libraries 55  254.5  421   857   11297

Considering the huge difference in the population sizes that these libraries were made to serve, comparisons between library types need to be weighted according to those sizes, so that the comparisons are made proportionate.  In that spirit, the next plot compares the distribution of the number of cardholders per resident by library type.  Thinking about this metric for a moment, it’s possible that a person not living in the neighbourhood of the library can get a card there.  If all the residents of the library’s neighbourhood have a card, and there are people outside of that neighbourhood with cards, then a library could have over 1 cardholder per resident.

Looking at the plot, a couple of things become apparent: firstly, First Nations libraries appear more likely to be overloaded with cardholders (more cardholders than there are local residents: 14% of First Nations libraries vs. 4% of Other libraries).  On the lower end of the spectrum, First Nations libraries show a slight (non-significant) tendency towards fewer cardholders per resident than Other libraries.

Cardholders per Resident by Library Type

             isFN.Library 0%  25%  50%  75% 100%
1         Other Libraries  0 0.20 0.37 0.55  2.1
2 First Nations Libraries  0 0.19 0.32 0.77  2.8

Next we’ll look at a very interesting metric, because it looks so different when you compare it in its raw form to when you compare it in proportion to population size.  The plot below shows the distribution of English titles in circulation by library type.  It shouldn’t be too surprising that Other libraries, serving population sizes ranging from small to VERY large, also vary quite widely in the number of English titles in circulation (ranging from around 5,600 to 55,000, from first to third quartile).  On the other hand we have First Nations libraries, serving smaller population sizes, varying a lot less in this regard (from around 1,500 to 5,600 from first to third quartile).

Num English Titles in Circulation by Library Type

             isFN.Library 0%    25%   50%   75%   100%
1         Other Libraries  0 5637.5 21054 54879 924635
2 First Nations Libraries  0 1500.0  3800  5650  25180

Although the above perspective reveals that First Nations libraries tend to have considerably fewer English titles in circulation, things look pretty different when you weight this metric by the local population size.  Here, the plot for First Nations libraries looks very much like a Hershey’s Kiss, whereas the Other libraries plot looks a bit like a toilet plunger.  In other words, First Nations libraries tend to have more English titles in circulation per resident than Other libraries.  This doesn’t say anything about the quality of those books available in First Nations libraries.  For that reason, it would be nice to have a measure even as simple as median/average age/copyright date of the books in the libraries to serve as a rough proxy for the quality of the books sitting in each library.  That way, we’d know whether the books in these libraries are up to date, or antiquated.

English Titles in Circulation per Resident by Library Type

             isFN.Library 0%       25%      50%       75%      100%
1         Other Libraries  0 0.9245169 2.698802  5.179767 119.61462
2 First Nations Libraries  0 2.0614922 7.436399 13.387416  51.14423

For the next plot, I took all of the “per-person” values, and normed them.  That is to say, for any given value on the variables represented below, I subtracted from that value the minimum possible value, and then divided the result by the range of values on that measure.  Thus, any and all values close to 1 are the higher values, and those closer to 0 are the lower values.  I then took the median value (by library type) for each measure, and plotted below.  Expressed this way, flawed though it may be, we see that First Nations Libraries tend to spend more money per local resident, across areas, than Other libraries.  The revenue side looks a bit different.  While they tend to get more revenue per local resident, they appear to generate less self-generated revenue, get fewer donations, and get less money in local operating grants, all in proportion to the number of local residents.  The three areas where they are excelling (again, this is a median measure) are total operating revenue, provincial operating funding, and especially project grants.
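
The norming I’m describing is just a min-max rescaling to the 0-1 range, which in R looks something like this (column name hypothetical):

# Min-max rescale a vector so its values run from 0 to 1
norm01 = function(x) (x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))

libraries$Spend.per.Resident.Normed = norm01(libraries$Spend.per.Resident)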

Normed Costs and Revenues by Library Type

Here I decided to zero in on the distributional differences in net profit per resident by library type.  Considering that libraries are non-profit institutions, you would expect to see something similar to the plot shown for “Other” libraries, where the overwhelming majority are at or around the zero line.  It’s interesting to me then, especially since I work with non-profit institutions, to see the crazy variability in the First Nations libraries plot.  The upper end of this appears to be driven by some outrageously high outliers, so I decided to take them out and replot.

Net Profit per Resident Population

In the plot below, I’ve effectively zoomed in, and can see that there do seem to be more libraries showing a net loss per person than a net gain.

Normed Costs and Revenues by Library Type - Zoomed In

             isFN.Library      0%    25%   50%  75%   100%
1         Other Libraries -149.87  -0.49  0.00 1.16  34.35
2 First Nations Libraries  -76.55 -17.09 -0.88 0.40 250.54

I wanted to see this net profit per person measure mapped out across Ontario, so I used the wonderful ggmap package, which to my delight is Canadian friendly!  Go Canada!  In this first map, we see that First Nations libraries in Southern Ontario (the part of Ontario that looks like the head of a dragon) seem to be “okay” on this measure, with one library at the “neck” of the dragon seeming to take on a little more red of a shade, one further west taking on a very bright green, and a few closer to Manitoba appearing to be the worst.
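
For anyone curious about the mapping step, a rough ggmap sketch follows; the longitude, latitude, and net profit column names are all placeholders, and get_map assumes Google’s geocoding/tile service is reachable:

library(ggmap)
library(ggplot2)

# Base map of Ontario, then overlay the First Nations libraries,
# coloured by net profit per local resident (red = loss, green = gain)
ontario = get_map(location = "Ontario, Canada", zoom = 5)

ggmap(ontario) +
  geom_point(data = subset(libraries, isFN.Library == "First Nations Libraries"),
             aes(x = Longitude, y = Latitude, colour = Net.Profit.per.Resident),
             size = 3) +
  scale_colour_gradient2(low = "red", mid = "grey50", high = "green", midpoint = 0)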

Net Profit per Local Resident Amongst First Nations Libraries

To provide more visual clarity on these poorly performing libraries, I took away all libraries at or above zero on this measure.  Now there are fewer distractions, and it’s easier to see the worst performers.

Net Profit per Local Resident Amongst First Nations Libraries - in the red

Finally, as a sanity check, I re-expressed the above measure as a ratio of total operating revenue to total operating expenditure, to see if the resulting geographical pattern was similar.  Anything with a value of less than 1 is spending more than it makes in revenue, and is thus “in the red”.  While there are some differences in how the colours are arrayed across Ontario, the result is largely the same.

Operating Revenue to Cost Ratio Amongst First Nations Libraries

Finally, I have one last graph that does seem to show a good-news story.  When I looked at the ratio of annual program attendance to local population size, I found that First Nations libraries seem to attract more people every year, proportionate to population size, compared to Other libraries!  This might have something to do with the draw of a cultural institution in a small community, but feel free to tell me some first hand stories either running against this result, or confirming it if you will:

Annual Program Attendance by Library Type

             isFN.Library 0%   25%   50%   75%    100%
1         Other Libraries  0 0.018 0.155 0.307  8.8017
2 First Nations Libraries  0 0.113 0.357 2.686  21.361

That’s it for now! If you have any questions, or ideas for further analysis, don’t hesitate to drop me a line :)

As a final note, I think that it’s fantastic that this data collection was done, but the fact that the most recent data available is as of 2010 is very tardy.  What happened here?  Libraries are so important across the board, so please, Ontario provincial government, keep up the data collection efforts!

A Delicious Analysis! (aka topic modelling using recipes)

A few months ago, I saw a link on twitter to an awesome graph charting the similarities of different foods based on their flavour compounds, in addition to their prevalence in recipes (see the whole study, The Flavor Network and the Principles of Food Pairing).  I thought this was really neat and became interested in potentially using the data for something slightly different; to figure out which ingredients tended to correlate across recipes.  I emailed one of the authors, Yong-Yeol Ahn, who is a real mensch by the way, and he let me know that the raw recipe data is readily available on his website!

Given my goal of looking for which ingredients correlate across recipes, I figured this would be the perfect opportunity to use topic modelling (here I use Latent Dirichlet Allocation or LDA).  Usually in topic modelling you have a lot of filtering to do.  Not so with these recipe data, where all the words (ingredients) involved in the corpus are of potential interest, and there aren’t even any punctuation marks!  The topics coming out of the analysis would represent clusters of ingredients that co-occur with one another across recipes, and would possibly teach me something about cooking (of which I know precious little!).

All my code is at the bottom, so all you’ll find up here are graphs and my textual summary.  The first thing I did was to put the 3 raw recipe files together using python.  Each file consisted of one recipe per line, with the cuisine of the recipe as the first entry on the line, and all other entries (the ingredients) separated by tab characters.  In my python script, I separated out the cuisines from the ingredients, and created two files, one for the recipes, and one for the cuisines of the recipes.

Then I loaded up the recipes into R and got word/ingredient counts.  As you can see below, the 3 most popular ingredients were egg, wheat, and butter.  It makes sense, considering the fact that roughly 70% of all the recipes fall under the “American” cuisine.  I did this analysis for novelty’s sake, and so I figured I would take those ingredients out of the running before I continued on.  Egg makes me fart, wheat is not something I have at home in its raw form, and butter isn’t important to me for the purpose of this analysis!

Recipe Popularity of Top 30 Ingredients

Here are the top ingredients without the three filtered out ones:

Recipe Popularity of Top 30 Ingredients - No Egg Wheat or Butter
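
The LDA step itself is not much code with the topicmodels package; here’s a condensed sketch of how it could look (the file name, tokenization details, and seed are my assumptions, and my full code is at the bottom):

library(tm)
library(topicmodels)

# One recipe per line, ingredients separated by whitespace (cuisines already split out)
recipes = readLines("recipes_no_cuisine.txt")

corpus = VCorpus(VectorSource(recipes))
dtm = DocumentTermMatrix(corpus)
# (depending on your tm version you may need a custom tokenizer to keep
#  multi-word ingredients like black_pepper intact)

# Drop the three dominant ingredients before modelling
dtm = dtm[, !colnames(dtm) %in% c("egg", "wheat", "butter")]

# Drop any recipes left empty after filtering (LDA can't handle empty documents)
dtm = dtm[slam::row_sums(dtm) > 0, ]

# 50 topics, then the top 5 ingredients per topic
lda.fit = LDA(dtm, k = 50, control = list(seed = 123))
terms(lda.fit, 5)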

Finally, I ran the LDA, extracting 50 topics, and the top 5 most characteristic ingredients of each topic.  You can see the full complement of topics at the bottom of my post, but I thought I’d review some that I find intriguing.  You will, of course, find other topics intriguing, or some to be bizarre and inappropriate (feel free to tell me in the comment section).  First, topic 4:

[1] "tomato"  "garlic"  "oregano" "onion"   "basil"

Here’s a cluster of ingredients that seems decidedly Italian.  The ingredients seem to make perfect sense together, and so I think I’ll try them together next time I’m making pasta (although I don’t like tomatoes in their original form, just tomato sauce).

Next, topic 19:

[1] "vanilla" "cream"   "almond"  "coconut" "oat"

This one caught my attention, and I’m curious if the ingredients even make sense together.  Vanilla and cream makes sense… Adding coconut would seem to make sense as well.  Almond would give it that extra crunch (unless it’s almond milk!).  I don’t know whether it would be tasty however, so I’ll probably pass this one by.

Next, topic 20:

[1] "onion"         "black_pepper"  "vegetable_oil" "bell_pepper"   "garlic"

This one looks tasty!  I like spicy foods and so putting black pepper in with onion, garlic and bell pepper sounds fun to me!

Next, topic 23:

[1] "vegetable_oil" "soy_sauce"     "sesame_oil"    "fish"          "chicken"

Now we’re into the meaty zone!  I’m all for putting sauces/oils onto meats, but putting vegetable oil, soy sauce, and sesame oil together does seem like overkill.  I wonder whether soy sauce shows up with vegetable oil or sesame oil separately in recipes, rather than linking them all together in the same recipes.  I’ve always liked the extra salty flavour of soy sauce, even though I know it’s horrible for you as it has MSG in it.  I wonder what vegetable oil, soy sauce, and chicken would taste like.  Something to try, for sure!

Now, topic 26:

[1] "cumin"      "coriander"  "turmeric"   "fenugreek"  "lemongrass"

These are a whole lot of spices that I never use on my food.  Not for lack of wanting, but rather out of ignorance and laziness.  One of my co-workers recently commented that cumin adds a really nice flavour to food (I think she called it “middle eastern”).  I’ve never heard a thing about the other spices here, but why not try them out!

Next, topic 28:

[1] "onion"       "vinegar"     "garlic"      "lemon_juice" "ginger"

I tend to find that anything with an intense flavour can be very appetizing for me.  Spices, vinegar, and anything citric are what really register on my tongue.  So, this topic does look very interesting to me, probably as a topping or a sauce.  It’s interesting that ginger shows up here, as that neutralizes other flavours, so I wonder whether I’d include it in any sauce that I make?

Last one!  Topic 41:

[1] "vanilla"  "cocoa"    "milk"     "cinnamon" "walnut"

These look like the kinds of ingredients for a nice drink of some sort (would you crush the walnuts?  I’m not sure!)

Well, I hope you enjoyed this as much as I did!  It’s not a perfect analysis, but it definitely is a delicious one :)  Again, feel free to leave any comments about any of the ingredient combinations, or questions that you think could be answered with a different analysis!

UofT R session went well. Thanks RStudio Server!

Apart from going longer than I had anticipated, very little of any significance went wrong during my R session at UofT on Friday!  It took a while at the beginning for everyone to get set up.  Everyone was connecting to my home RStudio Server via UofT’s wireless network.  This meant that if any students weren’t set up to use the wireless in the first place (they get a username and password from the library, a UTORid), then they wouldn’t be able to connect, period.  For those students who were able to connect, I assigned each of them one of 30 usernames that I had laboriously set up on my machine the night before.

After connecting to my server, I then got them to click on the ‘data’ directory that I had set up in each of their home folders on my computer, to load up the data that I prepared for them (see last post).  I forgot that they needed to set the data directory as their working directory… whoops, that wasted some time!  After I realized that mistake, things went more smoothly.

We went over data import, data indexing (although I forgot about conditional indexing, which I use very often at work… d’oh!), merging, mathematical operations, some simple graphing (a histogram, scatterplot, and scatterplot matrix), summary stats, median splits, grouped summary stats using the awesome dplyr, and then nicer graphing using the qplot function from ggplot2.
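
For a flavour of what that looked like, here’s a minimal sketch of the kind of workflow we walked through; the data frame and column names below are placeholders, not the actual dataset I handed out:

library(dplyr)
library(ggplot2)

# Data import and a quick look
dat = read.csv("data/example.csv")
str(dat)
summary(dat)

# Simple graphing
hist(dat$score)
plot(score ~ age, data = dat)

# A median split, then grouped summary stats with dplyr
dat = dat %>% mutate(age.group = ifelse(age >= median(age), "older", "younger"))

dat %>% group_by(age.group) %>%
  summarise(mean.score = mean(score), n = n())

# Nicer graphing using the qplot function from ggplot2
qplot(age, score, data = dat, colour = age.group)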

I was really worried about being boring, but I found myself getting more and more energized as the session went on, and I think the students were interested as well!  I’m so glad that the RStudio Server I set up on my computer was able to handle all of those connections at once and that my TekSavvy internet connection didn’t crap out either :)  This is definitely an experience that I would like to have again.  Hurray!

Here’s a script of the analysis I went through:

Here’s the data:

http://bit.ly/MClPmK