Binary Classification – A Comparison of “Titanic” Proportions Between Logistic Regression, Random Forests, and Conditional Trees

Now that I’m on my winter break, I’ve been taking a little bit of time to read up on some modelling techniques that I’ve never used before. Two such techniques are Random Forests and Conditional Trees.  Since both can be used for classification, I decided to see how they compare against a simple binomial logistic regression (something I’ve worked with a lot) for binary classification.

The dataset I used contains records of the survival of Titanic passengers, along with information such as sex, age, fare paid, number of parents/children aboard, number of siblings/spouses aboard, passenger class, and other fields. (The titanic dataset can be retrieved from a page on Vanderbilt’s website that hosts lots of datasets; look for “titanic3”.)

I took one part of the dataset to train my models, and another part to test them.  The factors that I focused on were passenger class, sex, age, and number of siblings/spouses aboard.
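
For reference, here is a minimal sketch of that setup; the file location, the seed, and the exact split are placeholders rather than the exact code I used:

# Sketch: load the titanic3 data and split it into training and testing portions
titanic = read.csv("titanic3.csv")          # downloaded from the Vanderbilt datasets page
set.seed(42)                                # placeholder seed
train.rows = sample(nrow(titanic), 650)     # my training set ended up with 650 records
titanic.train = titanic[train.rows, ]
titanic.test2 = titanic[-train.rows, ]      # same test-set name as used in the comments below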

First, let’s look at the GLM:

titanic.survival.train = glm(survived ~ pclass + sex + pclass:sex + age + sibsp,
family = binomial(logit), data = titanic.train)

As you can see, I worked in an interaction effect between passenger class and sex, as passenger class showed a much bigger difference in survival rate amongst the women than amongst the men (i.e. higher-class women were much more likely to survive than lower-class women, whereas first-class men were more likely to survive than second- or third-class men, but not by the same margin as amongst the women).  Following is the model summary output, if you’re interested:

> summary(titanic.survival.train)

Call:
glm(formula = survived ~ pclass + sex + pclass:sex + age + sibsp, 
    family = binomial(logit), data = titanic.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9342  -0.6851  -0.5481   0.5633   2.3164  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     6.556807   0.878331   7.465 8.33e-14 ***
pclass         -1.928538   0.278324  -6.929 4.24e-12 ***
sexmale        -4.905710   0.785142  -6.248 4.15e-10 ***
age            -0.036462   0.009627  -3.787 0.000152 ***
sibsp          -0.264190   0.106684  -2.476 0.013272 *  
pclass:sexmale  1.124111   0.299638   3.752 0.000176 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 858.22  on 649  degrees of freedom
Residual deviance: 618.29  on 644  degrees of freedom
AIC: 630.29

Number of Fisher Scoring iterations: 5

So, after I used my model to predict survival probabilities on the testing portion of the dataset, I checked how many records showed a predicted probability over .5 (or 50%), and then how many of those records were actual survivors.  For the GLM, 146/164 (89%) of the records scored over 50% were actual survivors.  Not bad!
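
If you’re curious, a minimal sketch of that check follows; the test-set name titanic.test2 is the one I use in the comments below, and the exact commands may have differed slightly:

# Sketch: score the test set with the GLM and count predicted vs. actual survivors
titanic.test2$probs.glm = predict(titanic.survival.train, titanic.test2, type="response")
sum(titanic.test2$probs.glm > .5)                                 # records scored over 50% (164)
sum(titanic.test2$probs.glm > .5 & titanic.test2$survived == 1)   # of those, actual survivors (146)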

Now let’s move on to Random Forests:

> titanic.survival.train.rf = randomForest(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train, ntree=5000, importance=TRUE)

> titanic.survival.train.rf

Call:
 randomForest(formula = as.factor(survived) ~ pclass + sex + age +      sibsp, data = titanic.train, ntree = 5000, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 5000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 22.62%
Confusion matrix:
    0   1 class.error
0 370  38  0.09313725
1 109 133  0.45041322

> importance(titanic.survival.train.rf)
               0          1 MeanDecreaseAccuracy MeanDecreaseGini
pclass  67.26795 125.166721            126.40379         34.69266
sex    160.52060 221.803515            224.89038         62.82490
age     70.35831  50.568619             92.67281         53.41834
sibsp   60.84056   3.343251             52.82503         14.01936

It seems to me that the output indicates that the Random Forests model is better at producing true negatives than true positives with regard to the survival of the passengers, but when I asked for the predicted survival categories in the testing portion of my dataset, it appeared to do a pretty decent job predicting who would survive and who wouldn’t:

For the Random Forests model, 155/184 (84%) of those records predicted to survive actually did survive!  Again, not bad.
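
For reference, a minimal sketch of that prediction step, using the same placeholder test-set name as above:

# Sketch: predicted survival categories from the random forest on the test set
titanic.test2$pred.rf = predict(titanic.survival.train.rf, newdata=titanic.test2)
sum(titanic.test2$pred.rf == "1")                                 # records predicted to survive (184)
sum(titanic.test2$pred.rf == "1" & titanic.test2$survived == 1)   # of those, actual survivors (155)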

Finally, let’s move on to the Conditional Tree model:

> titanic.survival.train.ctree = ctree(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train)

> titanic.survival.train.ctree

	 Conditional inference tree with 7 terminal nodes

Response:  as.factor(survived) 
Inputs:  pclass, sex, age, sibsp 
Number of observations:  650 

1) sex == {female}; criterion = 1, statistic = 141.436
  2) pclass <= 2; criterion = 1, statistic = 55.831
    3) pclass <= 1; criterion = 0.988, statistic = 8.817
      4)*  weights = 74 
    3) pclass > 1
      5)*  weights = 47 
  2) pclass > 2
    6)*  weights = 105 
1) sex == {male}
  7) pclass <= 1; criterion = 1, statistic = 15.095
    8)*  weights = 88 
  7) pclass > 1
    9) age <= 12; criterion = 1, statistic = 14.851
      10) sibsp <= 1; criterion = 0.998, statistic = 12.062
        11)*  weights = 18 
      10) sibsp > 1
        12)*  weights = 12 
    9) age > 12
      13)*  weights = 306

[Figure: Titanic Conditional Tree plot]

I really like the graph you get by plotting the conditional tree model.  I find it pretty easy to understand.  As you can see, the model started by splitting the data on sex, which it found to be most significant, then pclass, age, and then siblings/spouses.  That being said, let’s look at how the prediction went for comparison purposes:
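
A minimal sketch of the plotting and prediction step, assuming the party package’s ctree and the same placeholder test-set name as above:

# Sketch: plot the conditional inference tree and get predicted classes on the test set
plot(titanic.survival.train.ctree)
titanic.test2$pred.ctree = predict(titanic.survival.train.ctree, newdata=titanic.test2)
sum(titanic.test2$pred.ctree == "1")                                 # records predicted to survive (142)
sum(titanic.test2$pred.ctree == "1" & titanic.test2$survived == 1)   # of those, actual survivors (134)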

134/142 (94%) of records predicted to survive actually did survive!  Great!!  Now let’s bring all those prediction stats into one table for a final look:

                     glm randomForests ctree
Actually Survived    146           155   134
Predicted to Survive 164           184   142
% Survived           89%           84%   94%

So, in terms of finding the highest raw number of folks who actually did survive, the Random Forests model won, but in terms of having the highest true positive rate, the Conditional Tree model takes the cake!  Neat exercise!

Notes:

(1) There were numerous records without an age value.  I estimated ages using a regression model fit to the records that did have ages.  Following are the coefficients for that model:

age ≈ 44.59238 - 5.98582*pclass + 0.15971*fare - 0.14141*(pclass*fare)
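
A minimal sketch of how that imputation can be set up; the exact code I used may have differed:

# Sketch: fit an age model on records with a known age, then fill in the missing ages
age.model = lm(age ~ pclass + fare + pclass:fare, data = titanic[!is.na(titanic$age), ])
missing.age = is.na(titanic$age)
titanic$age[missing.age] = predict(age.model, newdata = titanic[missing.age, ])   # assumes fare is present for those records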

(2) Although I used GLM predicted probabilities of over .5 to classify records as survived, I could have played with that cutoff to get different results.  I really just wanted to have some fun, and not try all possibilities here. :)

(3) I hope you enjoyed this post!  If you have anything to teach me about Random Forests or Conditional Trees, please leave a comment and be nice (I’m trying to expand my analysis skills here!!).

EDIT:
I realized that I should also post the false negative rate for each model, because it’s important to know how many people each model missed coding as survived. So, here are the stats:

                           glm    rf ctree   gbm
Actually Survived          112   103   124    74
Predicted Not to Survive   495   475   517   429
False Negative Rate      22.6% 21.7%   24% 17.2%

As you can see, the gbm model shows the lowest false negative rate, which is pretty nice! Compare that against the finding that it correctly predicted the survival of 184/230 (80%) records that were scored with a probability of 50% or higher, and that makes a pretty decent model.

28 thoughts on “Binary Classification – A Comparison of “Titanic” Proportions Between Logistic Regression, Random Forests, and Conditional Trees”

  1. Nice trick, I would recommend also looking at http://en.wikipedia.org/wiki/Gradient_boosting

    In short, it will try to classify survivors, but it adds the mistakes back into the earlier population. This way the complicated cases will be added into the mix, and the regression trees will pay more attention to them.
    I think it would fit right in with the random forest and the conditional trees, but your take would be interesting :)

    • Okay I checked out the gbm package and attempted to fit my training data using it. Here we go:

      titanic.survival.train.gbm = gbm(survived ~ pclass + sex + age + sibsp, data=titanic.train, distribution="bernoulli", n.trees=5000, verbose=TRUE)

      Using this model, I found the following summary stats (I really like that it breaks down relative influence into percents!):

           var   rel.inf
      1    sex 57.567932
      2 pclass 29.718241
      3    age  8.550585
      4  sibsp  4.163243

      Then finally, 184/230 (80%) of records scored as over 50% likely to survive actually did survive. Neat!!
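
      Roughly, the prediction step looked something like this (the test-set name is the same placeholder as in the post):

      # Sketch: gbm predicted probabilities on the test set, using all 5000 trees
      titanic.test2$probs.gbm = predict(titanic.survival.train.gbm, newdata=titanic.test2, n.trees=5000, type="response")
      sum(titanic.test2$probs.gbm > .5)                                 # records scored over 50% (230)
      sum(titanic.test2$probs.gbm > .5 & titanic.test2$survived == 1)   # of those, actual survivors (184)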

  2. For the GLM model:

    titanic.survival.train = glm(survived ~ pclass + sex + pclass:sex + age + sibsp,
    family = binomial(logit), data = titanic.train)

    What does pclass:sex mean? Are they multiplied together to create a cross term?

    Thanks!

    • Hi there,

      pclass by sex is an interaction term, whereby pclass is multiplied by sex. Pclass has 3 values (1,2,3) and sex has two (female/male, or 0,1). So when you multiply them together, you get 4 values: 0, 1, 2, 3. The values above 0 only show up when a record is marked as male, hence the variable title “pclass:sexmale” in the GLM output above.

      If there were no special effect for passenger class for males, then I think pclass:sexmale should pretty much look like pclass and therefore wouldn’t be significant. However there is a special effect for passenger class for males (it’s much smaller than that of the females), so therefore it is significant!
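
      Here’s a tiny made-up example of how R builds that column, just to show the encoding:

      # Toy data: pclass numeric, sex a two-level factor
      toy = data.frame(pclass = rep(1:3, 2), sex = factor(rep(c("female", "male"), each = 3)))
      model.matrix(~ pclass + sex + pclass:sex, data = toy)
      # the pclass:sexmale column is 0 for every female and equal to pclass for every male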

  3. Why are you giving your Logistic Regression the benefit of the interaction term, but none of the other models? Wouldn’t it be fair to either remove that interaction, or give it as a term for the other models as well?

    • I tried to put the interaction term in the other models as well, but they just ignore it. I think that by their nature, decision tree type models already discover interactions between variables without you having to specify them!

      I also tried to put the interaction term in the gbm model I posted in my comment above, but it came out as having 0 relative importance!

      • That is correct, decision tree type models do discover their own interactions, as would a support vector machine (another interesting one to try on this data…).

        Very interesting post! I’m going to play around with this data some myself, now.

      • Having the interaction in the GLM/LR model means you’re applying ‘expert’ knowledge – something that isn’t possible or as necessary in the tree models. How does your GLM model compare to the trees if you don’t include the expert knowledge of the interaction term?

      • Believe it or not, taking out the interaction term actually improved the prediction quite a bit:

        188/232 (81%) records with a score of over 50% were actual survivors, whereas 70/427 (16%) records with a score of 50% or below were actual survivors.

        So it has a pretty good true positive rate, and a pretty low false negative rate!

  4. Pingback: Who Survived on the Titanic? Predictive Classification with Parametric and Non-parametric Models « Political Methodology

    • Hi there,

      First I had to predict the probabilities for the test sample:
      titanic.test2$probs.glm = predict(titanic.survival.train, titanic.test2, type="response")

      Once I had those, then I could check how many showed a probability over .5:
      sum(titanic.test2$probs.glm > .5)

      And then I could get the number of those who were actual survivors:
      sum(titanic.test2$probs.glm > .5 & titanic.test2$survived == 1)

  5. Pingback: My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis | Data and Analysis with R, at Work

  6. Great post! Question: Why didn’t you treat pclass as a categorical variable in your logistic regression? Would results change? Any additional commentary regarding this detail would be much appreciated.

    • Hola chilito,

      If I modify the GLM I made for “thebigjc” (the one without the interaction term that was making my survival estimates too conservative) so that passenger class is a factor rather than a numeric variable, I don’t get too much improvement in the stats:

      193/240 (80.4%) of records with a score of over 50% actually survived, whereas 65/419 (16%) of records with a score of 50% or below were survivors. The true positive and false negative rates basically stay the same, even though the raw numbers change a little bit.

      I think that you can reasonably assume here that passenger class is an ordinal variable with a linear effect on survival rate, and therefore it’s not so bad to model it as numeric!
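
      The change itself is just swapping the numeric pclass for a factor, something like:

      # Sketch: the no-interaction GLM with passenger class treated as a factor
      glm(survived ~ factor(pclass) + sex + age + sibsp, family = binomial(logit), data = titanic.train)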

  7. Pingback: My Intro to Multiple Classification… « Another Word For It

  8. Pingback: Binary Classification – A Comparison of “Titanic” Proportions Between Logistic Regression, Random Forests, and Conditional Trees | Process Improvement Stuff I want to Know | Scoop.it

  9. Hi,
    I saw this on the R-Blogger’s email list back in December, and have finally gotten around to playing with this data!

    I’ll apologize in advance for the total noob question…after you ran the model on your training set, how did you run (or cross-validate) the model on your test set?

    FWIW, I found the “caret” package to be helpful with taking care of the mundane work of dividing up a sample into a training and test subset, and found the “pscl” package and its “hitmiss” function to easily compute the percentage correctly classified.

    Thanks again for such an interesting blog post!

    • Hi there,

      Thanks for your comment! I’ve never tried the packages you mentioned before. Breaking up datasets into train/test samples though is important to me. I’ll be sure to try the caret package next time I do some modelling!

  10. hi!
    can you please provide me with code for a “confusion matrix” (logistic regression) in R so I can finally finish my homework? :(

    Thanks,
    nicole

    • Install the R package “caret” and then type in
      ?confusionMatrix

      When you are done reading that, just use a two-step process: first, obtain predictions from your model, then use the confusionMatrix command on those predictions to get the stats on the accuracy of the model. So something like
      your.Model <- predict(model.developed.using.training.set, newdata=your.training.set) #that should be one line
      confusionMatrix(your.Model, true.state) #true.state would be something like "titanic$Survived"

      As always, Google is your friend.

  11. A few thoughts on your Random Forest model.

    1) Did you check the error convergence on your RF model by plotting the model object? It is a misconception that you cannot overfit a RF model. One sure way is to over-correlate the ensemble, and 5000 trees seems like a bit of overkill that may lead to an overfit problem.

    2) We have observed that model selection is important (Murphy et al., 2010; Evans et al., 2011). This is because RF does not have to account for noise in node splits, making plurality in the votes matrix much cleaner.

    3) The standard probability cutoff for a RF model (for the positive class) is p=0.65 (Evans & Cushman, 2008). I tested this critical value via simulation, when we came up with the posterior distribution method for deriving class probabilities, and found that p=0.65 is very stable and consistent with the Breiman (2001) votes threshold.

    4) Random Forest (and Conditional Tree) is non-linear and accounts for high-dimensional interactions. As such, it is a very bad idea to withhold data. This is because you have no idea how the data is being truncated in p-dimensional space. A nonlinear relationship can occur in a small portion of the distribution, and you may unintentionally remove the portion of the variable that is driving the process or defining the interaction. This is one of the problems in comparing linear and nonlinear models and should be a major consideration in interpreting results. Absolute accuracy can only tell you so much when modeling complex nonlinear data.

    5) How did the models compare in relation to the ROC/AUC and Kappa?

    • Hi Jeffrey,

      Thank you for your post/comment.

      You have stated that ‘it is a very bad idea to withhold data’, especially for non-linear models. Do you suggest we use 100% of the data in the training set? Or do you have other recommendations on obtaining the test data?

  12. I would also add that you have a bit of a class imbalance (zero inflation) problem, with 64% of observations in the “survived” column representing the negative class (n=1313, n[1]=449, n[0]=846).
