# My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis

After the work I did for my last post, I wanted to practice doing multiple classification.  I first thought of using the famous iris dataset, but felt that was a little boring.  Ideally, I wanted to look for a practice dataset where I could successfully classify data using both categorical and numeric predictors.  Unfortunately it was tough for me to find such a dataset that was easy enough for me to understand.

The dataset I use in this post comes from a textbook called Analyzing Categorical Data by Jeffrey S Simonoff, and lends itself to basically the same kind of analysis done by blogger “Wingfeet” in his post predicting authorship of Wheel of Time books.  In this case, the dataset contains counts of stop words (function words in English, such as “as”, “also, “even”, etc.) in chapters, or scenes, from books or plays written by Jane Austen, Jack London (I’m not sure if “London” in the dataset might actually refer to another author), John Milton, and William Shakespeare. Being a textbook example, you just know there’s something worth analyzing in it!!  The following table describes the numerical breakdown of books and chapters from each author:

 # Books # Chapters/Scenes Austen 5 317 London 6 296 Milton 2 55 Shakespeare 12 173

Overall, the dataset contains 841 rows and 71 columns.  69 of those columns are the counted stop words (wow!), 1 is for what’s called the “BookID”, and the last is for the Author.  I hope that the word counts are the number of times each word shows up per 100 words, or something that makes the counts comparable between authors/books.

The first thing I did after loading up the data into R was to create a training and testing set:

```> authorship\$randu = runif(841, 0,1)
> authorship.train = authorship[authorship\$randu < .4,]
> authorship.test = authorship[authorship\$randu >= .4,]```

Then I set out to try to predict authorship in the testing data set using a Random Forests model, a Conditional Inference Tree model, and a Linear Discriminant Analysis model.

Random Forests Model

Here’s the code I used to train the Random Forests model (after finding out that the word “one” seemed to not be too important for the classification):

`authorship.model.rf = randomForest(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have + her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train, ntree=5000, mtry=15, importance=TRUE)`

It seemed to me that the mtry argument shouldn’t be too low, as we are trying to discriminate between authors!  Following is a graph showing the Mean Decrease in Accuracy for each of the words in the Random Forests Model:

As you can see, some of the most important words for classification in this model were “was”, “be”, “to”, “the”, “her” and “had.  At this point, I can’t help but think of the ever famous “to be, or not to be” line, and can’t help but wonder if these are the sort of words that would show up more often in Shakespeare texts.  I don’t have the original texts to re-analyze, so I can only rely on what I have in this dataset to try to answer that question.  Doing a simple average of how often the word “be” shows up per chapter per author, I see that it shows up an average of 12.9 times per scene in Shakespeare, and an average of 20.2 times per chapter in Austen!  Shakespeare doesn’t win that game!!

Anyway, the Random Forests model does pretty well in classifying authors in the test set, as you can see in the counts and proportions tables below:

```> authorship.test\$pred.author.rf = predict(authorship.model.rf, authorship.test, type="response")
> table(authorship.test\$Author, authorship.test\$pred.author.rf)
> prop.table(table(authorship.test\$Author, authorship.test\$pred.author.rf),1)

Austen London Milton Shakespeare
Austen         182      4      0           0
London           1    179      0           1
Milton           0      1     33           2
Shakespeare      1      2      0         102

Austen      London      Milton Shakespeare
Austen      0.978494624 0.021505376 0.000000000 0.000000000
London      0.005524862 0.988950276 0.000000000 0.005524862
Milton      0.000000000 0.027777778 0.916666667 0.055555556
Shakespeare 0.009523810 0.019047619 0.000000000 0.971428571```

If you look on the diagonal, you can see that the model performs pretty well across authors.  It seems to do the worst with Milton (although still pretty darned well), but I think that should be expected due to the low number of books and chapters from him.

Conditional Inference Tree

Here’s the code I used to train the Conditional Inference Tree model:

`> authorship.model.ctree = ctree(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have +  her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train)`

Following is the plot that shows the significant splits done by the Conditional Inference Tree:

As is painfully obvious at first glance, there are so many end nodes that seeing the different end nodes in this graph is out of the question.  Still useful however are the ovals indicating what words formed the significant splits.  Similar to the Random Forests model, we see “be” and “was” showing up as the most significant words in discriminating between authors.  Other words it splits on don’t seem to be as high up on the list in the Random Forests model, but the end goal is prediction, right?

Here is how the Conditional Inference Tree model did in predicting authorship in the test set:

```> authorship.test\$pred.author.ctree = predict(authorship.model.ctree, authorship.test, type="response")
> table(authorship.test\$Author, authorship.test\$pred.author.ctree)
> prop.table(table(authorship.test\$Author, authorship.test\$pred.author.ctree),1)

Austen London Milton Shakespeare
Austen         173      8      1           4
London          18    148      3          12
Milton           0      6     27           3
Shakespeare      6     10      5          84

Austen      London      Milton Shakespeare
Austen      0.930107527 0.043010753 0.005376344 0.021505376
London      0.099447514 0.817679558 0.016574586 0.066298343
Milton      0.000000000 0.166666667 0.750000000 0.083333333
Shakespeare 0.057142857 0.095238095 0.047619048 0.800000000```

Overall it looks like the Conditional Inference Tree model is doing a worse job predicting authorship compared with the Random Forests model (again, looking at the diagonal).  Again we see the Milton records popping up as having the lowest hit rate for classification, but I think it’s interesting/sad that only 80% of Shakespeare records were correctly classified.  Sorry old man, it looks like this model thinks you’re a little bit like Jack London, and somewhat like Jane Austen and John Milton.

Linear Discriminant Analysis

Finally we come to the real star of this particular show.  Here’s the code I used to train the model:

`> authorship.model.lda = lda(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have +  her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train)`

There’s a lot of summary information that the lda function spits out by default, so I won’t post it here, but I thought the matrix scatterplot of the records plotted along the 3 linear discriminants looked pretty great, so here it is:

From this plot you can see that putting all the words in the linear discriminant model seems to have led to pretty good discrimination between authors.  However, it’s in the prediction that you see this model shine:

```> authorship.test\$pred.author.lda = predict(authorship.model.lda, authorship.test, type="response")
> authorship.test\$pred.author.lda = predict(authorship.model.lda, authorship.test)\$class
> table(authorship.test\$Author, authorship.test\$pred.author.lda)
> prop.table(table(authorship.test\$Author, authorship.test\$pred.author.lda),1)

Austen London Milton Shakespeare
Austen         185      1      0           0
London           1    180      0           0
Milton           0      0     36           0
Shakespeare      0      0      0         105

Austen      London      Milton Shakespeare
Austen      0.994623656 0.005376344 0.000000000 0.000000000
London      0.005524862 0.994475138 0.000000000 0.000000000
Milton      0.000000000 0.000000000 1.000000000 0.000000000
Shakespeare 0.000000000 0.000000000 0.000000000 1.000000000```

As you can see, it performed flawlessly with Milton and Shakespeare, and almost flawlessly with Austen and London.

Looking at an explanation of LDA from dtreg I’m thinking that LDA is performing best here because the discriminant functions created are more sensitive to the boundaries between authors (defined here by stop word counts) than the binary splits made on the predictor variables in the decision tree models.  Does this hold for all cases of classification where the predictor variables are numeric, or does it break down if the normality of the predictors is grossly violated?  Feel free to give me a more experienced/learned opinion in the comments!

# Binary Classification – A Comparison of “Titanic” Proportions Between Logistic Regression, Random Forests, and Conditional Trees

Now that I’m on my winter break, I’ve been taking a little bit of time to read up on some modelling techniques that I’ve never used before. Two such techniques are Random Forests and Conditional Trees.  Since both can be used for classification, I decided to see how they compare against a simple binomial logistic regression (something I’ve worked with a lot) for binary classification.

The dataset I used contains records of the survival of Titanic Passengers and such information as sex, age, fare each person paid, number of parents/children aboard, number of siblings or spouses aboard, passenger class and other fields (The titanic dataset can be retrieved from a page on Vanderbilt’s website replete with lots of datasets; look for “titanic3”).

I took one part of the dataset to train my models, and another part to test them.  The factors that I focused on were passenger class, sex, age, and number of siblings/spouses aboard.

First, let’s look at the GLM:

```titanic.survival.train = glm(survived ~ pclass + sex + pclass:sex + age + sibsp,
family = binomial(logit), data = titanic.train)```

As you can see, I worked in an interaction effect between passenger class and sex, as passenger class showed a much bigger difference in survival rate amongst the women compared to the men (i.e. Higher class women were much more likely to survive than lower class women, whereas first class Men were more likely to survive than 2nd or 3rd class men, but not by the same margin as amongst the women).  Following is the model summary output, if you’re interested:

```> summary(titantic.survival.train)

Call:
glm(formula = survived ~ pclass + sex + pclass:sex + age + sibsp,
family = binomial(logit), data = titanic.train)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-2.9342  -0.6851  -0.5481   0.5633   2.3164

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)     6.556807   0.878331   7.465 8.33e-14 ***
pclass         -1.928538   0.278324  -6.929 4.24e-12 ***
sexmale        -4.905710   0.785142  -6.248 4.15e-10 ***
age            -0.036462   0.009627  -3.787 0.000152 ***
sibsp          -0.264190   0.106684  -2.476 0.013272 *
pclass:sexmale  1.124111   0.299638   3.752 0.000176 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 858.22  on 649  degrees of freedom
Residual deviance: 618.29  on 644  degrees of freedom
AIC: 630.29

Number of Fisher Scoring iterations: 5```

So, after I used my model to predict survival probabilities on the testing portion of the dataset, I checked to see how many records showed a probability of over .5 (or 50%), and then how many of those records were actual survivors.  For the GLM, 146/164 (89%) of those records scored at 50% or higher were actual survivors.  Not bad!

Now let’s move on to Random Forests:

```> titanic.survival.train.rf = randomForest(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train,ntree=5000, importance=TRUE)

> titanic.survival.train.rf

Call:
randomForest(formula = as.factor(survived) ~ pclass + sex + age +      sibsp, data = titanic.train, ntree = 5000, importance = TRUE)
Type of random forest: classification
Number of trees: 5000
No. of variables tried at each split: 2

OOB estimate of  error rate: 22.62%
Confusion matrix:
0   1 class.error
0 370  38  0.09313725
1 109 133  0.45041322

> importance(titanic.survival.train.rf)
0          1 MeanDecreaseAccuracy MeanDecreaseGini
pclass  67.26795 125.166721            126.40379         34.69266
sex    160.52060 221.803515            224.89038         62.82490
age     70.35831  50.568619             92.67281         53.41834
sibsp   60.84056   3.343251             52.82503         14.01936```

It seems to me that the output indicates that the Random Forests model is better at creating true negatives than true positives, with regards to survival of the passengers, but when I asked for the predicted survival categories in the testing portion of my dataset, it appeared to do a pretty decent job predicting who would survive and who wouldn’t:

For the Random Forests model, 155/184 (84%) of those records predicted to survive actually did survive!  Again, not bad.

Finally, lets move on to the Conditional Tree model:

```> titanic.survival.train.ctree = ctree(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train)

> titanic.survival.train.ctree

Conditional inference tree with 7 terminal nodes

Response:  as.factor(survived)
Inputs:  pclass, sex, age, sibsp
Number of observations:  650

1) sex == {female}; criterion = 1, statistic = 141.436
2) pclass <= 2; criterion = 1, statistic = 55.831
3) pclass <= 1; criterion = 0.988, statistic = 8.817
4)*  weights = 74      3) pclass > 1
5)*  weights = 47
2) pclass > 2
6)*  weights = 105
1) sex == {male}
7) pclass <= 1; criterion = 1, statistic = 15.095
8)*  weights = 88
7) pclass > 1
9) age <= 12; criterion = 1, statistic = 14.851
10) sibsp <= 1; criterion = 0.998, statistic = 12.062               11)*  weights = 18
10) sibsp > 1
12)*  weights = 12
9) age > 12
13)*  weights = 306```

I really happen to like the graph output by plotting the conditional tree model.  I find it pretty easy to understand.  As you can see, the model started the split of the data according to sex, which it found to be most significant, then pclass, age, and then siblings/spouses.  That being said, let’s look at how the prediction went for comparison purposes:

134/142 (94%) of records predicted to survive actually did survive!  Great!!  Now let’s bring all those prediction stats into one table for a final look:

```                     glm randomForests ctree
Actually Survived    146           155   134
Predicted to Survive 164           184   142
% Survived           89%           84%   94%```

So, in terms of finding the highest raw number of folks who actually did survive, the Random Forests model won, but in terms of having the highest true positive rate, the Conditional Tree model takes the cake!  Neat exercise!

Notes:

(1) There were numerous records without an age value.  I estimated ages using a regression model taken from those who did have ages.  Following are the coefficients for that model:

44.59238 + (-5.98582*pclass) + (.15971*fare) + (-.14141*pclass:fare)

(2) Although I used GLM scores of over .5 to classify records as survived, I could have played with that to get different results.  I really just wanted to have some fun, and not try all possibilities here:)

(3) I hope you enjoyed this post!  If you have anything to teach me about Random Forests or Conditional Trees, please leave a comment and be nice (I’m trying to expand my analysis skills here!!).

EDIT:
I realized that I should also post the false negative rate for each model, because it’s important to know how many people each model missed coding as survived. So, here are the stats:

```                           glm    rf ctree   gbm
Actually Survived          112   103   124    74
Predicted Not to Survive   495   475   517   429
False Negative Rate      22.6% 21.7%   24% 17.2%```

As you can see, the gbm model shows the lowest false negative rate, which is pretty nice! Compare that against the finding that it correctly predicted the survival of 184/230 (80%) records that were scored with a probability of 50% or higher, and that makes a pretty decent model.