An R Enthusiast Goes Pythonic!

I’ve spent so many years using and broadcasting my love for R while using Python only minimally. Having recently read about machine learning in Python, I decided to take on a fun little ML project using Python from start to finish.

What follows below takes advantage of a neat dataset from the UCI Machine Learning Repository.  The data contain math test performance for 395 students in 2 Portuguese schools.  What’s neat about this data set is that in addition to grades on the students’ 3 math tests, they managed to collect a whole whack of demographic (and some behavioural) variables as well.  That led me to the question of how well you can predict final math test performance based on demographics and behaviour alone.  In other words, who is likely to do well, and who is likely to tank?

I have to admit before I continue that I initially intended to do this analysis in Python alone, but I actually felt lost three quarters of the way through and just did the whole darned thing in R.  Once I had completed the analysis in R to my liking, I went back to my Python analysis and continued until I finished to my reasonable satisfaction.  For that reason, for each step in the analysis, I will show you the code I used in Python, the results, and then the same thing in R.  Do not treat this as a comparison of Python’s machine learning capabilities versus R per se.  Please treat this as a comparison of my understanding of how to do machine learning in Python versus R!

Without further ado, let’s start with some import statements in Python and library statements in R:

#Python Code
from pandas import *
from matplotlib import *
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.pyplot as plt
# I did this in an IPython notebook; the next line makes the graphs show up inline.
%matplotlib inline
import statsmodels.formula.api as smf
from scipy import stats
from numpy.random import uniform
from numpy import arange
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
mat_perf = read_csv('/home/inkhorn/Student Performance/student-mat.csv', delimiter=';')

I’d like to comment on the number of import statements I found myself writing in this Python script. Eleven!! Is that even normal? Note the smaller number of library statements in my R code block below:

#R Code
library(ggplot2)
library(dplyr)
library(ggthemr)
library(caret)
ggthemr('flat') # I love ggthemr!
mat_perf = read.csv('student-mat.csv', sep = ';')

Now let’s do a quick plot of our target variable, scores on the students’ final math test, named ‘G3’.

#Python Code
sns.set_palette("deep", desat=.6)
sns.set_context(context='poster', font_scale=1)
sns.set_context(rc={"figure.figsize": (8, 4)})
plt.hist(mat_perf.G3)
plt.xticks(range(0,22,2))

Distribution of Final Math Test Scores (“G3”)
Python Hist - G3

That looks pretty pleasing to my eyes. Now let’s see the code for the same thing in R (I know, the visual theme is different; so sue me!).

#R Code
ggplot(mat_perf) + geom_histogram(aes(x=G3), binwidth=2)

Hist - G3

You’ll notice that I didn’t need to tweak any palette or font size parameters for the R plot, because I used the very fun ggthemr package. You choose the visual theme you want, declare it early on, and then all subsequent plots will share the same theme! There is one setting I’ve hidden, however, that modifies the figure height and width. I set the figure size using R Markdown; otherwise I would have sized it manually using the export menu in the plot frame in RStudio.  I think both plots look pretty nice, although I’m very partial to working with ggthemr!

Univariate estimates of variable importance for feature selection

Below, what I’ve done in both languages is to cycle through each variable in the dataset (excepting prior test scores), insert the variable name into a dictionary/list, and get a measure of how predictive that variable is, alone, of the final math test score (variable G3). If the variable is qualitative then I get an F score from an ANOVA, and if it’s quantitative then I get a t score from the regression.

In the case of Python, both are achieved using the ols function from the statsmodels package. In the case of R, I’ve used the aov function for qualitative variables and the lm function for quantitative ones. The numerical outcome, as you’ll see from the graphs, is the same.

#Python Code
test_stats = {'variable': [], 'test_type' : [], 'test_value' : []}

for col in mat_perf.columns[:-3]:
    test_stats['variable'].append(col)
    if mat_perf[col].dtype == 'O':
        # Do ANOVA
        aov = smf.ols(formula='G3 ~ C(' + col + ')', data=mat_perf, missing='drop').fit()
        test_stats['test_type'].append('F Test')
        test_stats['test_value'].append(round(aov.fvalue,2))
    else:
        # Do correlation
        print col + '\n'
        model = smf.ols(formula='G3 ~ ' + col, data=mat_perf, missing='drop').fit()
        value = round(model.tvalues[1],2)
        test_stats['test_type'].append('t Test')
        test_stats['test_value'].append(value)

test_stats = DataFrame(test_stats)
test_stats.sort(columns='test_value', ascending=False, inplace=True)

#R Code
test.stats = list(test.type = c(), test.value = c(), variable = c())

for (i in 1:30) {
  test.stats$variable[i] = names(mat_perf)[i]
  if (is.factor(mat_perf[,i])) {
    anova = summary(aov(G3 ~ mat_perf[,i], data=mat_perf))
    test.stats$test.type[i] = "F test"
    test.stats$test.value[i] = unlist(anova)[7]
  }
  else {
    reg = summary(lm(G3 ~ mat_perf[,i], data=mat_perf))
    test.stats$test.type[i] = "t test"
    test.stats$test.value[i] = reg$coefficients[2,3]
  }

}

test.stats.df = arrange(data.frame(test.stats), desc(test.value))
test.stats.df$variable = reorder(test.stats.df$variable, -test.stats.df$test.value)

And now for the graphs. Again you’ll see a bit more code for the Python graph vs the R graph. Perhaps someone will be able to show me code that doesn’t involve as many lines, or maybe it’s just the way things go with graphing in Python. Feel free to educate me 🙂

#Python Code
f, (ax1, ax2) = plt.subplots(2,1, figsize=(48,18), sharex=False)
sns.set_context(context='poster', font_scale=1)
sns.barplot(x='variable', y='test_value', data=test_stats.query("test_type == 'F Test'"), hline=.1, ax=ax1, x_order=[x for x in test_stats.query("test_type == 'F Test'")['variable']])
ax1.set_ylabel('F Values')
ax1.set_xlabel('')

sns.barplot(x='variable', y='test_value', data=test_stats.query("test_type == 't Test'"), hline=.1, ax=ax2, x_order=[x for x in test_stats.query("test_type == 't Test'")['variable']])
ax2.set_ylabel('t Values')
ax2.set_xlabel('')

sns.despine(bottom=True)
plt.tight_layout(h_pad=3)

Python Bar Plot - Univariate Estimates of Variable Importance

#R Code
ggplot(test.stats.df, aes(x=variable, y=test.value)) +
  geom_bar(stat="identity") +
  facet_grid(.~test.type ,  scales="free", space = "free") +
  theme(axis.text.x = element_text(angle = 45, vjust=.75, size=11))

Bar plot - Univariate Estimates of Variable Importance

As you can see, the estimates that I generated in both languages were thankfully the same. My next thought was to use only those variables with a test value (F or t) of 3.0 or higher. What you’ll see below is that this led to a pretty severe decrease in predictive power compared to being liberal with feature selection.

In reality, the feature selection I use below shouldn’t be necessary at all given the size of the data set vs the number of predictors, and the statistical method that I’m using to predict grades (random forest). What’s more, my feature selection method in fact led me to reject certain variables which I later found to be important in my expanded models! For this reason it would be nice to have a scalable multivariate feature selection method in my tool belt (I’ve been reading a bit about Boruta but am skeptical about how well it scales up); a rough, untested sketch of what that might look like appears just below. Enough blathering, and on with the model training:
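
First, though, that promised Boruta sketch. The assumption that the predictors sit in columns 1 through 30 (with G3 as the target) comes from the R loop above; I haven’t validated this on the present data.

#R Code (hedged sketch, not part of the original analysis)
library(Boruta)
set.seed(1)
# Boruta repeatedly compares each predictor's Random Forest importance against
# shuffled "shadow" copies of the predictors, so it screens variables jointly
# rather than one at a time.
bor = Boruta(mat_perf[, 1:30], mat_perf$G3, maxRuns = 100)
print(bor)                  # which predictors were Confirmed / Tentative / Rejected
getSelectedAttributes(bor)  # names of the confirmed predictors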

Training the First Random Forest Model

#Python code
usevars =  [x for x in test_stats.query("test_value >= 3.0 | test_value <= -3.0")['variable']]
mat_perf['randu'] = uniform(0, 1, mat_perf.shape[0])  # uniform random draws, used below to split into training and testing sets

mp_X = mat_perf[usevars]
mp_X_train = mp_X[mat_perf['randu'] <= .67]
mp_X_test = mp_X[mat_perf['randu'] > .67]

mp_Y_train = mat_perf.G3[mat_perf['randu'] <= .67]
mp_Y_test = mat_perf.G3[mat_perf['randu'] > .67]

# for the training set
cat_cols = [x for x in mp_X_train.columns if mp_X_train[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_train[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_train = concat([mp_X_train, new_cols], axis=1)

# for the testing set
cat_cols = [x for x in mp_X_test.columns if mp_X_test[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_test[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_test = concat([mp_X_test, new_cols], axis=1)

mp_X_train.drop(cat_cols, inplace=True, axis=1)
mp_X_test.drop(cat_cols, inplace=True, axis=1)

rf = RandomForestRegressor(bootstrap=True,
           criterion='mse', max_depth=2, max_features='auto',
           min_density=None, min_samples_leaf=1, min_samples_split=2,
           n_estimators=100, n_jobs=1, oob_score=True, random_state=None,
           verbose=0)
rf.fit(mp_X_train, mp_Y_train)

After I got past the part where I constructed the training and testing sets (with “unimportant” variables filtered out), I ran into a real annoyance. I learned that categorical variables need to be converted to dummy variables before you do the modeling (each level of the categorical variable gets its own variable containing 1s and 0s; 1 means that the level was present in that row and 0 means that it was not; so-called “one-hot encoding”). I suppose you could argue that this puts less computational demand on the modeling procedures, but when you’re dealing with tree based ensembles I think this is a drawback. Let’s say you have a categorical variable with 5 levels, “a” through “e”. It just so happens that when you compare a split on that categorical variable where “abc” is on one side and “de” is on the other side, there is a very significant difference in the dependent variable. How is one-hot encoding going to capture that? And then, your dataset, which had a certain number of columns, now has 5 additional columns due to the encoding. “Blah” I say!
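
To make the column-explosion point concrete, here’s a toy sketch in R (the factor and its levels are made up). This is essentially what get_dummies does on the Python side, and it’s the step that R’s tree functions let you skip, since they accept factors directly.

#R Code (toy sketch, just to illustrate the encoding)
toy = data.frame(grade = factor(c("a", "b", "c", "d", "e")))
# One indicator column per level, filled with 1s and 0s -- five new columns from one factor:
model.matrix(~ grade - 1, data = toy)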

Anyway, as you can see above, I used the get_dummies function in order to do the one-hot encoding. Also, you’ll see that I’ve assigned two thirds of the data to the training set and one third to the testing set. Now let’s see the same steps in R:

#R Code
keep.vars = match(filter(test.stats.df, abs(test.value) >= 3)$variable, names(mat_perf))
ctrl = trainControl(method="repeatedcv", number=10, selectionFunction = "oneSE")
mat_perf$randu = runif(395)
test = mat_perf[mat_perf$randu > .67,]
trf = train(mat_perf[mat_perf$randu <= .67,keep.vars], mat_perf$G3[mat_perf$randu <= .67],
            method="rf", metric="RMSE", data=mat_perf,
            trControl=ctrl, importance=TRUE)

Wait a minute. Did I really just train a Random Forest model in R, do cross validation, and prepare a testing data set with 5 commands!?!? That was a lot easier than doing those same preparations, without any cross validation, in Python! I did in fact try to figure out cross validation in sklearn, but then I had problems accessing the variable importances afterwards. I do like the caret package 🙂 Next, let’s see how each of the models did on its testing set:

Testing the First Random Forest Model

#Python Code
y_pred = rf.predict(mp_X_test)
sns.set_context(context='poster', font_scale=1)
first_test = DataFrame({"pred.G3.keepvars" : y_pred, "G3" : mp_Y_test})
sns.lmplot("G3", "pred.G3.keepvars", first_test, size=7, aspect=1.5)
print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred)[0]**2
print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred))

Python Scatter Plot - First Model Pred vs Actual

R^2 value of 0.104940038879
RMSE of 4.66552400292

Here, as in all cases when making a prediction using sklearn, I use the predict method to generate the predicted values from the model using the testing set, and then plot the predictions (“pred.G3.keepvars”) vs the actual values (“G3”) using the lmplot function. I like the syntax that the lmplot function from the seaborn package uses, as it is simple and familiar to me from the R world (the arguments consist of the X variable, the Y variable, the dataset name, and other aesthetic arguments). As you can see from the graph above and from the R^2 value, this model kind of sucks. Another thing I like here is the quality of the graph that seaborn outputs. It’s nice! It looks pretty modern, the text is very readable, and nothing looks edgy or pixelated in the plot. Okay, now let’s look at the code and output in R, using the same predictors.

#R Code
test$pred.G3.keepvars = predict(trf, test, "raw")
cor.test(test$G3, test$pred.G3.keepvars)$estimate[[1]]^2
summary(lm(test$G3 ~ test$pred.G3.keepvars))$sigma
ggplot(test, aes(x=G3, y=pred.G3.keepvars)) + geom_point() + stat_smooth(method="lm") + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

Scatter Plot - First Model Pred vs Actual

R^2 value of 0.198648
RMSE of 4.148194

Well, it looks like this model sucks a bit less than the Python one. Quality-wise, the plot looks super nice (thanks again, ggplot2 and ggthemr!) although by default the alpha parameter is not set to account for overplotting. The docs page for ggplot2 suggests setting alpha=.05, but for this particular data set, setting it to .5 seems to be better.
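
For reference, the overplotting tweak mentioned above is just a matter of passing alpha into geom_point, using the objects already created:

#R Code
ggplot(test, aes(x=G3, y=pred.G3.keepvars)) + geom_point(alpha=.5) + stat_smooth(method="lm") + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))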

Finally for this section, let’s look at the variable importances generated for each training model:

#Python Code
importances = DataFrame({'cols':mp_X_train.columns, 'imps':rf.feature_importances_})
print importances.sort(['imps'], ascending=False)

             cols      imps
3        failures  0.641898
0            Medu  0.064586
10          sex_F  0.043548
19  Mjob_services  0.038347
11          sex_M  0.036798
16   Mjob_at_home  0.036609
2             age  0.032722
1            Fedu  0.029266
15   internet_yes  0.016545
6     romantic_no  0.013024
7    romantic_yes  0.011134
5      higher_yes  0.010598
14    internet_no  0.007603
4       higher_no  0.007431
12        paid_no  0.002508
20   Mjob_teacher  0.002476
13       paid_yes  0.002006
18     Mjob_other  0.001654
17    Mjob_health  0.000515
8       address_R  0.000403
9       address_U  0.000330

#R Code
varImp(trf)

## rf variable importance
## 
##          Overall
## failures 100.000
## romantic  49.247
## higher    27.066
## age       17.799
## Medu      14.941
## internet  12.655
## sex        8.012
## Fedu       7.536
## Mjob       5.883
## paid       1.563
## address    0.000

My first observation is that it was obviously easier for me to get the variable importances in R than it was in Python. Next, you’ll certainly see the symptom of the dummy coding I had to do for the categorical variables. That’s no fun, but we’ll survive through this example analysis, right? Now let’s look at which variables made it to the top:

Whereas failures, mother’s education level, sex, and mother’s job made it to the top of the list for the Python model, the R model’s top four (failures, romantic, higher, and age) shared only failures with it.

With the understanding that the variable selection method that I used was inappropriate, let’s move on to training a Random Forest model using all predictors except the prior 2 test scores. Since I’ve already commented above on my thoughts about the various steps in the process, I’ll comment only on the differences in results in the remaining sections.

Training and Testing the Second Random Forest Model

#Python Code

#aav = almost all variables
mp_X_aav = mat_perf[mat_perf.columns[0:30]]
mp_X_train_aav = mp_X_aav[mat_perf['randu'] <= .67]
mp_X_test_aav = mp_X_aav[mat_perf['randu'] > .67]

# for the training set
cat_cols = [x for x in mp_X_train_aav.columns if mp_X_train_aav[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_train_aav[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_train_aav = concat([mp_X_train_aav, new_cols], axis=1)
    
# for the testing set
cat_cols = [x for x in mp_X_test_aav.columns if mp_X_test_aav[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_test_aav[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_test_aav = concat([mp_X_test_aav, new_cols], axis=1)

mp_X_train_aav.drop(cat_cols, inplace=True, axis=1)
mp_X_test_aav.drop(cat_cols, inplace=True, axis=1)

rf_aav = RandomForestRegressor(bootstrap=True, 
           criterion='mse', max_depth=2, max_features='auto',
           min_density=None, min_samples_leaf=1, min_samples_split=2,
           n_estimators=100, n_jobs=1, oob_score=True, random_state=None,
           verbose=0)
rf_aav.fit(mp_X_train_aav, mp_Y_train)

y_pred_aav = rf_aav.predict(mp_X_test_aav)
second_test = DataFrame({"pred.G3.almostallvars" : y_pred_aav, "G3" : mp_Y_test})
sns.lmplot("G3", "pred.G3.almostallvars", second_test, size=7, aspect=1.5)
print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred_aav)[0]**2
print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred_aav))

Python Scatter Plot - Second Model Pred vs Actual

R^2 value of 0.226587731888
RMSE of 4.3338674965

Compared to the first Python model, the R^2 on this one is more than twice as high (the first R^2 was .10494) and the RMSE is 7.1% lower (the first was 4.6655). The predicted vs actual plot confirms that the predictions still don’t look fantastic compared to the actuals, which is probably the main reason why the RMSE hasn’t decreased by that much. Now to the R code using the same predictors:

#R code
trf2 = train(mat_perf[mat_perf$randu <= .67,1:30], mat_perf$G3[mat_perf$randu <= .67],
            method="rf", metric="RMSE", data=mat_perf,
            trControl=ctrl, importance=TRUE)
test$pred.g3.almostallvars = predict(trf2, test, "raw")
cor.test(test$G3, test$pred.g3.almostallvars)$estimate[[1]]^2
summary(lm(test$G3 ~ test$pred.g3.almostallvars))$sigma
ggplot(test, aes(x=G3, y=pred.g3.almostallvars)) + geom_point() + 
  stat_smooth() + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

Scatter Plot - Second Model Pred vs Actual

R^2 value of 0.3262093
RMSE of 3.8037318

Compared to the first R model, the R^2 on this one is approximately 1.64 times as high (the first R^2 was .19865) and the RMSE is 8.3% lower (the first was 4.148194). Although this particular model is indeed doing better at predicting values in the test set than the one built in Python using the same variables, I would still hesitate to assume that the process is inherently better for this data set. Due to the randomness inherent in Random Forests, one run of the training could be lucky enough to give results like the above, whereas other times the results might even be slightly worse than what I managed to get in Python. I confirmed this: in fact most additional runs of this model in R resulted in an R^2 of around .20 and an RMSE of around 4.2.
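
Here’s a minimal sketch of the kind of re-run check I’m describing, assuming the objects created earlier (ctrl, test, and the randu column) are still in memory; five re-runs is an arbitrary choice, and I’m using the same R^2 and RMSE calculations as above.

#R Code (hedged sketch of the stability check described above)
rerun = replicate(5, {
  fit = train(mat_perf[mat_perf$randu <= .67, 1:30], mat_perf$G3[mat_perf$randu <= .67],
              method="rf", metric="RMSE", trControl=ctrl, importance=TRUE)
  pred = predict(fit, test, "raw")
  c(r.squared = cor(test$G3, pred)^2, rmse = summary(lm(test$G3 ~ pred))$sigma)
})
rowMeans(rerun)  # average test-set R^2 and RMSE across the re-runs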

Again, let’s look at the variable importances generated for each training model:

#Python Code
importances_aav = DataFrame({'cols':mp_X_train_aav.columns, 'imps':rf_aav.feature_importances_})
print importances_aav.sort(['imps'], ascending=False)

                 cols      imps
5            failures  0.629985
12           absences  0.057430
1                Medu  0.037081
41      schoolsup_yes  0.036830
0                 age  0.029672
23       Mjob_at_home  0.029642
16              sex_M  0.026949
15              sex_F  0.026052
40       schoolsup_no  0.019097
26      Mjob_services  0.016354
55       romantic_yes  0.014043
51         higher_yes  0.012367
2                Fedu  0.011016
39     guardian_other  0.010715
37    guardian_father  0.006785
8               goout  0.006040
11             health  0.005051
54        romantic_no  0.004113
7            freetime  0.003702
3          traveltime  0.003341

#R Code
varImp(trf2)

## rf variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##            Overall
## absences    100.00
## failures     70.49
## schoolsup    47.01
## romantic     32.20
## Pstatus      27.39
## goout        26.32
## higher       25.76
## reason       24.02
## guardian     22.32
## address      21.88
## Fedu         20.38
## school       20.07
## traveltime   20.02
## studytime    18.73
## health       18.21
## Mjob         17.29
## paid         15.67
## Dalc         14.93
## activities   13.67
## freetime     12.11

Now in both cases we’re seeing that absences and failures are considered the top 2 most important variables for predicting final math exam grade. It makes sense to me, but frankly it is a little sad that the two most important variables are so negative 😦 On to the third Random Forest model, containing everything from the second with the addition of the students’ marks on their second math exam!

Training and Testing the Third Random Forest Model

#Python Code

#allv = all variables (except G1)
allvars = range(0,30)
allvars.append(31)
mp_X_allv = mat_perf[mat_perf.columns[allvars]]
mp_X_train_allv = mp_X_allv[mat_perf['randu'] <= .67]
mp_X_test_allv = mp_X_allv[mat_perf['randu'] > .67]

# for the training set
cat_cols = [x for x in mp_X_train_allv.columns if mp_X_train_allv[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_train_allv[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_train_allv = concat([mp_X_train_allv, new_cols], axis=1)
    
# for the testing set
cat_cols = [x for x in mp_X_test_allv.columns if mp_X_test_allv[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_test_allv[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_test_allv = concat([mp_X_test_allv, new_cols], axis=1)

mp_X_train_allv.drop(cat_cols, inplace=True, axis=1)
mp_X_test_allv.drop(cat_cols, inplace=True, axis=1)

rf_allv = RandomForestRegressor(bootstrap=True, 
           criterion='mse', max_depth=2, max_features='auto',
           min_density=None, min_samples_leaf=1, min_samples_split=2,
           n_estimators=100, n_jobs=1, oob_score=True, random_state=None,
           verbose=0)
rf_allv.fit(mp_X_train_allv, mp_Y_train)

y_pred_allv = rf_allv.predict(mp_X_test_allv)
third_test = DataFrame({"pred.G3.plusG2" : y_pred_allv, "G3" : mp_Y_test})
sns.lmplot("G3", "pred.G3.plusG2", third_test, size=7, aspect=1.5)
print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred_allv)[0]**2
print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred_allv))

Python Scatter Plot - Third Model Pred vs Actual

R^2 value of 0.836089929903
RMSE of 2.11895794845

Obviously we have added a highly predictive piece of information here by adding the grades from their second math exam (variable name “G2”). I was reluctant to add this variable at first, because predicting test marks with previous test marks prevents the model from being useful much earlier in the year, when those tests have not yet been administered. However, I did want to see what the model would look like when I included it anyway! Now let’s see how predictive these variables were when I put them into a model in R:

#R Code
trf3 = train(mat_perf[mat_perf$randu <= .67,c(1:30,32)], mat_perf$G3[mat_perf$randu <= .67], 
             method="rf", metric="RMSE", data=mat_perf, 
             trControl=ctrl, importance=TRUE)
test$pred.g3.plusG2 = predict(trf3, test, "raw")
cor.test(test$G3, test$pred.g3.plusG2)$estimate[[1]]^2
summary(lm(test$G3 ~ test$pred.g3.plusG2))$sigma
ggplot(test, aes(x=G3, y=pred.g3.plusG2)) + geom_point() + 
  stat_smooth(method="lm") + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

Scatter Plot - Third Model Pred vs Actual

R^2 value of 0.9170506
RMSE of 1.3346087

Well, it appears that yet again we have a case where the R model has fared better than the Python model. I find it notable that when you look at the scatterplot for the Python model you can see what look like steps in the points as you scan your eyes from the bottom-left part of the trend line to the top-right part. It appears that the Random Forest model in R has benefitted from the tuning process, and as a result the distribution of the residuals is more homoscedastic and also obviously closer to the regression line than in the Python model. I still wonder how much more similar these results would be if I had carried out the Python analysis by tuning while cross validating, like I did in R!

For the last time, let’s look at the variable importances generated for each training model:

#Python Code
importances_allv = DataFrame({'cols':mp_X_train_allv.columns, 'imps':rf_allv.feature_importances_})
print importances_allv.sort(['imps'], ascending=False)

                 cols      imps
13                 G2  0.924166
12           absences  0.075834
14          school_GP  0.000000
25        Mjob_health  0.000000
24       Mjob_at_home  0.000000
23          Pstatus_T  0.000000
22          Pstatus_A  0.000000
21        famsize_LE3  0.000000
20        famsize_GT3  0.000000
19          address_U  0.000000
18          address_R  0.000000
17              sex_M  0.000000
16              sex_F  0.000000
15          school_MS  0.000000
56       romantic_yes  0.000000
27      Mjob_services  0.000000
11             health  0.000000
10               Walc  0.000000
9                Dalc  0.000000
8               goout  0.000000

#R Code
varImp(trf3)

## rf variable importance
## 
##   only 20 most important variables shown (out of 31)
## 
##            Overall
## G2         100.000
## absences    33.092
## failures     9.702
## age          8.467
## paid         7.591
## schoolsup    7.385
## Pstatus      6.604
## studytime    5.963
## famrel       5.719
## reason       5.630
## guardian     5.278
## Mjob         5.163
## school       4.905
## activities   4.532
## romantic     4.336
## famsup       4.335
## traveltime   4.173
## Medu         3.540
## Walc         3.278
## higher       3.246

Now this is VERY telling, and gives me insight as to why the scatterplot from the Python model had that staircase quality to it. The R model is taking into account way more variables than the Python model. G2 obviously takes the cake in both models, but I suppose it overshadowed everything else by so much in the Python model, that for some reason it just didn’t find any use for any other variable than absences.

Conclusion

This was fun! For all the work I did in Python, I used IPython Notebook. Being an avid RStudio user, I’m not used to web-browser based interactive coding like what IPython Notebook provides. I discovered that I enjoy it and found it useful for laying out the information that I was using to write this blog post (I also laid out the R part of this analysis in RMarkdown for that same reason). What I did not like about IPython Notebook is that when you close it/shut it down/then later reinitialize it, all of the objects that form your data and analysis are gone and all you have left are the results. You must then re-run all of your code so that your objects are resident in memory again. It would be nice to have some kind of convenience function to save everything to disk so that you can reload at a later time.

I found myself stumbling a lot trying to figure out which Python packages to use for each particular purpose and I tended to get easily frustrated. I had to keep reminding myself that it’s a learning curve to a similar extent as it was for me while I was learning R. This frustration should not be a deterrent from picking it up and learning how to do machine learning in Python. Another part of my frustration was not being able to get variable importances from my Random Forest models in Python when I was building them using cross validation and grid searches. If you have a link to share with me that shows an example of this, I’d be happy to read it.

I liked seaborn and I think if I spend more time with it then perhaps it could serve as a decent alternative to graphing in ggplot2. That being said, I’ve spent so much time using ggplot2 that sometimes I wonder if there is anything out there that rivals its flexibility and elegance!

The issue I mentioned above with categorical variables is annoying, and it really makes me wonder whether using a tree-based model in R would be intrinsically superior, thanks to its automatic handling of categorical variables, compared with Python, where you need to one-hot encode these variables.

All in all, I hope this was as useful and educational for you as it was for me. It’s important to step outside of your comfort zone every once in a while 🙂

Multiple Classification and Authorship of the Hebrew Bible

Sitting in my synagogue this past Saturday, I started thinking about the authorship analysis that I did using function word counts from texts authored by Shakespeare, Austen, etc.  I started to wonder if I could do something similar with the component books of the Torah (Hebrew bible).

A very cursory reading of the Documentary Hypothesis indicates that the only books of the Torah supposed to be authored by one person each were Vayikra (Leviticus) and Devarim (Deuteronomy).  The remaining three appear to be a confusing hodgepodge compilation from multiple authors.  I figured that if I submitted the books of the Torah to a similar analysis, and if the Documentary Hypothesis is spot-on, then the analysis should be able to accurately classify only Vayikra and Devarim.

The theory behind the following analysis (taken from the English textual world, of course) seems to be this: when someone writes a book, they write with a very particular style.  If you are going to be able to detect that style statistically, it is convenient to detect it using function words.  Function words (“and”, “also”, “if”, “with”, etc.) need to be used regardless of content, and therefore should show up throughout the text being analyzed.  Each author uses a distinct number/proportion of each of these function words, and authors are therefore distinguishable based on their profiles of usage.

With that in mind, I started my journey.  The first steps were to find an online source of Torah text that I could easily scrape for word counts, and then to figure out which Hebrew function words to look for.  For the Torah text, I relied on the inimitable Chabad.org.  They hired good, rational web developer(s) to make their website, and so looping through each perek (chapter) of the Torah was a matter of copying and pasting html page numbers from their source code.

Several people told me that I’d be wanting for function words in Hebrew, as there are not as many as in English.  However, I found a good 32 of them, as listed below:

Transliteration Hebrew Function Word Rough Translation Word Count
al עַל Upon 1262
el אֶל To 1380
asher אֲשֶׁר That 1908
ca_asher כַּאֲשֶׁר As 202
et אֶת (Direct object marker) 3214
ki כִּי For/Because 1030
col וְכָל + כָּל + לְכָל + בְּכָל + כֹּל All 1344
ken כֵּן Yes/So 124
lachen לָכֵן Therefore 6
hayah_and_variants תִּהְיֶינָה + תִּהְיֶה + וְהָיוּ + הָיוּ + יִהְיֶה + וַתְּהִי + יִּהְיוּ + וַיְהִי + הָיָה Be 819
ach אַךְ But 64
byad בְּיַד By 32
gam גַם Also/Too 72
mehmah מֶה + מָה What 458
haloh הֲלֹא Was not? 17
rak רַק Only 59
b_ad בְּעַד For the sake of 5
loh לֹא No/Not 1338
im אִם If 332
al2 אַל Do not 217
ele אֵלֶּה These 264
haheehoo הַהִוא + הַהוּא That 121
ad עַד Until 324
hazehzot הַזֶּה + הַזֹּאת + זֶה + זֹאת This 474
min מִן From 274
eem עִם With 80
mi מִי Who 703
oh אוֹ Or 231
maduah מַדּוּעַ Why 10
etzel אֵצֶל Beside 6
heehoo הִוא + הוּא + הִיא Him/Her/It 653
az אָז Thus 49

This list is not exhaustive, but definitely not small!  My one hesitation when coming up with this list surrounds the Hebrew word for “and”.  “And” takes the form of a single letter that attaches to the beginning of a word (a “vav” marked with a different vowel sound depending on its context), which I was afraid to try to extract because I worried that if I tried to count it, I would mistakenly count other vav’s that were a valid part of a word with a different meaning.  It’s a very frequent word, as you can imagine, and its absence might very well affect my subsequent analyses.
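
To illustrate the counting approach (the real counting was done in a Python script against the chabad.org pages, so treat this as an illustrative sketch with a made-up, unpointed snippet): split a chapter into words and count exact matches of each standalone function word. The prefixed single-letter “and” never appears as its own token, which is exactly why it couldn’t be counted reliably this way.

#R Code (illustrative sketch only)
chapter_text = "ויאמר אלהים כי טוב כי"      # made-up, unpointed snippet
tokens = unlist(strsplit(chapter_text, " "))
sum(tokens == "כי")                          # standalone occurrences of "ki" -> 2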

Anyhow, following is the structure of the Torah:

‘Chumash’ / Book Number of Chapters
‘Bereishit’ / Genesis 50
‘Shemot’ / Exodus 40
‘Vayikra’ / Leviticus 27
‘Bamidbar’ / Numbers 36
‘Devarim’ / Deuteronomy 34

Additionally, I’ve included a faceted histogram below showing the distribution of word-counts per chapter by chumash/book of the Torah:

> m = ggplot(torah, aes(x=wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)

Word Count Dist by Chumash

You can see that the books are not entirely different in terms of word counts of the component chapters, except for the possibility of Vayikra, which seems to tend towards the shorter chapters.

After making a Python script to count the above words within each chapter of each book, I loaded it up into R and split it into a training and testing sample:

torah$randu = runif(187, 0, 1)
torah.train = torah[torah$randu <= .4,]
torah.test = torah[torah$randu > .4,]

For this analysis, it seemed that using Random Forests made the most sense.  However, I wasn’t quite sure if I should use the raw counts, or proportions, so I tried both. After whittling down the variables in both models, here are the final training model definitions:

torah.rf = randomForest(chumash ~ al + el + asher + caasher + et + ki + hayah + gam + mah + loh + haheehoo + oh + heehoo, data=torah.train, ntree=5000, importance=TRUE, mtry=8)

torah.rf.props = randomForest(chumash ~ al_1 + el_1 + asher_1 + caasher_1 + col_1 + hayah_1 + gam_1 + mah_1 + loh_1 + im_1 + ele_1 + mi_1 + oh_1 + heehoo_1, data=torah.train, ntree=5000, importance=TRUE, mtry=8)
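
As an aside, here’s a hedged sketch of how the proportion versions of the predictors could be derived from the raw counts; the “_1” suffix and the wordcount column name are assumptions based on the variable names appearing elsewhere in this post.

#R Code (hedged sketch)
count.cols = c("al", "el", "asher", "caasher", "col", "hayah", "gam", "mah",
               "loh", "im", "ele", "mi", "oh", "heehoo")
# Each word's count divided by the chapter's total word count:
torah[paste0(count.cols, "_1")] = torah[count.cols] / torah$wordcount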

As you can see, the final models were mostly the same, but with a few differences. Following are the variable importances from each Random Forests model:

> importance(torah.rf)

 Word MeanDecreaseAccuracy MeanDecreaseGini
hayah 31.05139 5.979567
heehoo 20.041149 4.805793
loh 18.861843 6.244709
mah 18.798385 4.316487
al 16.85064 5.038302
caasher 15.101464 3.256955
et 14.708421 6.30228
asher 14.554665 5.866929
oh 13.585169 2.38928
el 13.010169 5.605561
gam 5.770484 1.652031
ki 5.489 4.005724
haheehoo 2.330776 1.375457

> importance(torah.rf.props)

Word MeanDecreaseAccuracy MeanDecreaseGini
asher_1 37.074235 6.791851
heehoo_1 29.87541 5.544782
al_1 26.18609 5.365927
el_1 17.498034 5.003144
col_1 17.051121 4.530049
hayah_1 16.512206 5.220164
loh_1 15.761723 5.157562
ele_1 14.795885 3.492814
mi_1 12.391427 4.380047
gam_1 12.209273 1.671199
im_1 11.386682 2.651689
oh_1 11.336546 1.370932
mah_1 9.133418 3.58483
caasher_1 5.135583 2.059358

It’s funny that the results, from a raw numbers perspective, show that hayah, the Hebrew verb for “to be”, shows up at the top of the list.  That’s the same result as in the Shakespeare et al. analysis!  Having established that all variables in each model had some kind of an effect on the classification, the next task was to test each model on the testing sample, and see how well each chumash/book of the Torah could be classified by that model:

> torah.test$pred.chumash = predict(torah.rf, torah.test, type="response")
> torah.test$pred.chumash.props = predict(torah.rf.props, torah.test, type="response")

> xtabs(~torah.test$chumash + torah.test$pred.chumash)
                  torah.test$pred.chumash
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
       'Bamidbar'            4            5          2         8          7
       'Bereishit'           1           14          1        14          2
       'Devarim'             1            2         17         0          1
       'Shemot'              2            4          4         9          2
       'Vayikra'             5            0          4         0          5

> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash),1)
                  torah.test$pred.chumash
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'   0.15384615   0.19230769 0.07692308 0.30769231 0.26923077
       'Bereishit'  0.03125000   0.43750000 0.03125000 0.43750000 0.06250000
       'Devarim'    0.04761905   0.09523810 0.80952381 0.00000000 0.04761905
       'Shemot'     0.09523810   0.19047619 0.19047619 0.42857143 0.09523810
       'Vayikra'    0.35714286   0.00000000 0.28571429 0.00000000 0.35714286

> xtabs(~torah.test$chumash + torah.test$pred.chumash.props)
                  torah.test$pred.chumash.props
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
       'Bamidbar'            0            5          0        13          8
       'Bereishit'           1           16          0        13          2
       'Devarim'             0            2         11         4          4
       'Shemot'              1            4          2        13          1
       'Vayikra'             3            3          0         0          8

> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash.props),1)
                  torah.test$pred.chumash.props
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'   0.00000000   0.19230769 0.00000000 0.50000000 0.30769231
       'Bereishit'  0.03125000   0.50000000 0.00000000 0.40625000 0.06250000
       'Devarim'    0.00000000   0.09523810 0.52380952 0.19047619 0.19047619
       'Shemot'     0.04761905   0.19047619 0.09523810 0.61904762 0.04761905
       'Vayikra'    0.21428571   0.21428571 0.00000000 0.00000000 0.57142857

So, from the perspective of the raw number of times each function word was used, Devarim, or Deuteronomy, seemed to be the most internally consistent, with 81% of the chapters in the testing sample correctly classified. Interestingly, from the perspective of the proportion of times each function word was used, we see that Devarim, Shemot, and Vayikra (Deuteronomy, Exodus, and Leviticus) had over 50% of their chapters correctly classified in the testing sample.

I’m tempted to say, at the least, that there is evidence that Devarim was written with one stylistic framework in mind, and potentially by one distinct author. From a proportion point of view, it appears that Shemot and Vayikra also show an internal consistency suggestive of something close to one stylistic framework, or possibly a distinct author for each book. I’m definitely skeptical of my own analysis, but what do you think?

The last part of this analysis comes from a suggestion given to me by a friend, which was that once I had modelled the unique profiles of function words within each book of the Torah, I should use that model on some later, post-Torah texts.  Apparently one idea is that the “Deuteronomist Source” was also behind the writing of Joshua, Judges, and Kings.  If the same author was behind all four books, then when I run my model on these books, they should tend to be classified by my model as Devarim/Deuteronomy, more so than as the other books.

As above, below I show the distribution of word count by book, for comparison’s sake:

> m = ggplot(neviim, aes(wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)

Word Count Dist by Prophets Book

Interestingly, chapters in these particular post-Torah texts seem to be a bit longer, on average, than those in the Torah.

Next, I gathered counts of the same function words in Joshua, Judges, and Kings as I had for the 5 books of the Torah, and tested my Random Forests Torah models on them.  As you can see below, the result was anything but clear:

> xtabs(~neviim$chumash + neviim$pred.chumash)
              neviim$pred.chumash
neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
      'Joshua'           3            7          7         6          1
      'Judges'           2           11          2         6          0
      'Kings'            0            8          3        10          1

> xtabs(~neviim$chumash + neviim$pred.chumash.props)
              neviim$pred.chumash.props
neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
      'Joshua'           2            8          6         7          1
      'Judges'           0            9          2         9          1
      'Kings'            0            7          6         7          2

I didn’t even bother to re-express this table as proportions, because it’s quite clear that each book of the prophets that I analyzed wasn’t clearly classified into any one category.  Looking at these tables, there doesn’t yet seem to me to be any evidence, from this analysis, that whoever authored Devarim/Deuteronomy also authored these later texts.

Conclusion

I don’t think that this has been a full enough analysis.  There are a few things in it that bother me, or make me wonder, that I’d love input on.  Let me list those things:

  1. As mentioned above, I’m missing the inclusion of the Hebrew “and” in this analysis.  I’d like to know how to extract counts of the Hebrew “and” without extracting counts of the Hebrew letter “vav” where it doesn’t signify “and”.
  2. Similar to my exclusion of “and”, there are a few one letter prepositions that I have not included as individual predictor variables.  Namely ל, ב, כ, מ, signifying “to”, “in”/”with”, “like”, and “from”.  How do I count these without counting the same letters that begin a different word and don’t mean the same thing?
  3. Is it more valid to consider the results of my analyses that were done on the word frequencies as proportions (individual word count divided by total number of words in the chapter), or are both valid?
  4. Does a list exist somewhere that details, chapter-by-chapter, which source is believed to have written the Torah text, according to the Documentary Hypothesis, or some more modern incarnation of the Hypothesis?  I feel that if I were able to categorize the chapters specifically, rather than just attributing them to the whole book (as a proxy of authorship), then the analysis might be a lot more interesting.

All that being said, I’m intrigued that when you look at the raw number of how often the function words were used, Devarim/Deuteronomy seems to be the most internally consistent.  If you’d like, you can look at the python code that I used to scrape the chabad.org website here: python code for scraping, although please forgive the rudimentary coding!  You can get the dataset that I collected for the Torah word counts here: Torah Data Set, and the data set that I collected for the Prophetic text word counts here: Neviim data set.  By all means, do the analysis yourself and tell me how to do it better 🙂

My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis

After the work I did for my last post, I wanted to practice doing multiple classification.  I first thought of using the famous iris dataset, but felt that was a little boring.  Ideally, I wanted to look for a practice dataset where I could successfully classify data using both categorical and numeric predictors.  Unfortunately it was tough for me to find such a dataset that was easy enough for me to understand.

The dataset I use in this post comes from a textbook called Analyzing Categorical Data by Jeffrey S Simonoff, and lends itself to basically the same kind of analysis done by blogger “Wingfeet” in his post predicting authorship of Wheel of Time books.  In this case, the dataset contains counts of stop words (function words in English, such as “as”, “also”, “even”, etc.) in chapters, or scenes, from books or plays written by Jane Austen, Jack London (I’m not sure if “London” in the dataset might actually refer to another author), John Milton, and William Shakespeare. Being a textbook example, you just know there’s something worth analyzing in it!!  The following table describes the numerical breakdown of books and chapters from each author:

# Books # Chapters/Scenes
Austen 5 317
London 6 296
Milton 2 55
Shakespeare 12 173

Overall, the dataset contains 841 rows and 71 columns.  69 of those columns are the counted stop words (wow!), 1 is for what’s called the “BookID”, and the last is for the Author.  I hope that the word counts are the number of times each word shows up per 100 words, or something that makes the counts comparable between authors/books.
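
One quick way to check that hope would be to look at the row totals of the stop-word columns: if the counts are per-100-words rates, the totals should sit in a fairly narrow band. A sketch (the assumption that the first 69 columns hold the counts is mine):

#R Code (hedged sketch)
summary(rowSums(authorship[, 1:69]))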

The first thing I did after loading up the data into R was to create a training and testing set:

> authorship$randu = runif(841, 0,1)
> authorship.train = authorship[authorship$randu < .4,]
> authorship.test = authorship[authorship$randu >= .4,]

Then I set out to try to predict authorship in the testing data set using a Random Forests model, a Conditional Inference Tree model, and a Linear Discriminant Analysis model.

Random Forests Model

Here’s the code I used to train the Random Forests model (after finding out that the word “one” seemed to not be too important for the classification):

authorship.model.rf = randomForest(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have + her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train, ntree=5000, mtry=15, importance=TRUE)

It seemed to me that the mtry argument shouldn’t be too low, as we are trying to discriminate between authors!  Following is a graph showing the Mean Decrease in Accuracy for each of the words in the Random Forests Model:

Discriminating Words - RF

As you can see, some of the most important words for classification in this model were “was”, “be”, “to”, “the”, “her” and “had”.  At this point, I can’t help but think of the ever-famous “to be, or not to be” line, and wonder if these are the sort of words that would show up more often in Shakespeare texts.  I don’t have the original texts to re-analyze, so I can only rely on what I have in this dataset to try to answer that question.  Doing a simple average of how often the word “be” shows up per chapter per author, I see that it shows up an average of 12.9 times per scene in Shakespeare, and an average of 20.2 times per chapter in Austen!  Shakespeare doesn’t win that game!!
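
That per-chapter average is just a grouped mean; something along these lines should give it (a sketch using the full dataset):

#R Code
tapply(authorship$be, authorship$Author, mean)  # mean count of "be" per chapter/scene, by author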

Anyway, the Random Forests model does pretty well in classifying authors in the test set, as you can see in the counts and proportions tables below:

> authorship.test$pred.author.rf = predict(authorship.model.rf, authorship.test, type="response")
> table(authorship.test$Author, authorship.test$pred.author.rf)
> prop.table(table(authorship.test$Author, authorship.test$pred.author.rf),1)

              Austen London Milton Shakespeare
  Austen         182      4      0           0
  London           1    179      0           1
  Milton           0      1     33           2
  Shakespeare      1      2      0         102

                   Austen      London      Milton Shakespeare
  Austen      0.978494624 0.021505376 0.000000000 0.000000000
  London      0.005524862 0.988950276 0.000000000 0.005524862
  Milton      0.000000000 0.027777778 0.916666667 0.055555556
  Shakespeare 0.009523810 0.019047619 0.000000000 0.971428571

If you look on the diagonal, you can see that the model performs pretty well across authors.  It seems to do the worst with Milton (although still pretty darned well), but I think that should be expected due to the low number of books and chapters from him.

Conditional Inference Tree

Here’s the code I used to train the Conditional Inference Tree model:

> authorship.model.ctree = ctree(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have +  her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train)

Following is the plot that shows the significant splits done by the Conditional Inference Tree:

Discriminating Words - CTree

As is painfully obvious at first glance, there are so many terminal nodes that reading them off this graph is out of the question.  Still useful, however, are the ovals indicating which words formed the significant splits.  Similar to the Random Forests model, we see “be” and “was” showing up as the most significant words in discriminating between authors.  Other words it splits on don’t seem to be as high up on the list in the Random Forests model, but the end goal is prediction, right?

Here is how the Conditional Inference Tree model did in predicting authorship in the test set:

> authorship.test$pred.author.ctree = predict(authorship.model.ctree, authorship.test, type="response")
> table(authorship.test$Author, authorship.test$pred.author.ctree)
> prop.table(table(authorship.test$Author, authorship.test$pred.author.ctree),1)

              Austen London Milton Shakespeare
  Austen         173      8      1           4
  London          18    148      3          12
  Milton           0      6     27           3
  Shakespeare      6     10      5          84

                   Austen      London      Milton Shakespeare
  Austen      0.930107527 0.043010753 0.005376344 0.021505376
  London      0.099447514 0.817679558 0.016574586 0.066298343
  Milton      0.000000000 0.166666667 0.750000000 0.083333333
  Shakespeare 0.057142857 0.095238095 0.047619048 0.800000000

Overall it looks like the Conditional Inference Tree model is doing a worse job predicting authorship compared with the Random Forests model (again, looking at the diagonal).  Again we see the Milton records popping up as having the lowest hit rate for classification, but I think it’s interesting/sad that only 80% of Shakespeare records were correctly classified.  Sorry old man, it looks like this model thinks you’re a little bit like Jack London, and somewhat like Jane Austen and John Milton.

Linear Discriminant Analysis

Finally we come to the real star of this particular show.  Here’s the code I used to train the model:

> authorship.model.lda = lda(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have +  her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train)

There’s a lot of summary information that the lda function spits out by default, so I won’t post it here, but I thought the matrix scatterplot of the records plotted along the 3 linear discriminants looked pretty great, so here it is:

linear discriminants for authorship
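
For reference, a matrix scatterplot along these lines will produce a plot like the one above (a sketch, not necessarily the exact code used; colouring by author is my own choice):

#R Code (sketch)
ld.scores = predict(authorship.model.lda, authorship.train)$x   # scores on the 3 linear discriminants
pairs(ld.scores, col = as.numeric(authorship.train$Author), pch = 20)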

From this plot you can see that putting all the words in the linear discriminant model seems to have led to pretty good discrimination between authors.  However, it’s in the prediction that you see this model shine:

> authorship.test$pred.author.lda = predict(authorship.model.lda, authorship.test)$class
> table(authorship.test$Author, authorship.test$pred.author.lda)
> prop.table(table(authorship.test$Author, authorship.test$pred.author.lda),1)

              Austen London Milton Shakespeare
  Austen         185      1      0           0
  London           1    180      0           0
  Milton           0      0     36           0
  Shakespeare      0      0      0         105

                   Austen      London      Milton Shakespeare
  Austen      0.994623656 0.005376344 0.000000000 0.000000000
  London      0.005524862 0.994475138 0.000000000 0.000000000
  Milton      0.000000000 0.000000000 1.000000000 0.000000000
  Shakespeare 0.000000000 0.000000000 0.000000000 1.000000000

As you can see, it performed flawlessly with Milton and Shakespeare, and almost flawlessly with Austen and London.

Looking at an explanation of LDA from dtreg I’m thinking that LDA is performing best here because the discriminant functions created are more sensitive to the boundaries between authors (defined here by stop word counts) than the binary splits made on the predictor variables in the decision tree models.  Does this hold for all cases of classification where the predictor variables are numeric, or does it break down if the normality of the predictors is grossly violated?  Feel free to give me a more experienced/learned opinion in the comments!

Binary Classification – A Comparison of “Titanic” Proportions Between Logistic Regression, Random Forests, and Conditional Trees

Now that I’m on my winter break, I’ve been taking a little bit of time to read up on some modelling techniques that I’ve never used before. Two such techniques are Random Forests and Conditional Trees.  Since both can be used for classification, I decided to see how they compare against a simple binomial logistic regression (something I’ve worked with a lot) for binary classification.

The dataset I used contains records of the survival of Titanic Passengers and such information as sex, age, fare each person paid, number of parents/children aboard, number of siblings or spouses aboard, passenger class and other fields (The titanic dataset can be retrieved from a page on Vanderbilt’s website replete with lots of datasets; look for “titanic3”).

I took one part of the dataset to train my models, and another part to test them.  The factors that I focused on were passenger class, sex, age, and number of siblings/spouses aboard.

First, let’s look at the GLM:

titanic.survival.train = glm(survived ~ pclass + sex + pclass:sex + age + sibsp,
family = binomial(logit), data = titanic.train)

As you can see, I worked in an interaction effect between passenger class and sex, as passenger class showed a much bigger difference in survival rate amongst the women compared to the men (i.e. higher-class women were much more likely to survive than lower-class women, whereas first-class men were more likely to survive than 2nd or 3rd class men, but not by the same margin as amongst the women).  Following is the model summary output, if you’re interested:

> summary(titanic.survival.train)

Call:
glm(formula = survived ~ pclass + sex + pclass:sex + age + sibsp, 
    family = binomial(logit), data = titanic.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9342  -0.6851  -0.5481   0.5633   2.3164  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     6.556807   0.878331   7.465 8.33e-14 ***
pclass         -1.928538   0.278324  -6.929 4.24e-12 ***
sexmale        -4.905710   0.785142  -6.248 4.15e-10 ***
age            -0.036462   0.009627  -3.787 0.000152 ***
sibsp          -0.264190   0.106684  -2.476 0.013272 *  
pclass:sexmale  1.124111   0.299638   3.752 0.000176 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 858.22  on 649  degrees of freedom
Residual deviance: 618.29  on 644  degrees of freedom
AIC: 630.29

Number of Fisher Scoring iterations: 5

So, after I used my model to predict survival probabilities on the testing portion of the dataset, I checked to see how many records showed a probability of over .5 (or 50%), and then how many of those records were actual survivors.  For the GLM, 146/164 (89%) of those records scored at 50% or higher were actual survivors.  Not bad!
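
For reference, that check amounts to something like the following sketch; titanic.test is an assumed name for the held-out portion of the dataset.

#R Code (hedged sketch of the 50% threshold check)
titanic.test$pred.survival = predict(titanic.survival.train, titanic.test, type="response")
predicted.yes = titanic.test[titanic.test$pred.survival > .5, ]
nrow(predicted.yes)          # how many records scored over 50%
sum(predicted.yes$survived)  # how many of those actually survived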

Now let’s move on to Random Forests:

> titanic.survival.train.rf = randomForest(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train,ntree=5000, importance=TRUE)

> titanic.survival.train.rf

Call:
 randomForest(formula = as.factor(survived) ~ pclass + sex + age +      sibsp, data = titanic.train, ntree = 5000, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 5000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 22.62%
Confusion matrix:
    0   1 class.error
0 370  38  0.09313725
1 109 133  0.45041322

> importance(titanic.survival.train.rf)
               0          1 MeanDecreaseAccuracy MeanDecreaseGini
pclass  67.26795 125.166721            126.40379         34.69266
sex    160.52060 221.803515            224.89038         62.82490
age     70.35831  50.568619             92.67281         53.41834
sibsp   60.84056   3.343251             52.82503         14.01936

It seems to me that the output indicates that the Random Forests model is better at identifying non-survivors (true negatives) than survivors (true positives), but when I asked for the predicted survival categories in the testing portion of my dataset, it appeared to do a pretty decent job of predicting who would survive and who wouldn’t:

For the Random Forests model, 155/184 (84%) of those records predicted to survive actually did survive!  Again, not bad.
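The equivalent check for the Random Forests model could be sketched like this (again, the test-set name is a placeholder; predict() on a randomForest classification object returns predicted classes by default):

#R Code (sketch)
library(randomForest)
rf.pred = predict(titanic.survival.train.rf, newdata = titanic.test)  # predicted classes
table(predicted = rf.pred, actual = titanic.test$survived)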

Finally, let’s move on to the Conditional Tree model:

> titanic.survival.train.ctree = ctree(as.factor(survived) ~ pclass + sex + age + sibsp, data=titanic.train)

> titanic.survival.train.ctree

	 Conditional inference tree with 7 terminal nodes

Response:  as.factor(survived) 
Inputs:  pclass, sex, age, sibsp 
Number of observations:  650 

1) sex == {female}; criterion = 1, statistic = 141.436
  2) pclass <= 2; criterion = 1, statistic = 55.831
    3) pclass <= 1; criterion = 0.988, statistic = 8.817
      4)*  weights = 74 
    3) pclass > 1
      5)*  weights = 47 
  2) pclass > 2
    6)*  weights = 105 
1) sex == {male}
  7) pclass <= 1; criterion = 1, statistic = 15.095
    8)*  weights = 88 
  7) pclass > 1
    9) age <= 12; criterion = 1, statistic = 14.851
      10) sibsp <= 1; criterion = 0.998, statistic = 12.062
        11)*  weights = 18 
      10) sibsp > 1
        12)*  weights = 12 
    9) age > 12
      13)*  weights = 306

Titanic Conditional Tree
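The figure above comes from simply plotting the fitted model; with the party package (whose print output matches the one shown), the call is just:

#R Code
library(party)
plot(titanic.survival.train.ctree)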

I really like the graph you get from plotting the conditional tree model, and I find it pretty easy to understand.  As you can see, the model started by splitting the data on sex, which it found most significant, then pclass, then age, and then siblings/spouses.  That being said, let’s look at how the prediction went for comparison purposes:

134/142 (94%) of records predicted to survive actually did survive!  Great!!  Now let’s bring all those prediction stats into one table for a final look:

                     glm randomForests ctree
Actually Survived    146           155   134
Predicted to Survive 164           184   142
% Survived           89%           84%   94%

So, in terms of finding the highest raw number of folks who actually did survive, the Random Forests model won, but in terms of having the highest proportion of correct calls among those it predicted to survive, the Conditional Tree model takes the cake!  Neat exercise!

Notes:

(1) There were numerous records without an age value.  I estimated ages using a regression model fit on the records that did have an age (a code sketch of this appears after these notes).  Following are the coefficients for that model:

44.59238 + (-5.98582*pclass) + (.15971*fare) + (-.14141*pclass:fare)

(2) Although I used GLM scores of over .5 to classify records as survived, I could have played with that threshold to get different results.  I really just wanted to have some fun here, not try every possibility. :)

(3) I hope you enjoyed this post!  If you have anything to teach me about Random Forests or Conditional Trees, please leave a comment and be nice (I’m trying to expand my analysis skills here!!).
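As promised in note (1), here is a rough sketch of the age-imputation model.  The formula age ~ pclass * fare gives the intercept, pclass, fare, and pclass:fare terms listed above; the exact call and object names are placeholders:

#R Code (sketch)
age.model = lm(age ~ pclass * fare, data = titanic[!is.na(titanic$age),])
missing.age = is.na(titanic$age)
titanic$age[missing.age] = predict(age.model, newdata = titanic[missing.age,])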

EDIT:
I realized that I should also post the false negative rate for each model, because it’s important to know how many survivors each model missed. So, here are the stats:

                           glm    rf ctree   gbm
Actually Survived          112   103   124    74
Predicted Not to Survive   495   475   517   429
False Negative Rate      22.6% 21.7%   24% 17.2%

As you can see, the gbm (gradient boosted machine) model shows the lowest false negative rate, which is pretty nice! Compare that against the finding that it correctly predicted the survival of 184/230 (80%) of the records scored at a probability of 50% or higher, and it makes for a pretty decent model.
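In case you’re wondering how those false negative stats were tallied, here is a sketch for one model, reusing the hypothetical glm.pred vector from the earlier GLM sketch:

#R Code (sketch)
predicted.not = glm.pred == 0
sum(titanic.test$survived[predicted.not])   # survivors the model missed
sum(predicted.not)                          # total predicted not to survive
mean(titanic.test$survived[predicted.not])  # false negative rate as reported above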

Using R from Inside Statistica

I’ve been spending a lot of time in the last month or so on projects at work that aren’t statistics related, hence the lack of posts!  In the interim, I had to do some serious research on handling datasets bigger than the last one I worked with (the one that kept threatening to max out my 8 gigs of RAM!).  I kept trying to practice working with R packages like bigmemory and ff (with its ffdf data frames), but nothing completely satisfied my need to handle a big dataset with different data types in different columns.  So, after reading up on different commercial stats packages, I determined that getting Statistica would be best for my supervisor and me (she’s insanely busy and wouldn’t have time for the learning curve of Revolution R, if we were to buy that).

In speaking with my supervisor about Statistica, she mentioned that it can interface with R.  So once we got our copies of Version 11 Advanced, I went ahead and learned how the interface works.

Setup/Installation: The setup and installation of the R integration was really annoying.  There is a COM server application you have to download and install, and you have to run that installation in administrator mode.  Then you have to make sure that R itself was installed in administrator mode.  Finally, you need the rscproxy package in R, installed in the R home directory that sits in your Program Files folder.  It was quite a hassle.  Statistica put a white paper on their website explaining the process.

Memory Usage:  When you actively use the R integration in Statistica, take a look at your memory usage (I’m using a Windows 7 computer for work).  What you will notice is that any time you run an R function in Statistica, the R connector program takes up more and more memory, reflecting the fact that data is being passed from Statistica to R to be processed.  The upshot is that you should probably be careful about how much data you pass to an R procedure from Statistica, so that you don’t max out your memory.

Syntax: Check out the screenshot below.  Typing R syntax into Statistica is, thankfully, pretty easy.  As you can see in the screenshot, if you want to access the active dataset, you treat it as a data frame labelled ActiveDataSet, and then use the $ sign with the variable name from your Statistica dataset just as you would in R.  The only catch seems to be variables with spaces in their names; for those, you have to refer to them by column number instead.
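To give you a flavour, typing something like the following into the R node would work (ActiveDataSet is the name Statistica gives the active spreadsheet; the variable name and column number here are made up for illustration):

#R Code (sketch)
summary(ActiveDataSet$TestScore)  # refer to a variable by name with the $ sign
summary(ActiveDataSet[,5])        # a variable with spaces in its name, referenced by column number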

Functionality: So far, it looks like data only flows from the Statistica spreadsheet, to R, and then back into Statistica report output or a new Statistica spreadsheet.  It would be nice if I could modify the data in a spreadsheet from within R, but that seems to be out of the question.

Main advantage: Since Statistica is a commercial product, the good folks at StatSoft aren’t just going to give you all of the statistical procedures they’ve come up with for free.  For example, since I now have Statistica Advanced, I can do some cool multivariate procedures, but I can’t generate random forests unless I get Statistica Data Miner.  The advantage that the R integration brings, then, is that it gives me access to advanced statistical procedures, like Random Forests, and even graphing abilities like ggplot2, without having to pay extra.  I show an example of a random forest procedure run in Statistica through R in the screenshot above.