I’ve spent so many years using and broadcasting my love for R and using Python quite minimally. Having read recently about machine learning in Python, I decided to take on a fun little ML project using Python from start to finish.

What follows below takes advantage of a neat dataset from the UCI Machine Learning Repository. The data contain Math test performance of 649 students in 2 Portuguese schools. What’s neat about this data set is that in addition to grades on the students’ 3 Math tests, they managed to collect a whole whack of demographic variables (and some behavioural) as well. That lead me to the question of how well can you predict final math test performance based on demographics and behaviour alone. In other words, who is likely to do well, and who is likely to tank?

I have to admit before I continue, I initially intended on doing this analysis in Python alone, but I actually felt lost 3 quarters of the way through and just did the whole darned thing in R. Once I had completed the analysis in R to my liking, I then went back to my Python analysis and continued until I finished to my reasonable satisfaction. For that reason, for each step in the analysis, I will show you the code I used in Python, the results, and then the same thing in R. Do not treat this as a comparison of Python’s machine learning capabilities versus R per se. Please treat this as a comparison of my understanding of how to do machine learning in Python versus R!

Without further ado, let’s start with some import statements in Python and library statements in R:

#Python Code from pandas import * from matplotlib import * import seaborn as sns sns.set_style("darkgrid") import matplotlib.pyplot as plt %matplotlib inline # I did this in ipython notebook, this makes the graphs show up inline in the notebook. import statsmodels.formula.api as smf from scipy import stats from numpy.random import uniform from numpy import arange from sklearn.ensemble import RandomForestRegressor from sklearn.metrics import mean_squared_error from math import sqrt mat_perf = read_csv('/home/inkhorn/Student Performance/student-mat.csv', delimiter=';')

I’d like to comment on the number of import statements I found myself writing in this python script. * Eleven*!! Is that even normal? Note the smaller number of library statements in my R code block below:

#R Code library(ggplot2) library(dplyr) library(ggthemr) library(caret) ggthemr('flat') # I love ggthemr! mat_perf = read.csv('student-mat.csv', sep = ';')

Now let’s do a quick plot of our target variable, scores on the students’ final math test, named ‘G3’.

#Python Code sns.set_palette("deep", desat=.6) sns.set_context(context='poster', font_scale=1) sns.set_context(rc={"figure.figsize": (8, 4)}) plt.hist(mat_perf.G3) plt.xticks(range(0,22,2))

**Distribution of Final Math Test Scores (“G3”)**

That looks pretty pleasing to my eyes. Now let’s see the code for the same thing in R (I know, the visual theme is different. So sue me!)

#R Code ggplot(mat_perf) + geom_histogram(aes(x=G3), binwidth=2)

You’ll notice that I didn’t need to tweak any palette or font size parameters for the R plot, because I used the very fun ggthemr package. You choose the visual theme you want, declare it early on, and then all subsequent plots will share the same theme! There is a command I’ve hidden, however, modifying the figure height and width. I set the figure size using rmarkdown, otherwise I just would have sized it manually using the export menu in the plot frame in RStudio. I think both plots look pretty nice, although I’m very partial to working with ggthemr!

**Univariate estimates of variable importance for feature selection**

Below, what I’ve done in both languages is to cycle through each variable in the dataset (excepting prior test scores) insert the variable name in a dictionary/list, and get a measure of importance of how predictive that variable is, alone, of the final math test score (variable G3). Of course if the variable is qualitative then I get an F score from an ANOVA, and if it’s quantitative then I get a t score from the regression.

In the case of Python this is achieved in both cases using the ols function from scipy’s statsmodels package. In the case of R I’ve achieved this using the aov function for qualitative and the lm function for quantitative variables. The numerical outcome, as you’ll see from the graphs, is the same.

#Python Code test_stats = {'variable': [], 'test_type' : [], 'test_value' : []} for col in mat_perf.columns[:-3]: test_stats['variable'].append(col) if mat_perf[col].dtype == 'O': # Do ANOVA aov = smf.ols(formula='G3 ~ C(' + col + ')', data=mat_perf, missing='drop').fit() test_stats['test_type'].append('F Test') test_stats['test_value'].append(round(aov.fvalue,2)) else: # Do correlation print col + '\n' model = smf.ols(formula='G3 ~ ' + col, data=mat_perf, missing='drop').fit() value = round(model.tvalues[1],2) test_stats['test_type'].append('t Test') test_stats['test_value'].append(value) test_stats = DataFrame(test_stats) test_stats.sort(columns='test_value', ascending=False, inplace=True)

#R Code test.stats = list(test.type = c(), test.value = c(), variable = c()) for (i in 1:30) { test.stats$variable[i] = names(mat_perf)[i] if (is.factor(mat_perf[,i])) { anova = summary(aov(G3 ~ mat_perf[,i], data=mat_perf)) test.stats$test.type[i] = "F test" test.stats$test.value[i] = unlist(anova)[7] } else { reg = summary(lm(G3 ~ mat_perf[,i], data=mat_perf)) test.stats$test.type[i] = "t test" test.stats$test.value[i] = reg$coefficients[2,3] } } test.stats.df = arrange(data.frame(test.stats), desc(test.value)) test.stats.df$variable = reorder(test.stats.df$variable, -test.stats.df$test.value)

And now for the graphs. Again you’ll see a bit more code for the Python graph vs the R graph. Perhaps someone will be able to show me code that doesn’t involve as many lines, or maybe it’s just the way things go with graphing in Python. Feel free to educate me 🙂

#Python Code f, (ax1, ax2) = plt.subplots(2,1, figsize=(48,18), sharex=False) sns.set_context(context='poster', font_scale=1) sns.barplot(x='variable', y='test_value', data=test_stats.query("test_type == 'F Test'"), hline=.1, ax=ax1, x_order=[x for x in test_stats.query("test_type == 'F Test'")['variable']]) ax1.set_ylabel('F Values') ax1.set_xlabel('') sns.barplot(x='variable', y='test_value', data=test_stats.query("test_type == 't Test'"), hline=.1, ax=ax2, x_order=[x for x in test_stats.query("test_type == 't Test'")['variable']]) ax2.set_ylabel('t Values') ax2.set_xlabel('') sns.despine(bottom=True) plt.tight_layout(h_pad=3)

#R Code ggplot(test.stats.df, aes(x=variable, y=test.value)) + geom_bar(stat="identity") + facet_grid(.~test.type , scales="free", space = "free") + theme(axis.text.x = element_text(angle = 45, vjust=.75, size=11))

As you can see, the estimates that I generated in both languages were thankfully the same. My next thought was to use only those variables with a test value (F or t) of 3.0 or higher. What you’ll see below is that this led to a pretty severe decrease in predictive power compared to being liberal with feature selection.

In reality, the feature selection I use below shouldn’t be necessary at all given the size of the data set vs the number of predictors, and the statistical method that I’m using to predict grades (random forest). What’s more is that my feature selection method in fact led me to reject certain variables which I later found to be important in my expanded models! For this reason it would be nice to investigate a scalable multivariate feature selection method (I’ve been reading a bit about boruta but am skeptical about how well it scales up) to have in my tool belt. Enough blathering, and on with the model training:

**Training the First Random Forest Model**

#Python code usevars = [x for x in test_stats.query("test_value >= 3.0 | test_value <= -3.0")['variable']] mat_perf['randu'] = np.array([uniform(0,1) for x in range(0,mat_perf.shape[0])]) mp_X = mat_perf[usevars] mp_X_train = mp_X[mat_perf['randu'] <= .67] mp_X_test = mp_X[mat_perf['randu'] > .67] mp_Y_train = mat_perf.G3[mat_perf['randu'] <= .67] mp_Y_test = mat_perf.G3[mat_perf['randu'] > .67] # for the training set cat_cols = [x for x in mp_X_train.columns if mp_X_train[x].dtype == "O"] for col in cat_cols: new_cols = get_dummies(mp_X_train[col]) new_cols.columns = col + '_' + new_cols.columns mp_X_train = concat([mp_X_train, new_cols], axis=1) # for the testing set cat_cols = [x for x in mp_X_test.columns if mp_X_test[x].dtype == "O"] for col in cat_cols: new_cols = get_dummies(mp_X_test[col]) new_cols.columns = col + '_' + new_cols.columns mp_X_test = concat([mp_X_test, new_cols], axis=1) mp_X_train.drop(cat_cols, inplace=True, axis=1) mp_X_test.drop(cat_cols, inplace=True, axis=1) rf = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2, max_features='auto', min_density=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100, n_jobs=1, oob_score=True, random_state=None, verbose=0) rf.fit(mp_X_train, mp_Y_train)

After I got past the part where I constructed the training and testing sets (with “unimportant” variables filtered out) I ran into a real annoyance. I learned that categorical variables need to be converted to dummy variables before you do the modeling (where each level of the categorical variable gets its own variable containing 1s and 0s. 1 means that the level was present in that row and 0 means that the level was not present in that row; so called “one-hot encoding”). I suppose you could argue that this puts less computational demand on the modeling procedures, but when you’re dealing with tree based ensembles I think this is a drawback. Let’s say you have a categorical variable with 5 features, “a” through “e”. It just so happens that when you compare a split on that categorical variable where “abc” is on one side and “de” is on the other side, there is a very significant difference in the dependent variable. How is one-hot encoding going to capture that? And then, your dataset which had a certain number of columns now has 5 additional columns due to the encoding. “Blah” I say!

Anyway, as you can see above, I used the get_dummies function in order to do the one-hot encoding. Also, you’ll see that I’ve assigned two thirds of the data to the training set and one third to the testing set. Now let’s see the same steps in R:

#R Code keep.vars = match(filter(test.stats.df, abs(test.value) >= 3)$variable, names(mat_perf)) ctrl = trainControl(method="repeatedcv", number=10, selectionFunction = "oneSE") mat_perf$randu = runif(395) test = mat_perf[mat_perf$randu > .67,] trf = train(mat_perf[mat_perf$randu <= .67,keep.vars], mat_perf$G3[mat_perf$randu <= .67], method="rf", metric="RMSE", data=mat_perf, trControl=ctrl, importance=TRUE)

Wait a minute. Did I really just train a Random Forest model in R, do cross validation, and prepare a testing data set with 5 commands!?!? That was a lot easier than doing these preparations and not doing cross validation in Python! I did in fact try to figure out cross validation in sklearn, but then I was having problems accessing variable importances after. I do like the caret package 🙂 Next, let’s see how each of the models did on their testing set:

**Testing the First Random Forest Model**

#Python Code y_pred = rf.predict(mp_X_test) sns.set_context(context='poster', font_scale=1) first_test = DataFrame({"pred.G3.keepvars" : y_pred, "G3" : mp_Y_test}) sns.lmplot("G3", "pred.G3.keepvars", first_test, size=7, aspect=1.5) print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred)[0]**2 print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred))

R^2 value of 0.104940038879

RMSE of 4.66552400292

Here, as in all cases when making a prediction using sklearn, I use the predict method to generate the predicted values from the model using the testing set and then plot the prediction (“pred.G3.keepvars”) vs the actual values (“G3”) using the lmplot function. I like the syntax that the lmplot function from the seaborn package uses as it is simple and familiar to me from the R world (where the arguments consist of “X variable, Y Variable, dataset name, other aesthetic arguments). As you can see from the graph above and from the R^2 value, this model kind of sucks. Another thing I like here is the quality of the graph that seaborn outputs. It’s nice! It looks pretty modern, the text is very readable, and nothing looks edgy or pixelated in the plot. Okay, now let’s look at the code and output in R, using the same predictors.

#R Code test$pred.G3.keepvars = predict(trf, test, "raw") cor.test(test$G3, test$pred.G3.keepvars)$estimate[[1]]^2 summary(lm(test$G3 ~ test$pred.G3.keepvars))$sigma ggplot(test, aes(x=G3, y=pred.G3.keepvars)) + geom_point() + stat_smooth(method="lm") + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

R^2 value of 0.198648

RMSE of 4.148194

Well, it looks like this model sucks a bit less than the Python one. Quality-wise, the plot looks super nice (thanks again, ggplot2 and ggthemr!) although by default the alpha parameter is not set to account for overplotting. The docs page for ggplot2 suggests setting alpha=.05, but for this particular data set, setting it to .5 seems to be better.

Finally for this section, let’s look at the variable importances generated for each training model:

#Python Code importances = DataFrame({'cols':mp_X_train.columns, 'imps':rf.feature_importances_}) print importances.sort(['imps'], ascending=False) cols imps 3 failures 0.641898 0 Medu 0.064586 10 sex_F 0.043548 19 Mjob_services 0.038347 11 sex_M 0.036798 16 Mjob_at_home 0.036609 2 age 0.032722 1 Fedu 0.029266 15 internet_yes 0.016545 6 romantic_no 0.013024 7 romantic_yes 0.011134 5 higher_yes 0.010598 14 internet_no 0.007603 4 higher_no 0.007431 12 paid_no 0.002508 20 Mjob_teacher 0.002476 13 paid_yes 0.002006 18 Mjob_other 0.001654 17 Mjob_health 0.000515 8 address_R 0.000403 9 address_U 0.000330

#R Code varImp(trf) ## rf variable importance ## ## Overall ## failures 100.000 ## romantic 49.247 ## higher 27.066 ## age 17.799 ## Medu 14.941 ## internet 12.655 ## sex 8.012 ## Fedu 7.536 ## Mjob 5.883 ## paid 1.563 ## address 0.000

My first observation is that it was obviously easier for me to get the variable importances in R than it was in Python. Next, you’ll certainly see the symptom of the dummy coding I had to do for the categorical variables. That’s no fun, but we’ll survive through this example analysis, right? Now let’s look which variables made it to the top:

Whereas failures, mother’s education level, sex and mother’s job made it to the top of the list for the Python model, the top 4 were different apart from failures in the R model.

With the understanding that the variable selection method that I used was inappropriate, let’s move on to training a Random Forest model using all predictors except the prior 2 test scores. Since I’ve already commented above on my thoughts about the various steps in the process, I’ll comment only on the differences in results in the remaining sections.

**Training and Testing the Second Random Forest Model**

#Python Code #aav = almost all variables mp_X_aav = mat_perf[mat_perf.columns[0:30]] mp_X_train_aav = mp_X_aav[mat_perf['randu'] <= .67] mp_X_test_aav = mp_X_aav[mat_perf['randu'] > .67] # for the training set cat_cols = [x for x in mp_X_train_aav.columns if mp_X_train_aav[x].dtype == "O"] for col in cat_cols: new_cols = get_dummies(mp_X_train_aav[col]) new_cols.columns = col + '_' + new_cols.columns mp_X_train_aav = concat([mp_X_train_aav, new_cols], axis=1) # for the testing set cat_cols = [x for x in mp_X_test_aav.columns if mp_X_test_aav[x].dtype == "O"] for col in cat_cols: new_cols = get_dummies(mp_X_test_aav[col]) new_cols.columns = col + '_' + new_cols.columns mp_X_test_aav = concat([mp_X_test_aav, new_cols], axis=1) mp_X_train_aav.drop(cat_cols, inplace=True, axis=1) mp_X_test_aav.drop(cat_cols, inplace=True, axis=1) rf_aav = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2, max_features='auto', min_density=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100, n_jobs=1, oob_score=True, random_state=None, verbose=0) rf_aav.fit(mp_X_train_aav, mp_Y_train) y_pred_aav = rf_aav.predict(mp_X_test_aav) second_test = DataFrame({"pred.G3.almostallvars" : y_pred_aav, "G3" : mp_Y_test}) sns.lmplot("G3", "pred.G3.almostallvars", second_test, size=7, aspect=1.5) print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred_aav)[0]**2 print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred_aav))

R^2 value of 0.226587731888

RMSE of 4.3338674965

Compared to the first Python model, the R^2 on this one is more than doubly higher (the first R^2 was .10494) and the RMSE is 7.1% lower (the first was 4.6666254). The predicted vs actual plot confirms that the predictions still don’t look fantastic compared to the actuals, which is probably the main reason why the RMSE hasn’t decreased by so much. Now to the R code using the same predictors:

#R code trf2 = train(mat_perf[mat_perf$randu <= .67,1:30], mat_perf$G3[mat_perf$randu <= .67], method="rf", metric="RMSE", data=mat_perf, trControl=ctrl, importance=TRUE) test$pred.g3.almostallvars = predict(trf2, test, "raw") cor.test(test$G3, test$pred.g3.almostallvars)$estimate[[1]]^2 summary(lm(test$G3 ~ test$pred.g3.almostallvars))$sigma ggplot(test, aes(x=G3, y=pred.g3.almostallvars)) + geom_point() + stat_smooth() + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

R^2 value of 0.3262093

RMSE of 3.8037318

Compared to the first R model, the R^2 on this one is approximately 1.64 times higher (the first R^2 was .19865) and the RMSE is 8.3% lower (the first was 4.148194). Although this particular model is indeed doing better at predicting values in the test set than the one built in Python using the same variables, I would still hesitate to assume that the process is inherently better for this data set. Due to the randomness inherent in Random Forests, one run of the training could be lucky enough to give results like the above, whereas other times the results might even be slightly worse than what I managed to get in Python. I confirmed this, and in fact most additional runs of this model in R seemed to result in an R^2 of around .20 and an RMSE of around 4.2.

Again, let’s look at the variable importances generated for each training model:

#Python Code importances_aav = DataFrame({'cols':mp_X_train_aav.columns, 'imps':rf_aav.feature_importances_}) print importances_aav.sort(['imps'], ascending=False) cols imps 5 failures 0.629985 12 absences 0.057430 1 Medu 0.037081 41 schoolsup_yes 0.036830 0 age 0.029672 23 Mjob_at_home 0.029642 16 sex_M 0.026949 15 sex_F 0.026052 40 schoolsup_no 0.019097 26 Mjob_services 0.016354 55 romantic_yes 0.014043 51 higher_yes 0.012367 2 Fedu 0.011016 39 guardian_other 0.010715 37 guardian_father 0.006785 8 goout 0.006040 11 health 0.005051 54 romantic_no 0.004113 7 freetime 0.003702 3 traveltime 0.003341

#R Code varImp(trf2) ## rf variable importance ## ## only 20 most important variables shown (out of 30) ## ## Overall ## absences 100.00 ## failures 70.49 ## schoolsup 47.01 ## romantic 32.20 ## Pstatus 27.39 ## goout 26.32 ## higher 25.76 ## reason 24.02 ## guardian 22.32 ## address 21.88 ## Fedu 20.38 ## school 20.07 ## traveltime 20.02 ## studytime 18.73 ## health 18.21 ## Mjob 17.29 ## paid 15.67 ## Dalc 14.93 ## activities 13.67 ## freetime 12.11

Now in both cases we’re seeing that absences and failures are considered as the top 2 most important variables for predicting final math exam grade. It makes sense to me, but frankly is a little sad that the two most important variables are so negative 😦 On to to the third Random Forest model, containing everything from the second with the addition of the students’ marks on their second math exam!

**Training and Testing the Third Random Forest Model**

#Python Code #allv = all variables (except G1) allvars = range(0,30) allvars.append(31) mp_X_allv = mat_perf[mat_perf.columns[allvars]] mp_X_train_allv = mp_X_allv[mat_perf['randu'] <= .67] mp_X_test_allv = mp_X_allv[mat_perf['randu'] > .67] # for the training set cat_cols = [x for x in mp_X_train_allv.columns if mp_X_train_allv[x].dtype == "O"] for col in cat_cols: new_cols = get_dummies(mp_X_train_allv[col]) new_cols.columns = col + '_' + new_cols.columns mp_X_train_allv = concat([mp_X_train_allv, new_cols], axis=1) # for the testing set cat_cols = [x for x in mp_X_test_allv.columns if mp_X_test_allv[x].dtype == "O"] for col in cat_cols: new_cols = get_dummies(mp_X_test_allv[col]) new_cols.columns = col + '_' + new_cols.columns mp_X_test_allv = concat([mp_X_test_allv, new_cols], axis=1) mp_X_train_allv.drop(cat_cols, inplace=True, axis=1) mp_X_test_allv.drop(cat_cols, inplace=True, axis=1) rf_allv = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2, max_features='auto', min_density=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100, n_jobs=1, oob_score=True, random_state=None, verbose=0) rf_allv.fit(mp_X_train_allv, mp_Y_train) y_pred_allv = rf_allv.predict(mp_X_test_allv) third_test = DataFrame({"pred.G3.plusG2" : y_pred_allv, "G3" : mp_Y_test}) sns.lmplot("G3", "pred.G3.plusG2", third_test, size=7, aspect=1.5) print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred_allv)[0]**2 print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred_allv))

R^2 value of 0.836089929903

RMSE of 2.11895794845

Obviously we have added a highly predictive piece of information here by adding the grades from their second math exam (variable name was “G2”). I was reluctant to add this variable at first because when you predict test marks with previous test marks then it prevents the model from being useful much earlier on in the year when these tests have not been administered. However, I did want to see what the model would look like when I included it anyway! Now let’s see how predictive these variables were when I put them into a model in R:

#R Code trf3 = train(mat_perf[mat_perf$randu <= .67,c(1:30,32)], mat_perf$G3[mat_perf$randu <= .67], method="rf", metric="RMSE", data=mat_perf, trControl=ctrl, importance=TRUE) test$pred.g3.plusG2 = predict(trf3, test, "raw") cor.test(test$G3, test$pred.g3.plusG2)$estimate[[1]]^2 summary(lm(test$G3 ~ test$pred.g3.plusG2))$sigma ggplot(test, aes(x=G3, y=pred.g3.plusG2)) + geom_point() + stat_smooth(method="lm") + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

R^2 value of 0.9170506

RMSE of 1.3346087

Well, it appears that yet again we have a case where the R model has fared better than the Python model. I find it notable that when you look at the scatterplot for the Python model you can see what look like steps in the points as you scan your eyes from the bottom-left part of the trend line to the top-right part. It appears that the Random Forest model in R has benefitted from the tuning process and as a result the distribution of the residuals are more homoscedastic and also obviously closer to the regression line than the Python model. I still wonder how much more similar these results would be if I had carried out the Python analysis by tuning while cross validating like I did in R!

For the last time, let’s look at the variable importances generated for each training model:

#Python Code importances_allv = DataFrame({'cols':mp_X_train_allv.columns, 'imps':rf_allv.feature_importances_}) print importances_allv.sort(['imps'], ascending=False) cols imps 13 G2 0.924166 12 absences 0.075834 14 school_GP 0.000000 25 Mjob_health 0.000000 24 Mjob_at_home 0.000000 23 Pstatus_T 0.000000 22 Pstatus_A 0.000000 21 famsize_LE3 0.000000 20 famsize_GT3 0.000000 19 address_U 0.000000 18 address_R 0.000000 17 sex_M 0.000000 16 sex_F 0.000000 15 school_MS 0.000000 56 romantic_yes 0.000000 27 Mjob_services 0.000000 11 health 0.000000 10 Walc 0.000000 9 Dalc 0.000000 8 goout 0.000000

#R Code varImp(trf3) ## rf variable importance ## ## only 20 most important variables shown (out of 31) ## ## Overall ## G2 100.000 ## absences 33.092 ## failures 9.702 ## age 8.467 ## paid 7.591 ## schoolsup 7.385 ## Pstatus 6.604 ## studytime 5.963 ## famrel 5.719 ## reason 5.630 ## guardian 5.278 ## Mjob 5.163 ## school 4.905 ## activities 4.532 ## romantic 4.336 ## famsup 4.335 ## traveltime 4.173 ## Medu 3.540 ## Walc 3.278 ## higher 3.246

Now this is VERY telling, and gives me insight as to why the scatterplot from the Python model had that staircase quality to it. The R model is taking into account way more variables than the Python model. G2 obviously takes the cake in both models, but I suppose it overshadowed everything else by so much in the Python model, that for some reason it just didn’t find any use for any other variable than absences.

**Conclusion**

This was fun! For all the work I did in Python, I used IPython Notebook. Being an avid RStudio user, I’m not used to web-browser based interactive coding like what IPython Notebook provides. I discovered that I enjoy it and found it useful for laying out the information that I was using to write this blog post (I also laid out the R part of this analysis in RMarkdown for that same reason). What I did not like about IPython Notebook is that when you close it/shut it down/then later reinitialize it, all of the objects that form your data and analysis are gone and all you have left are the results. You must then re-run all of your code so that your objects are resident in memory again. It would be nice to have some kind of convenience function to save everything to disk so that you can reload at a later time.

I found myself stumbling a lot trying to figure out which Python packages to use for each particular purpose and I tended to get easily frustrated. I had to keep reminding myself that it’s a learning curve to a similar extent as it was for me while I was learning R. This frustration should not be a deterrent from picking it up and learning how to do machine learning in Python. Another part of my frustration was not being able to get variable importances from my Random Forest models in Python when I was building them using cross validation and grid searches. If you have a link to share with me that shows an example of this, I’d be happy to read it.

I liked seaborn and I think if I spend more time with it then perhaps it could serve as a decent alternative to graphing in ggplot2. That being said, I’ve spent so much time using ggplot2 that sometimes I wonder if there is anything out there that rivals its flexibility and elegance!

The issue I mentioned above with categorical variables is *annoying* and it really makes me wonder if using a Tree based R model would intrinsically be superior due to its automatic handling of categorical variables compared with Python, where you need to one-hot encode these variables.

All in all, I hope this was as useful and educational for you as it was for me. It’s important to step outside of your comfort zone every once in a while 🙂

Hi, my name is Adolfo. This is a great post and your blog is awesome! I just subscribed 😀 I’m new in Data Analysis, started about a year and ago from zero and hopefully some day I’ll be prepared in order to change careers, do you recommend any book/online course/MOOC or anything to start in Machine Learning? I already took a few courses on basic statistics. Thanks for sharing!

Hi Adolfo,

An ebook that I really loved and you might too is called Machine Learning with R, by Brett Lantz. It’s $33 USD, but worth every penny! Good luck!

Hello,

I’m a bit language agnostic, but I do think that maybe saying Python needs a lot more imports is not really fair:

% pylab inline

would get rid of many of your imports — which aren’t even that many; try doing something similar in Java 😉 . I guess each language has it’s own quirks and bonuses. Still, nice post.

As I mentioned in my post, the point was to show my progress in learning the machine learning process in Python, and not as an absolute comparison. Thank you for the tip!!

Your complaints about one-hot encoding reflect an unwillingness to accept the unordered nature of categorical variables. If abc | de is more probable than eac | bd, then it is not a categorical variable, it is a numerical variable with a restricted range, and should be encoded using {1,…,5}.

I disagree with your assessment. The example that I brought up was not to say that the features had to be in that exact order on each side of the split, in addition to being more probable than other kinds of splits. The example was to say that if there is such a split containing abc on one side (or cab / acb /etc) and de on the other where the split between the two groups is significantly predictive of the dependent variable, then one-hot would not pick that kind of a split up.

As far as I understand, the only kinds of split that you can encode with one-hot are for example “a vs. everything else”, or “e vs. everything else”, and not ones where the features are grouped together because that happens to be more significant. I hope I’ve explained myself a little better.

ah, that does make a lot more sense; thanks for explaining. Two responses:

– Are you sure that it is not implemented as one-hot under the hood?

– If so, variables that take many values will result in exponentially more computation, becoming intractable rather quickly. Your criticism of one-hot applies to trees in general, which are greedy algorithms.

Hi:

I’ve just repeated your python path with some simplifications (probably the amount of code could be further decreased …)

https://github.com/guziy/PyNotebooks/blob/master/machine_learning_tutorial.ipynb

Thanks for the post, it helps me to ease into this kinds of analysis))

Cheers

I didn’t realize that you could display IPython notebooks on github. Cool!!

There’s a lot that can be done to shorten your Python code. I would suggest looking into the Pipeline class in sklearn (to neatly compose a lot of boilerplate), and also using the function train_test_split:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html

Also, for dummy encoding, at some point dict_vectorizer will be your friend (you’ll thank me later):

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html

I would quote ?.NotYetImplemended in response to your annoyance: “You are very welcome to contribute your code …”

Hello,

Nice post. I am an python addict and took up R the past five years, somewhat the opposite view.

I use python mainly for the ‘raw data processing’ and I use R for the statistical evaluations. Updating and adjusting the input data is fast and easy in python, can be automated and standardized easily. Doing the ‘interactive’ analytical work is performed in R on a nice and ‘standardized’ input dataset. The plot and statistical libraries in R are much more divers. I tend to report in R, but when a ‘final result’ is emerging, it can be implemented in a reduced form in python (which I do not do to often).

Now comes the interesting part. Using the rpy2 library you can use the full functionality of R, feeding it from automated processes in python and gather/report the results in python (which result again in a dataset, that is analyzed in R again…).

So, you do not have to find an alternative for the R libraries in the python libraries, you can benefit from the strongest features from both worlds. The number of libraries with similar capabilities in R is even a larger burden than in python, so the ‘library’ argument as a downside of python is not valid, rather the opposite.

And as mentioned in the other comments, translating what you know in R into python code is not always the most efficient way. I have the same problem when I do my R coding. Full object development in python is the next step to take. That’s the fun thing of python. First you use it as ‘scripting’ tool. Then you gradually shift to more efficient ‘object-oriented’ development without language/syntax issues.

Luc

Pingback: Distilled News | Data Analytics & R

Many of the predictors were binary in nature, e.g., sex, and could have been naturally expressed in binary form (e.g., sex=1 => M, sex=0 => F). Nothing is lost in the translation, the model complexity is reduced, and variable importance is simplified.

Totally true that the binary predictors could be (and indeed were as part of my process) expressed as binary/dummy columns. I’ll be honest and say that when I created dummy columns for each of the categorical features, I felt too lazy to eliminate one feature column from each set so that there were no informational redundancies/colinearities. I don’t think this mattered too much thanks to the fact that I was using random forests (vs some kind of least squares regression) but it irked me very much.

Fun reading! Thanks.

I have a love and hate relationship with both Python and R, and I wish the 2 could get married somehow (rpy2 is not simple to use and I find myself writing an R script which I then port to python/rpy2–takes time and defeats the purpose).

Python numerical/statistical libraries are written by people stronger on the language side. At the moment, the libraries are not on par with R and I doubt the purely statistical python capabilities will ever catch up 😦 . Python’s modules are easier to pick up for me as they follow the language’s style. Debugging via an IDE is a cinch. 🙂

R and its packages are written by people stronger on the statistics side, and R is more decentralized. Hence, packages could be very different making them a pain to learn. R proper is quirky; why define “stopifnot” instead of “stopif” when the former requires double negatives? Perhaps here to be consistent with S? Error messages in R can be so terse that one wastes so much time 😦 Python is more straightforward to write and, at times, even fun. RStudio is a gem, and seems to be written by people who know what slows them down using standalone R. And R has tons of good packages. 🙂

I use both python and R. It is 😦 and 🙂 at the same time.