Predictive Modelling: Comparing Model Results

The AUC of each model, together with its performance at a 50% probability cutoff, is summarised below:

> table_perf
                model   auc accuracy sensitivity specificity kappa
1 logistic regression 0.703    0.645       0.643        0.65 0.256
2                 SVM 0.703    0.635       0.612       0.688 0.257
3        RandomForest 0.705    0.657       0.666       0.635 0.268
4                 XGB 0.706    0.636       0.618        0.68 0.255
5            Ensemble 0.715     0.65       0.637        0.68 0.275
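As an illustration of how the accuracy, sensitivity and specificity columns relate to a 50% cutoff, here is a minimal sketch in base R. It is not the code used to build table_perf; the helper name metrics_at_cutoff and the toy data are hypothetical, and it assumes that "good" loans are the positive class, as the Sensitivity and Specificity definitions in this post suggest.

```r
# Hypothetical helper: classification metrics at a given probability cutoff.
# prob  : predicted probability that a loan is "good" (assumed positive class)
# truth : actual labels, a character vector of "good" / "bad"
metrics_at_cutoff <- function(prob, truth, cutoff = 0.5) {
  pred <- ifelse(prob >= cutoff, "good", "bad")
  tp <- sum(pred == "good" & truth == "good")  # good loans predicted good
  tn <- sum(pred == "bad"  & truth == "bad")   # bad loans predicted bad
  fp <- sum(pred == "good" & truth == "bad")   # bad loans predicted good
  fn <- sum(pred == "bad"  & truth == "good")  # good loans predicted bad
  c(accuracy    = (tp + tn) / length(truth),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Toy data, made up for illustration only (not LendingClub data)
prob  <- c(0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.55)
truth <- c("good", "good", "good", "good", "bad", "bad", "bad", "bad")

m <- metrics_at_cutoff(prob, truth)
print(round(m, 3))
```

Raising the cutoff argument above 0.5 in this sketch labels fewer loans as good, which typically trades sensitivity for specificity.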

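library(pROC)  # plot() on roc objects accepts the legacy.axes argument

# Overlay the ROC curves of the five models (rocCurve_* are assumed to be pROC roc objects created earlier)
plot(rocCurve_logit, legacy.axes = TRUE, col = "red", main = "ROC compare")
plot(rocCurve_svm, legacy.axes = TRUE, col = "blue", add = TRUE)
plot(rocCurve_rf, legacy.axes = TRUE, col = "green", add = TRUE)
plot(rocCurve_xgb, legacy.axes = TRUE, col = "orange", add = TRUE)
plot(rocCurve_ensemble, legacy.axes = TRUE, col = "black", add = TRUE)
legend("bottomright", legend = c("logit", "svm", "rf", "xgb", "ensemble"), fill = c("red", "blue", "green", "orange", "black"))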

The Kappa statistics of all models exceed 20% by only a small margin, which indicates that they perform moderately better than chance. XGB benefits from being trained on all of the downsampled data and has the highest AUC among the individual models (0.706). Comparing performance across models may not be entirely valid, though, because we used different downsampled data for each model. The ensemble model does not improve AUC as much as we expected.
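To make the comparison with chance concrete, Cohen's Kappa can be sketched as the gap between observed agreement and the agreement expected by chance, rescaled by the maximum possible gap. The helper cohen_kappa and the confusion-matrix counts below are hypothetical and chosen only to give a value near the 20% level discussed here; they are not the counts from our models.

```r
# Hypothetical helper: Cohen's kappa from a square confusion matrix
# (rows = predicted class, columns = actual class).
cohen_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                     # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Toy counts, invented for illustration only
cm <- matrix(c(60, 20,
               10, 10),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("good", "bad"),
                             actual    = c("good", "bad")))

k <- cohen_kappa(cm)
print(round(k, 3))  # observed agreement 0.70, chance agreement 0.62
```

In this toy case the accuracy is 70%, yet most of that agreement would also arise by chance given the class proportions, so Kappa is only about 0.21.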

We were surprised to find that Logistic Regression provides very competitive performance. At the 50% cutoff, it yields a reasonable compromise between the percentage of correctly identified good loans (Sensitivity) and bad loans (Specificity) without sacrificing too much Accuracy (recall that the naive strategy yields 72.3% accuracy). SVM with an RBF kernel has the lowest AUC (0.703, level with Logistic Regression at three decimals). We could train it on only a portion of the data, because its training time rises rapidly with the number of observations. RandomForest yields a result comparable to Logistic Regression. XGB sacrifices Sensitivity for Specificity (the ability to recall bad loans), so it may be suitable if we really want to avoid loans that default. The ensemble model improves on XGB only a little. Given the simplicity of the Logistic Regression model, and given that the ROC curves are, overall, not significantly different, we recommend it as the model of choice for predicting the LendingClub dataset.
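The 72.3% figure for the naive strategy can be read against the metrics above with a small sketch. It assumes that the naive strategy means predicting every loan as good, so that 72.3% is simply the share of good loans; the 1000-loan sample below is hypothetical and only reproduces that share.

```r
# Assumption: the naive strategy predicts "good" for every loan.
# Counts are scaled to 1000 hypothetical loans purely for illustration.
truth <- c(rep("good", 723), rep("bad", 277))
pred  <- rep("good", length(truth))

accuracy    <- mean(pred == truth)
sensitivity <- sum(pred == "good" & truth == "good") / sum(truth == "good")
specificity <- sum(pred == "bad"  & truth == "bad")  / sum(truth == "bad")

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

Under that assumption the naive strategy reaches its accuracy by never flagging a bad loan, which is why a model with lower accuracy but a balanced Sensitivity and Specificity can still be the more useful one.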
