- Credit Risk Modelling - Case Studies
- Classification vs. Regression Models
- Case Study - German Credit - Steps to Build a Predictive Model
- Import Credit Data Set in R
- German Credit Data : Data Preprocessing and Feature Selection in R
- Credit Modelling: Training and Test Data Sets
- Build the Predictive Model
- Logistic Regression Model in R
- Measure Model Performance in R Using ROCR Package
- Create a Confusion Matrix in R
- Credit Risk Modelling - Case Study- Lending Club Data
- Explore Loan Data in R - Loan Grade and Interest Rate
- Credit Risk Modelling - Required R Packages
- Loan Data - Training and Test Data Sets
- Data Cleaning in R - Part 1
- Data Cleaning in R - Part 2
- Data Cleaning in R - Part 3
- Data Cleaning in R - Part 5
- Remove Dimensions By Fitting Logistic Regression
- Create a Function and Prepare Test Data in R
- Building Credit Risk Model
- Credit Risk - Logistic Regression Model in R
- Support Vector Machine (SVM) Model in R
- Random Forest Model in R
- Extreme Gradient Boosting in R
- Predictive Modelling: Averaging Results from Multiple Models
- Predictive Modelling: Comparing Model Results
- How Insurance Companies Calculate Risk
Predictive Modelling: Comparing Model Results
AUC for each model and their performance when we set probability cutoff at 50% is summarised below:
> table_perf
model auc accuracy sensitivity specificity kappa
1 logistic regression 0.703 0.645 0.643 0.65 0.256
2 SVM 0.703 0.635 0.612 0.688 0.257
3 RandomForest 0.705 0.657 0.666 0.635 0.268
4 XGB 0.706 0.636 0.618 0.68 0.255
5 Ensemble 0.715 0.65 0.637 0.68 0.275
>
plot(rocCurve_logit,legacy.axes = TRUE,col="red",main="ROC compare")
plot(rocCurve_svm,legacy.axes = TRUE,col="blue",add=TRUE)
plot(rocCurve_rf,legacy.axes = TRUE,col="green",add=TRUE)
plot(rocCurve_xgb,legacy.axes = TRUE,col="orange",add=TRUE)
plot(rocCurve_ensemble,legacy.axes = TRUE,col="black",add=TRUE)
legend("bottomright",legend=c("logit","svm","rf","xbg","ensemble"),fill=c("red","blue","green","orange","black"))
Kappa statistics from all models exceed 20% by just small amount, which indicated that they perform moderately better than chance. XGB takes advantage of receiving all downsampling data and provides highest AUC. Comparing performance across models may not be valid, though, because we use different downsampling data for each model. Ensemble model doesn’t improve AUC as we expected.
We are surprised to find that Logistic regression does provide a very competitive performance. At 50% cutoff, it yields reasonable compromise between the percentage of correctly identified good loans (Sensitivity) and bad loans (Specificity) while not sacrificing Accuracy too much(recall that the naive strategy yields 72.3% accuracy). SVM with RBF kernel has lowest AUC. We can train it with only some portion of data as time complexity of the model rapidly jump up. RandomForest yields a comparable result to Logistic Regression. XGB sacrifices Sensitivity rate for Specificity(ability to recall bad loans). It may be suitable if we really want to avoid default loans. Ensemble model does tune up XGB a little bit. Given the simplicity of Logistic Regression model, and ROC graphs are, overall, not significantly difference, we recommend it as a model of choice for predicting LendingClub dataset.
Related Downloads
Data Science in Finance: 9-Book Bundle
Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.
What's Included:
- Getting Started with R
- R Programming for Data Science
- Data Visualization with R
- Financial Time Series Analysis with R
- Quantitative Trading Strategies with R
- Derivatives with R
- Credit Risk Modelling With R
- Python for Data Science
- Machine Learning in Finance using Python
Each book includes PDFs, explanations, instructions, data files, and R code for all examples.
Get the Bundle for $39 (Regular $57)Free Guides - Getting Started with R and Python
Enter your name and email address below and we will email you the guides for R programming and Python.