Building Credit Risk Model

Loan data typically contain a much higher proportion of good loans than bad ones, so we can achieve high accuracy just by labeling every loan as Fully Paid.

> 100*nrow(data_test %>% filter(loan_status=="Fully.Paid"))/nrow(data_test)
[1] 70.16704

For our test data, this strategy alone gives about 70.2% accuracy. Recall that we have yet to include the outcome of 'Current' loans. In a real situation, the ratio of Fully Paid loans is usually much higher, so accuracy is not our main concern here. We will instead focus on the trade-off between identifying default loans and mislabelling some good loans in the process. We will look at the ROC curve and pay particular attention to AUC when we train our models.
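To make the trade-off concrete, here is a minimal base R sketch of how AUC and a single point on the ROC curve can be computed from predicted scores. The `labels` and `scores` vectors and the helper names `auc_rank` and `rates_at` are hypothetical and only for illustration; they are not part of the loan data set, and in practice a package such as pROC would normally be used instead.

```r
# Toy labels and scores (hypothetical, for illustration only).
# 1 = Default (positive class), 0 = Fully Paid.
labels <- c(1, 0, 1, 0, 0, 1, 0, 0)
scores <- c(0.9, 0.4, 0.35, 0.2, 0.6, 0.8, 0.1, 0.3)

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen default gets a higher score than a randomly chosen good loan.
auc_rank <- function(labels, scores) {
  r <- rank(scores)                 # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# True positive rate and false positive rate at one score threshold,
# i.e. a single point on the ROC curve.
rates_at <- function(labels, scores, threshold) {
  pred <- as.integer(scores >= threshold)
  c(tpr = sum(pred == 1 & labels == 1) / sum(labels == 1),
    fpr = sum(pred == 1 & labels == 0) / sum(labels == 0))
}

auc_rank(labels, scores)        # 13/15, about 0.867
rates_at(labels, scores, 0.5)   # tpr = 2/3, fpr = 0.2
```

Lowering the threshold catches more defaults (higher true positive rate) but flags more good loans as well (higher false positive rate). AUC summarizes that trade-off over all thresholds.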

Our target variable (Loan Status) is imbalanced: there are many Fully Paid loans and far fewer Default loans. To address this, we can downsample the majority class so that the target variable is split 50/50. In this case, we downsample so that the number of Fully Paid loans equals the number of Default loans. This method tends to work well and runs faster than upsampling or cost-sensitive training. Downsampling helps because, as we saw above, it is trivial to achieve 70% accuracy on the unbalanced data. It also reduces the size of the data.
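A minimal sketch of downsampling in base R follows. The data frame `toy_train`, its `loan_amnt` column, and the helper `downsample_majority` are assumptions made up for this example; only the `loan_status` column name and its `Fully.Paid` level follow the code shown earlier.

```r
set.seed(42)  # make the random sampling reproducible

# Hypothetical toy data: 70 Fully.Paid and 30 Default loans.
toy_train <- data.frame(
  loan_amnt   = round(runif(100, 1000, 35000)),
  loan_status = factor(rep(c("Fully.Paid", "Default"), times = c(70, 30)))
)

# Keep every row of the smallest class and an equally sized random
# sample from each larger class.
downsample_majority <- function(df, target = "loan_status") {
  counts <- table(df[[target]])
  n_min  <- min(counts)
  keep <- unlist(lapply(names(counts), function(cls) {
    idx <- which(df[[target]] == cls)
    idx[sample.int(length(idx), n_min)]  # sample row positions, not values
  }))
  df[sort(keep), , drop = FALSE]
}

balanced <- downsample_majority(toy_train)
table(balanced$loan_status)  # 30 Default, 30 Fully.Paid
```

Only the training data should be balanced this way. The test set keeps its original class ratio so that the evaluation reflects the real distribution of loans.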

Note that, in the end, we aim to stack the results of several learning models (Logistic Regression, SVM, RandomForest, and Extreme Gradient Boosting (XGB)). Since the downside of downsampling is that information from the majority class is discarded, we will draw a new downsample each time we feed data to a model. We anticipate that stacking all four models will give a better result, since together they see more of the majority class.
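The idea of giving each model its own downsample can be sketched as follows. This is only an illustration under simplifying assumptions: the data frame `toy` and the helper `fresh_downsample` are made up, logistic regression stands in for all four base models, and the predictions are combined by a plain average rather than by a trained stacking model.

```r
set.seed(123)

# Hypothetical toy data: one numeric feature, defaults are the minority class.
n <- 400
x <- rnorm(n)
is_default <- rbinom(n, 1, plogis(-1.5 + 1.2 * x))
toy <- data.frame(x = x, is_default = is_default)

# Draw a fresh balanced sample: all defaults plus an equally sized random
# sample of good loans (assumes defaults are the smaller class).
fresh_downsample <- function(df) {
  pos <- which(df$is_default == 1)
  neg <- which(df$is_default == 0)
  df[c(pos, neg[sample.int(length(neg), length(pos))]), ]
}

# Stand-in for the four base models: each one is a logistic regression
# fitted on its own downsample, so each sees different good loans.
n_models <- 4
preds <- sapply(seq_len(n_models), function(i) {
  fit <- glm(is_default ~ x, data = fresh_downsample(toy), family = binomial)
  predict(fit, newdata = toy, type = "response")
})

# Simple blend: average the predicted default probabilities per loan.
blended <- rowMeans(preds)
dim(preds)  # 400 rows (loans) by 4 columns (models)
```

Because every call to `fresh_downsample` picks different good loans, the combined prediction draws on more of the majority class than any single model does.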
