To build our first model, we tune a Logistic Regression model on our training dataset.
First, we set the seed (to any number; we have chosen 100) so that we can reproduce our results.
Then we create a downsampled dataset called samp, which contains an equal number of Default and Fully Paid loans. We can use the table() function to check that the downsampling was done correctly.
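The downsampling step can be sketched in base R as follows. This is a minimal illustration on made-up data, not the course's own code: the data frame train_toy, its loan_amnt column, and the result samp_toy are hypothetical names chosen for the example.

```r
# Minimal sketch of downsampling to balanced classes (base R only).
# train_toy and samp_toy are hypothetical; the course's real data is not used here.
set.seed(100)  # same seed idea as in the text, so the sample is reproducible

train_toy <- data.frame(
  loan_amnt   = round(runif(100, 1000, 35000)),
  loan_status = factor(rep(c("Default", "Fully.Paid"), times = c(20, 80)))
)

# Size of the smaller class: every class is cut down to this many rows
n_minority <- min(table(train_toy$loan_status))

# Draw n_minority row indices from each class, then combine them
rows_by_class <- split(seq_len(nrow(train_toy)), train_toy$loan_status)
keep <- unlist(lapply(rows_by_class, function(rows) sample(rows, n_minority)))

samp_toy <- train_toy[keep, ]

# Check that both classes now have the same count
table(samp_toy$loan_status)
```

Both counts printed by table() should equal the size of the smaller class, which is the check described above.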
We use Elastic Net regularization, which combines the Ridge and Lasso penalties, together with cross-validation to prevent overfitting. Our goal is to maximize AUC. Learn more about Regularization here.
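For reference, the Elastic Net objective for logistic regression can be written as below. It follows the parameterization used in the glmnet documentation; the symbols (N observations, predictors x_i, 0/1 response y_i, intercept \beta_0, coefficients \beta) are notation introduced here, not taken from the tutorial.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}
\Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
\;+\;
\lambda\Bigl[\, \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
              + \alpha\,\lVert\beta\rVert_1 \Bigr]
```

Here lambda controls the overall amount of shrinkage and alpha mixes the two penalties: alpha = 0 gives pure Ridge and alpha = 1 gives pure Lasso.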
Because of limited computational resources, we tune the model on a small subset of the data, with a fixed lambda parameter and a small number of folds (3-fold cross-validation). We then refit the best model on the whole data.
(Note: we show the final tuning result here instead of running the whole process. Execution of the tuning code is disabled, although readers can re-enable it by setting eval = TRUE.)
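Since the tuning code itself is not shown, here is a rough sketch of what such a tuning run could look like with the caret package. It is an assumption-laden illustration rather than the course's actual code: the toy data (x_toy, y_toy), the object names tune_grid, ctrl and glmn_toy, and the choice of 10 evenly spaced alpha values are all introduced for this example. The caret call only runs if caret and glmnet are installed.

```r
# Sketch of tuning alpha with lambda held fixed, using 3-fold CV and ROC (AUC).
# All data and object names here are hypothetical.
set.seed(100)

n <- 300
x_toy <- matrix(rnorm(n * 4), ncol = 4,
                dimnames = list(NULL, paste0("x", 1:4)))
p_toy <- plogis(0.8 * x_toy[, 1] - 0.5 * x_toy[, 2])
y_toy <- factor(ifelse(runif(n) < p_toy, "Fully.Paid", "Default"),
                levels = c("Default", "Fully.Paid"))

# Candidate models: 10 alpha values between 0 (Ridge) and 1 (Lasso), lambda fixed
tune_grid <- expand.grid(alpha = seq(0, 1, length.out = 10), lambda = 0.01)

if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("glmnet", quietly = TRUE)) {
  library(caret)

  # 3-fold cross-validation; twoClassSummary reports ROC, Sens and Spec
  ctrl <- trainControl(method = "cv", number = 3,
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)

  glmn_toy <- train(x = x_toy, y = y_toy,
                    method = "glmnet",
                    metric = "ROC",      # pick the model with the largest AUC
                    trControl = ctrl,
                    tuneGrid = tune_grid)

  print(glmn_toy$bestTune)  # selected alpha and lambda
}
```

Holding lambda fixed keeps the grid to 10 candidate models, which is the kind of saving the text refers to when it mentions limited computational resources.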
Look Inside glmnTuned
We can now examine the output of the generated model.
plot(glmnTuned)
> glmnTuned
glmnet
1268 samples
70 predictor
2 classes: 'Default', 'Fully.Paid'
No pre-processing
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 846, 844, 846
Resampling results across tuning parameters:
  alpha      ROC        Sens       Spec
  0.0000000  0.6754917  0.6261588  0.6261215
  0.1111111  0.6753332  0.6245939  0.6261215
  0.2222222  0.6755427  0.6293332  0.6340055
  0.3333333  0.6753729  0.6277609  0.6371725
  0.4444444  0.6763833  0.6277609  0.6356002
  0.5555556  0.6772067  0.6277609  0.6387523
  0.6666667  0.6774650  0.6230290  0.6387746
  0.7777778  0.6777457  0.6214567  0.6372023
  0.8888889  0.6780438  0.6230365  0.6372023
  1.0000000  0.6782664  0.6277758  0.6356300
Tuning parameter 'lambda' was held constant at a value of 0.01
ROC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 1 and lambda = 0.01.
The best penalty parameter is alpha = 1 (pure Lasso in glmnet's parameterization; alpha = 0 would be pure Ridge), with the shrinkage parameter fixed at lambda = 0.01. We use these values to refit the model on the whole sample.
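library(glmnet)
# Refit on the whole downsampled data with the selected alpha and lambda
model = glmnet(
  x = as.matrix(samp[-getIndexsOfColumns(samp, "loan_status")]),  # drop the loan_status column
  y = samp$loan_status,
  alpha = 1,
  lambda = 0.01,
  family = "binomial",
  standardize = FALSE)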
The finalized Logistic Regression model is applied to the test loan data. We look at the ROC graph and the AUC. We also set the probability cutoff at 50% (note that the higher the predicted probability, the more likely the loan is Fully Paid) and collect some performance metrics for later comparison.
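To make the evaluation step concrete, the sketch below computes the AUC and the metrics at a 50% cutoff on a tiny made-up example, using base R only. The vectors actual and prob_fully_paid are hypothetical stand-ins for the test labels and the model's predicted probabilities of Fully Paid; treating Default as the "positive" class for sensitivity is an assumption made here (it matches the first factor level), not something stated in the tutorial.

```r
# Sketch: AUC and cutoff-based metrics from predicted probabilities (base R).
# actual and prob_fully_paid are made-up values for illustration only.
actual <- factor(c("Fully.Paid", "Fully.Paid", "Fully.Paid", "Fully.Paid",
                   "Default", "Default", "Default", "Default"),
                 levels = c("Default", "Fully.Paid"))
prob_fully_paid <- c(0.9, 0.8, 0.35, 0.6, 0.7, 0.4, 0.3, 0.2)

# AUC via the rank-sum (Mann-Whitney) formula: the share of
# (Fully.Paid, Default) pairs in which the Fully.Paid loan scores higher
is_paid <- actual == "Fully.Paid"
n_paid  <- sum(is_paid)
n_def   <- sum(!is_paid)
auc <- (sum(rank(prob_fully_paid)[is_paid]) - n_paid * (n_paid + 1) / 2) /
  (n_paid * n_def)

# Classify with a 50% cutoff: above 0.5 means predicted Fully.Paid
predicted <- factor(ifelse(prob_fully_paid > 0.5, "Fully.Paid", "Default"),
                    levels = levels(actual))
conf_mat <- table(Predicted = predicted, Actual = actual)

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)
# Assumption: "Default" is the positive class for sensitivity
sensitivity <- conf_mat["Default", "Default"] / sum(conf_mat[, "Default"])
specificity <- conf_mat["Fully.Paid", "Fully.Paid"] / sum(conf_mat[, "Fully.Paid"])

print(conf_mat)
c(auc = auc, accuracy = accuracy,
  sensitivity = sensitivity, specificity = specificity)
```

The AUC does not depend on the cutoff, while accuracy, sensitivity and specificity all change if the 50% threshold is moved.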
This tutorial is a part of the course Credit Risk Modelling in R.