Loan data typically contain a much higher proportion of good loans than bad ones, so we can achieve high accuracy just by labeling every loan as Fully Paid:
> 100*nrow(data_test %>% filter(loan_status=="Fully.Paid"))/nrow(data_test)
For our test data, this strategy alone yields 70.3% accuracy. Recall that we have yet to include the outcome of ‘Current’ loans. In a real situation, the share of Fully Paid loans is usually much higher, so accuracy is not our main concern here. Instead, we will focus on the trade-off of identifying default loans at the expense of mislabelling some good loans. We will look at the ROC curve and pay particular attention to AUC when we train our models.
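To make the gap between accuracy and AUC concrete, here is a minimal base-R sketch. The ten toy labels, the `model_score` values, and the helper `auc_rank()` are all made up for illustration and are not part of the loan data set or the original analysis. The helper computes AUC through the rank-sum (Mann-Whitney) identity.

```r
# Hypothetical toy labels: 7 Fully.Paid loans and 3 Default loans.
actual     <- factor(c(rep("Fully.Paid", 7), rep("Default", 3)))
is_default <- actual == "Default"

# Baseline strategy: label every loan as Fully.Paid.
baseline_pred <- rep("Fully.Paid", length(actual))
baseline_acc  <- mean(baseline_pred == actual)

# AUC via the rank-sum identity: the probability that a randomly chosen
# positive (Default) receives a higher score than a randomly chosen negative.
auc_rank <- function(score, is_positive) {
  r  <- rank(score)                 # average ranks, so ties count as 1/2
  n1 <- sum(is_positive)
  n0 <- sum(!is_positive)
  (sum(r[is_positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# The baseline gives every loan the same score, so it cannot rank loans.
baseline_auc <- auc_rank(rep(0, length(actual)), is_default)

# Made-up scores standing in for a model's predicted P(Default).
model_score <- c(0.10, 0.20, 0.15, 0.30, 0.05, 0.60, 0.25, 0.70, 0.40, 0.80)
model_auc   <- auc_rank(model_score, is_default)

print(c(baseline_acc = baseline_acc,
        baseline_auc = baseline_auc,
        model_auc    = model_auc))
```

The all-Fully-Paid baseline looks respectable on accuracy but sits at chance level on AUC, which is why AUC is the more informative metric for this problem.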
Our target variable (loan status) is imbalanced: there are many Fully Paid loans and far fewer Default loans. To address this, we can downsample the majority class so that the target variable is split 50/50, that is, so that the number of Fully Paid loans equals the number of Default loans. This method tends to work well and runs faster than upsampling or cost-sensitive training. Downsampling helps because, as we saw above, it is otherwise trivial to achieve 70% accuracy simply by always predicting the majority class. It also reduces the size of the data.
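Downsampling as described above can be sketched in base R as follows. This is an illustration under stated assumptions, not necessarily the code used in this analysis: `toy_train`, its `int_rate` column, and the helper `downsample_majority()` are hypothetical stand-ins. The caret package's `downSample()` function provides similar functionality.

```r
set.seed(42)

# Hypothetical stand-in for the training data: 70 Fully.Paid, 30 Default.
toy_train <- data.frame(
  loan_status = factor(c(rep("Fully.Paid", 70), rep("Default", 30))),
  int_rate    = runif(100, min = 5, max = 25)   # made-up feature
)

# Keep every row of the smallest class and an equally sized random
# sample of rows from each larger class.
downsample_majority <- function(df, target = "loan_status") {
  n_min        <- min(table(df[[target]]))
  idx_by_class <- split(seq_len(nrow(df)), df[[target]])
  keep <- unlist(lapply(idx_by_class, function(idx) {
    if (length(idx) > n_min) sample(idx, n_min) else idx
  }), use.names = FALSE)
  df[sort(keep), , drop = FALSE]
}

balanced <- downsample_majority(toy_train)
print(table(balanced$loan_status))
```

After downsampling, both classes have the same number of rows, so a model can no longer score well by always predicting Fully Paid.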
Note that, in the end, we aim to stack the results of several learning models: Logistic Regression, SVM, Random Forest, and Extreme Gradient Boosting (XGB). The downside of downsampling is that information from the majority class is discarded, so we will draw a new downsampled data set each time we feed data to a model. We anticipate that stacking all four models can give a better result, since together they see more of the majority class than any single downsample does.
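The effect of drawing a fresh downsample per model can be illustrated with a small base-R sketch. It only counts how many majority-class rows the four samples cover between them; no models are trained. The toy data, the `id` column, the `model_names` labels, and the helper `draw_balanced()` are hypothetical and chosen for illustration only.

```r
set.seed(1)

# Hypothetical toy training set: 700 Fully.Paid and 300 Default loans.
toy_train <- data.frame(
  id          = 1:1000,
  loan_status = factor(c(rep("Fully.Paid", 700), rep("Default", 300)))
)

# One balanced draw: all Default rows plus an equal number of
# randomly chosen Fully.Paid rows.
draw_balanced <- function(df) {
  minority <- which(df$loan_status == "Default")
  majority <- which(df$loan_status == "Fully.Paid")
  df[c(minority, sample(majority, length(minority))), ]
}

# A fresh downsample for each model that will later be stacked.
model_names <- c("logistic", "svm", "random_forest", "xgb")
samples <- lapply(model_names, function(m) draw_balanced(toy_train))
names(samples) <- model_names

# Share of Fully.Paid rows seen by one model vs. by at least one of the four.
n_majority <- sum(toy_train$loan_status == "Fully.Paid")
seen_ids <- unique(unlist(lapply(samples, function(s) {
  s$id[s$loan_status == "Fully.Paid"]
})))
coverage_single <- sum(samples[[1]]$loan_status == "Fully.Paid") / n_majority
coverage_all    <- length(seen_ids) / n_majority

print(c(single = coverage_single, all_four = coverage_all))
```

In this toy setup each model sees 3/7 of the Fully Paid rows, while four independent draws together cover about 1 - (4/7)^4, roughly 89%, in expectation. That is the sense in which the stacked ensemble can use more majority-class information than any single downsampled model.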