Data Cleaning in R - Part 1
Discarding Attributes
LendingClub also provides a data dictionary that describes every attribute in our dataset. We can use it to understand the columns we have and to remove those that are unlikely to affect loan default.
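If you want to consult the dictionary from within R, the sketch below shows one way to do it. The file name LCDataDictionary.xlsx and the LoanStatNew column name are assumptions and may differ in your download.
# Hedged sketch: browse the LendingClub data dictionary from R.
# Assumes the dictionary was saved locally as "LCDataDictionary.xlsx"
# with columns LoanStatNew (attribute name) and Description.
library(readxl)
dictionary = read_excel("LCDataDictionary.xlsx")
# Look up the description of a single attribute, e.g. revol_util
dictionary[dictionary$LoanStatNew == "revol_util", ]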
Discard Attributes
We can use the data dictionary to identify and discard some attributes which we think are irrelevant or will have little impact on loan default.
discard_column = c("collection_recovery_fee","emp_title",
"funded_amnt_inv","id",
"installment","last_credit_pull_d",
"last_fico_range_high","last_fico_range_low",
"last_pymnt_amnt","last_pymnt_d",
"loan_amnt","member_id",
"next_pymnt_d","num_tl_120dpd_2m",
"num_tl_30dpd","out_prncp",
"out_prncp_inv","recoveries",
"total_pymnt","total_pymnt_inv",
"total_rec_int","total_rec_late_fee",
"total_rec_prncp","url",
"zip_code"
)
> data_train = (data_train[,!(names(data_train) %in% discard_column)])
> dim(data_train)
[1] 41909 122
>
Discard Grade Attribute
We will also drop the grade attribute, since the grade information is already available in sub_grade.
> data_train$grade = NULL
Discard Columns with Too Many NAs
We will also discard columns with too many missing values, specifically those where more than 50% of the values are NA.
> data_train <- data_train[, -which(colMeans(is.na(data_train)) > 0.5)]
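If you want to see which columns cross the threshold, a quick check such as the one below can be run first (run it before the removal above, otherwise the offending columns are already gone).
# Quick check (run before the removal): names of the columns whose
# fraction of NA values exceeds 50%.
na_fraction = colMeans(is.na(data_train))
names(na_fraction[na_fraction > 0.5])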
For the remaining columns with NAs, it is reasonable to impute the missing values with the column median. We build the imputation model with preProcess() from the caret package and store it as median_impute_model for further use. To do so, we first store the fraction of NA values in each column in the tmp variable.
> tmp = sort(sapply(data_train, function(x) sum(length(which(is.na(x)))))/nrow(data_train),decreasing = TRUE)
> median_impute_model = preProcess(data_train[names(tmp)],method="medianImpute")
> data_train = predict(median_impute_model,data_train)
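The reason for keeping median_impute_model is that the same training-set medians should later be used to fill missing values in the test set. A minimal sketch, assuming the test set sits in a data frame called data_test that has been through the same column filtering:
# Apply the medians learned on the training data to the test data.
# data_test is assumed to exist and to have the same columns as data_train.
data_test = predict(median_impute_model, data_test)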
Before proceeding, we check the data once more to confirm that no attributes with NA values remain.
> sort(sapply(data_train, function(x) sum(length(which(is.na(x)))))/nrow(data_train),decreasing = TRUE)
funded_amnt term int_rate
0 0 0
sub_grade emp_length home_ownership
0 0 0
annual_inc verification_status issue_d
0 0 0
loan_status pymnt_plan desc
0 0 0
purpose title addr_state
0 0 0
dti delinq_2yrs earliest_cr_line
0 0 0
inq_last_6mths mths_since_last_delinq open_acc
0 0 0
pub_rec revol_bal revol_util
0 0 0
total_acc initial_list_status collections_12_mths_ex_med
0 0 0
policy_code application_type verification_status_joint
0 0 0
acc_now_delinq tot_coll_amt tot_cur_bal
0 0 0
open_acc_6m open_act_il open_il_12m
0 0 0
open_il_24m mths_since_rcnt_il total_bal_il
0 0 0
il_util open_rv_12m open_rv_24m
0 0 0
max_bal_bc all_util total_rev_hi_lim
0 0 0
inq_fi total_cu_tl inq_last_12m
0 0 0
acc_open_past_24mths avg_cur_bal bc_open_to_buy
0 0 0
bc_util chargeoff_within_12_mths delinq_amnt
0 0 0
mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op
0 0 0
mo_sin_rcnt_tl mort_acc mths_since_recent_bc
0 0 0
mths_since_recent_inq num_accts_ever_120_pd num_actv_bc_tl
0 0 0
num_actv_rev_tl num_bc_sats num_bc_tl
0 0 0
num_il_tl num_op_rev_tl num_rev_accts
0 0 0
num_rev_tl_bal_gt_0 num_sats num_tl_90g_dpd_24m
0 0 0
num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75
0 0 0
pub_rec_bankruptcies tax_liens tot_hi_cred_lim
0 0 0
total_bal_ex_mort total_bc_limit total_il_high_credit_limit
0 0 0
hardship_flag hardship_type hardship_reason
0 0 0
hardship_status hardship_start_date hardship_end_date
0 0 0
payment_plan_start_date hardship_loan_status disbursement_method
0 0 0
debt_settlement_flag debt_settlement_flag_date settlement_status
0 0 0
settlement_date
0
>
Transformations
Let’s take a look at the remaining features:
> str(data_train)
'data.frame': 41909 obs. of 94 variables:
$ funded_amnt : int 10000 35000 14400 7250 10000 10000 25000 8400 6950 16000 ...
$ term : chr " 60 months" " 60 months" " 36 months" " 36 months" ...
$ int_rate : num 19.53 20.75 10.75 13.67 5.32 ...
$ sub_grade : chr "D5" "E2" "B4" "C3" ...
$ emp_length : chr "4 years" "3 years" "10+ years" "3 years" ...
$ home_ownership : chr "OWN" "MORTGAGE" "MORTGAGE" "MORTGAGE" ...
$ annual_inc : num 52000 85000 85000 72000 45000 ...
$ verification_status : chr "Source Verified" "Source Verified" "Not Verified" "Source Verified" ...
$ issue_d : chr "Mar-2016" "Mar-2016" "Mar-2016" "Mar-2016" ...
$ loan_status : chr "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
$ pymnt_plan : chr "n" "n" "n" "n" ...
$ desc : chr "" "" "" "" ...
$ purpose : chr "other" "debt_consolidation" "small_business" "debt_consolidation" ...
$ title : chr "Other" "Debt consolidation" "Business" "Debt consolidation" ...
$ addr_state : chr "GA" "NY" "NH" "IN" ...
$ dti : num 15 24.98 28.11 23.93 8.03 ...
$ delinq_2yrs : int 0 0 0 1 0 0 0 0 0 0 ...
$ earliest_cr_line : chr "Oct-2000" "Dec-2008" "Feb-2000" "Jan-1985" ...
$ inq_last_6mths : int 1 1 0 0 0 1 1 1 0 1 ...
$ mths_since_last_delinq : num 44 31 72 20 31 31 31 60 55 25 ...
$ open_acc : int 4 9 16 12 7 9 9 7 5 10 ...
$ pub_rec : int 2 0 0 1 1 0 0 0 4 0 ...
$ revol_bal : int 1077 10167 37582 12220 471 10139 47954 11059 7096 9891 ...
$ revol_util : chr "35.9%" "46.6%" "85.8%" "65%" ...
$ total_acc : int 12 19 33 24 15 14 22 22 26 16 ...
$ initial_list_status : chr "w" "w" "w" "w" ...
$ collections_12_mths_ex_med: int 0 0 0 0 0 0 0 1 0 0 ...
$ policy_code : int 1 1 1 1 1 1 1 1 1 1 ...
$ application_type : chr "Individual" "Individual" "Individual" "Individual" ...
$ verification_status_joint : chr "" "" "" "" ...
$ acc_now_delinq : int 0 0 0 0 0 0 0 0 0 0 ...
$ tot_coll_amt : int 622 0 0 0 0 0 0 0 0 2017 ...
$ tot_cur_bal : int 15886 161643 190166 366483 11970 42767 431227 157051 7096 22551 ...
$ open_acc_6m : num 2 0 0 1 0 1 2 1 0 0 ...
$ open_act_il : num 1 3 3 2 1 2 2 1 0 2 ...
$ open_il_12m : num 4 0 1 1 0 1 2 1 0 2 ...
$ open_il_24m : num 5 3 1 3 1 1 2 1 0 2 ...
$ mths_since_rcnt_il : num 2 14 12 3 23 6 6 10 91 9 ...
$ total_bal_il : num 14809 73863 22387 40343 11499 ...
$ il_util : num 99 83 47 92 72 73 61 91 76 67 ...
$ open_rv_12m : num 0 0 0 1 1 1 1 2 0 3 ...
$ open_rv_24m : num 0 1 0 1 2 1 1 2 1 4 ...
$ max_bal_bc : num 1007 5109 12211 3694 325 ...
$ all_util : num 88 71 66 84 54 55 60 86 59 50 ...
$ total_rev_hi_lim : int 3000 21800 43800 18800 6400 33000 81500 14100 12000 26700 ...
$ inq_fi : num 3 5 0 0 2 2 0 1 0 1 ...
$ total_cu_tl : num 0 1 0 0 2 0 1 1 0 1 ...
$ inq_last_12m : num 2 2 0 1 0 3 4 3 1 4 ...
$ acc_open_past_24mths : int 5 4 1 4 3 2 3 4 1 6 ...
$ avg_cur_bal : int 3972 17960 11885 30540 1710 4752 47914 22436 1419 2506 ...
$ bc_open_to_buy : num 1623 4833 3393 997 4329 ...
$ bc_util : num 39.9 67.8 87.4 90.1 9.8 50.7 35.8 90.6 69.8 49.5 ...
$ chargeoff_within_12_mths : int 0 0 0 0 0 0 0 0 0 0 ...
$ delinq_amnt : int 0 0 0 0 0 0 0 0 0 0 ...
$ mo_sin_old_il_acct : num 101 87 145 132 135 65 258 129 153 113 ...
$ mo_sin_old_rev_tl_op : int 185 65 193 374 310 65 246 129 205 184 ...
$ mo_sin_rcnt_rev_tl_op : int 25 22 26 8 10 10 4 2 17 7 ...
$ mo_sin_rcnt_tl : int 2 14 12 3 10 6 4 2 17 7 ...
$ mort_acc : int 0 1 6 4 2 0 4 2 8 0 ...
$ mths_since_recent_bc : num 25 22 32 59 10 10 4 89 31 7 ...
$ mths_since_recent_inq : num 4 5 20 9 23 6 4 2 11 0 ...
$ num_accts_ever_120_pd : int 2 0 0 3 0 0 0 2 1 2 ...
$ num_actv_bc_tl : int 2 3 5 3 2 4 3 2 3 4 ...
$ num_actv_rev_tl : int 2 3 11 5 2 4 4 4 4 6 ...
$ num_bc_sats : int 2 3 5 3 4 4 5 2 3 5 ...
$ num_bc_tl : int 3 4 9 4 5 4 7 7 7 5 ...
$ num_il_tl : int 7 9 7 8 4 4 8 2 3 7 ...
$ num_op_rev_tl : int 3 5 11 9 6 7 6 5 5 8 ...
$ num_rev_accts : int 5 9 20 12 9 10 10 12 15 9 ...
$ num_rev_tl_bal_gt_0 : int 2 3 11 5 2 4 4 4 4 6 ...
$ num_sats : int 4 9 16 12 7 9 9 7 5 10 ...
$ num_tl_90g_dpd_24m : int 0 0 0 1 0 0 0 0 0 0 ...
$ num_tl_op_past_12m : int 4 0 1 2 1 2 3 3 0 5 ...
$ pct_tl_nvr_dlq : num 83.3 100 93.9 83.3 100 100 100 86.4 91.3 93.3 ...
$ percent_bc_gt_75 : num 50 33.3 100 100 0 0 0 100 33.3 25 ...
$ pub_rec_bankruptcies : int 0 0 0 0 1 0 0 0 3 0 ...
$ tax_liens : int 0 0 0 1 0 0 0 0 1 0 ...
$ tot_hi_cred_lim : int 18000 193751 232277 374306 22363 77555 499524 166260 12000 45520 ...
$ total_bal_ex_mort : int 15886 84030 59969 52563 11970 42767 67575 30290 7096 22551 ...
$ total_bc_limit : int 2700 15000 27000 10100 4800 20000 52500 6200 9700 18800 ...
$ total_il_high_credit_limit: int 15000 87951 47352 43666 15963 44555 32024 21079 0 18820 ...
$ hardship_flag : chr "N" "N" "N" "N" ...
$ hardship_type : chr "" "" "" "" ...
$ hardship_reason : chr "" "" "" "" ...
$ hardship_status : chr "" "" "" "" ...
$ hardship_start_date : chr "" "" "" "" ...
$ hardship_end_date : chr "" "" "" "" ...
$ payment_plan_start_date : chr "" "" "" "" ...
$ hardship_loan_status : chr "" "" "" "" ...
$ disbursement_method : chr "Cash" "Cash" "Cash" "Cash" ...
$ debt_settlement_flag : chr "N" "N" "N" "N" ...
$ debt_settlement_flag_date : chr "" "" "" "" ...
$ settlement_status : chr "" "" "" "" ...
$ settlement_date : chr "" "" "" "" ...
>
Remove More Columns
From the str() output above we can see that the hardship and settlement related columns are either flags that are almost always "N" or status/date fields that are almost entirely empty, so we discard them as well.
discard_column = c("hardship_flag","hardship_type",
"hardship_reason","hardship_status",
"hardship_start_date","hardship_end_date",
"payment_plan_start_date","hardship_loan_status",
"disbursement_method","debt_settlement_flag",
"debt_settlement_flag_date","settlement_status",
"settlement_date"
)
> data_train = (data_train[,!(names(data_train) %in% discard_column)])
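To see why these columns add little, you can compute, before removing them, the share of rows in each column that is blank, NA, or just the default "N" flag:
# Run this before the removal above: proportion of rows in each discarded
# column that is an empty string, NA, or the default flag "N".
sapply(data_train[discard_column], function(x) mean(x == "" | x == "N" | is.na(x)))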
revol_util
We will strip the percent sign from revol_util and convert it to numeric.
data_train$revol_util = as.numeric(gsub(pattern = "%",replacement = "",x = data_train$revol_util))
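As a quick illustration of what this does:
# The percent sign is stripped and the result coerced to numeric.
as.numeric(gsub(pattern = "%", replacement = "", x = c("35.9%", "46.6%", "85.8%")))
# [1] 35.9 46.6 85.8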
earliest_cr_line
earliest_cr_line is transformed into the number of days between the earliest credit line and the date the loan was issued, using parse_date_time() from lubridate and str_c() from stringr.
data_train$earliest_cr_line = parse_date_time(str_c("01",data_train$issue_d),"dmy" ) - parse_date_time(str_c("01",data_train$earliest_cr_line),"dmy" )
data_train$earliest_cr_line = as.numeric(data_train$earliest_cr_line,units = "days")
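For a single record this works as follows (a small illustration; prepending "01" turns each month-year string into a full date):
# Example: loan issued in Mar-2016 with earliest credit line in Oct-2000.
library(lubridate)
library(stringr)
as.numeric(parse_date_time(str_c("01", "Mar-2016"), "dmy") -
           parse_date_time(str_c("01", "Oct-2000"), "dmy"), units = "days")
# [1] 5630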
Extract Issued Month
Next, we extract the month in which each loan was issued from issue_d and check whether the default rate varies by issue month.
# Count loans and defaults by issue month (uses dplyr and stringr)
data_train$issue_m = sapply(data_train$issue_d, function(x) str_split(x, "-")[[1]][1])
tmp = data_train %>% filter(loan_status=="Default") %>% group_by(issue_m) %>% summarise(default_count = n())
tmp2 = data_train %>% group_by(issue_m) %>% summarise(count = n())
tmp2 %>% left_join(tmp) %>% mutate(default_rate = default_count/count)
Joining, by = "issue_m"
# A tibble: 3 x 4
issue_m count default_count default_rate
1 Feb 12470 3779 0.303
2 Jan 10664 3090 0.290
3 Mar 18775 5761 0.307
The default rate is roughly 29-31% in each of the three months, so the issue month carries little information about default. We therefore drop both issue_m and issue_d.
data_train$issue_m = NULL
data_train$issue_d = NULL
rm(tmp,tmp2)
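Before moving on to the next part of the cleaning, it can be convenient to save the intermediate result; this is a suggestion rather than part of the original workflow, and the file name is arbitrary.
# Optional: save the partially cleaned training data so the next part
# of the cleaning can start from here.
saveRDS(data_train, "data_train_part1.rds")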