Remove Dimensions By Fitting Logistic Regression
We will use the preProcess function from the caret package to center and scale (standardize) the data. The center transform calculates the mean of an attribute and subtracts it from each value. The scale transform calculates the standard deviation of an attribute and divides each value by that standard deviation.
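To make the two transforms concrete, here is a minimal sketch on a toy numeric vector. The values are made up for illustration, and base R's scale() is used as a stand-in so the snippet runs without caret; on a numeric column, preProcess with "center" and "scale" is expected to give the same result.

```r
# Toy numeric column; values are made up for illustration.
x <- c(10, 20, 30, 40, 50)

# Center: subtract the mean. Scale: divide by the standard deviation.
x_manual <- (x - mean(x)) / sd(x)

# Base R's scale() applies the same two steps by default.
x_scaled <- as.numeric(scale(x))

# After the transform the column has mean 0 and standard deviation 1.
mean(x_manual)
sd(x_manual)
all.equal(x_manual, x_scaled)
```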
We can try to reduce the number of dimensions further by fitting a logistic regression and inspecting the p-values of the coefficients. For each feature, the null hypothesis is that it makes no contribution to the predictive model (its coefficient is zero). We then discard each feature for which we fail to reject this hypothesis.
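library(caret)  # provides preProcess
library(rms)    # provides lrm
# Learn the centering and scaling parameters from the training data, then apply them
trans_model = preProcess(data_train,method=c("center","scale"))
data_train = predict(trans_model, data_train)
model = lrm(loan_status ~ .,data_train)
(Required libraries: caret, rms)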
Let's take a look at our model.
> model
Logistic Regression Model
lrm(formula = loan_status ~ ., data = data_train)
                        Model Likelihood     Discrimination    Rank Discrim.
                           Ratio Test           Indexes           Indexes
Obs          41909    LR chi2    6127.08    R2       0.193    C       0.731
 Default     12630    d.f.           116    g        1.093    Dxy     0.463
 Fully.Paid  29279    Pr(> chi2) <0.0001    gr       2.983    gamma   0.463
max |deriv|  3e-12                          gp       0.194    tau-a   0.195
                                            Brier    0.181
Coef S.E. Wald Z Pr(>|Z|)
Intercept 1.0124 0.0127 79.92 <0.0001
term 60 months -0.1358 0.0137 -9.93 <0.0001
sub_gradeA2 -0.0921 0.0232 -3.97 <0.0001
sub_gradeA3 -0.0820 0.0245 -3.35 0.0008
sub_gradeA4 -0.0910 0.0260 -3.50 0.0005
sub_gradeA5 -0.1751 0.0298 -5.88 <0.0001
sub_gradeB1 -0.2126 0.0347 -6.12 <0.0001
sub_gradeB2 -0.2266 0.0404 -5.61 <0.0001
sub_gradeB3 -0.2881 0.0452 -6.37 <0.0001
sub_gradeB4 -0.3081 0.0538 -5.73 <0.0001
sub_gradeB5 -0.3517 0.0593 -5.93 <0.0001
sub_gradeC1 -0.3983 0.0652 -6.11 <0.0001
sub_gradeC2 -0.4362 0.0696 -6.27 <0.0001
sub_gradeC3 -0.4365 0.0736 -5.93 <0.0001
sub_gradeC4 -0.4644 0.0782 -5.94 <0.0001
sub_gradeC5 -0.4586 0.0788 -5.82 <0.0001
sub_gradeD1 -0.4120 0.0782 -5.27 <0.0001
sub_gradeD2 -0.4141 0.0766 -5.41 <0.0001
sub_gradeD3 -0.3961 0.0731 -5.42 <0.0001
sub_gradeD4 -0.3982 0.0795 -5.01 <0.0001
sub_gradeD5 -0.3800 0.0765 -4.97 <0.0001
sub_gradeE1 -0.4097 0.0789 -5.19 <0.0001
sub_gradeE2 -0.3699 0.0753 -4.91 <0.0001
sub_gradeE3 -0.3582 0.0750 -4.78 <0.0001
sub_gradeE4 -0.3420 0.0747 -4.58 <0.0001
sub_gradeE5 -0.3449 0.0713 -4.84 <0.0001
sub_gradeF1 -0.3061 0.0668 -4.58 <0.0001
sub_gradeF2 -0.2922 0.0622 -4.69 <0.0001
sub_gradeF3 -0.2441 0.0544 -4.49 <0.0001
sub_gradeF4 -0.2390 0.0542 -4.41 <0.0001
sub_gradeF5 -0.2593 0.0553 -4.69 <0.0001
sub_gradeG1 -0.1844 0.0452 -4.08 <0.0001
sub_gradeG2 -0.1563 0.0391 -4.00 <0.0001
sub_gradeG3 -0.1486 0.0368 -4.04 <0.0001
sub_gradeG4 -0.1447 0.0339 -4.27 <0.0001
sub_gradeG5 -0.1375 0.0355 -3.88 0.0001
emp_length10+ years 0.0273 0.0234 1.17 0.2429
emp_length2 years -0.0005 0.0167 -0.03 0.9759
emp_length3 years 0.0031 0.0163 0.19 0.8475
emp_length4 years 0.0212 0.0152 1.39 0.1632
emp_length5 years -0.0013 0.0152 -0.09 0.9293
emp_length6 years 0.0061 0.0142 0.43 0.6702
emp_length7 years 0.0048 0.0138 0.35 0.7272
emp_length8 years 0.0003 0.0146 0.02 0.9838
emp_length9 years 0.0040 0.0143 0.28 0.7791
emp_length< 1 year 0.0003 0.0160 0.02 0.9848
emp_lengthn/a -0.1120 0.0159 -7.05 <0.0001
home_ownershipOWN -0.0288 0.0126 -2.28 0.0224
home_ownershipRENT -0.1624 0.0147 -11.04 <0.0001
annual_inc 0.0633 0.0258 2.45 0.0141
verification_statusSource Verified -0.0386 0.0144 -2.68 0.0073
verification_statusVerified -0.0457 0.0144 -3.17 0.0015
purposecredit_card -0.1365 0.0552 -2.47 0.0134
purposedebt_consolidation -0.1427 0.0649 -2.20 0.0279
purposehome_improvement -0.1006 0.0347 -2.90 0.0037
purposehouse -0.0145 0.0142 -1.02 0.3068
purposemajor_purchase -0.0729 0.0226 -3.23 0.0012
purposemedical -0.0627 0.0178 -3.53 0.0004
purposemoving -0.0400 0.0153 -2.62 0.0088
purposeother -0.0868 0.0320 -2.71 0.0066
purposerenewable_energy -0.0020 0.0116 -0.17 0.8627
purposesmall_business -0.0780 0.0166 -4.70 <0.0001
purposevacation -0.0498 0.0153 -3.26 0.0011
dti -0.1527 0.0144 -10.61 <0.0001
delinq_2yrs -0.0756 0.0187 -4.05 <0.0001
earliest_cr_line -0.0635 0.0139 -4.58 <0.0001
inq_last_6mths -0.0262 0.0148 -1.77 0.0767
mths_since_last_delinq 0.0379 0.0136 2.78 0.0054
pub_rec -0.0063 0.0326 -0.19 0.8469
revol_bal -0.0349 0.0148 -2.36 0.0182
revol_util 0.2766 0.1808 1.53 0.1260
initial_list_statusw 0.0163 0.0119 1.37 0.1711
collections_12_mths_ex_med -0.0230 0.0114 -2.03 0.0427
application_typeJoint App 0.0452 0.0116 3.90 <0.0001
acc_now_delinq 0.0180 0.0120 1.50 0.1342
tot_coll_amt 0.0154 0.0125 1.23 0.2193
open_acc_6m -0.0630 0.0174 -3.61 0.0003
open_act_il 0.0165 0.0167 0.99 0.3243
open_il_12m -0.0291 0.0156 -1.87 0.0613
mths_since_rcnt_il -0.0208 0.0137 -1.52 0.1284
total_bal_il 0.0334 0.0165 2.03 0.0425
il_util -0.0053 0.0151 -0.35 0.7259
open_rv_12m -0.0010 0.0167 -0.06 0.9540
max_bal_bc 0.1018 0.0152 6.69 <0.0001
all_util -0.0949 0.0188 -5.06 <0.0001
inq_fi -0.0578 0.0137 -4.20 <0.0001
total_cu_tl 0.0550 0.0130 4.23 <0.0001
inq_last_12m 0.0222 0.0166 1.34 0.1810
avg_cur_bal 0.1261 0.0229 5.50 <0.0001
bc_open_to_buy 0.0915 0.0206 4.44 <0.0001
chargeoff_within_12_mths -0.0159 0.0113 -1.41 0.1588
delinq_amnt -0.0136 0.0106 -1.27 0.2025
mo_sin_old_il_acct -0.0214 0.0132 -1.62 0.1045
mo_sin_rcnt_rev_tl_op 0.0223 0.0186 1.20 0.2312
mo_sin_rcnt_tl 0.0067 0.0172 0.39 0.6968
mort_acc 0.1478 0.0157 9.39 <0.0001
mths_since_recent_bc 0.0425 0.0161 2.65 0.0081
mths_since_recent_inq 0.0233 0.0144 1.62 0.1056
num_accts_ever_120_pd 0.0096 0.0153 0.63 0.5298
num_actv_bc_tl -0.1563 0.0157 -9.96 <0.0001
num_bc_tl 0.0832 0.0169 4.92 <0.0001
num_il_tl 0.0373 0.0180 2.07 0.0381
num_tl_90g_dpd_24m 0.0505 0.0162 3.12 0.0018
pct_tl_nvr_dlq 0.0310 0.0160 1.93 0.0536
percent_bc_gt_75 -0.0761 0.0150 -5.07 <0.0001
pub_rec_bankruptcies 0.0089 0.0225 0.40 0.6911
tax_liens -0.0056 0.0252 -0.22 0.8238
is_ny -0.0276 0.0117 -2.36 0.0185
is_pa -0.0312 0.0113 -2.77 0.0056
is_nj -0.0254 0.0115 -2.21 0.0269
is_oh -0.0229 0.0115 -2.00 0.0458
is_fl -0.0108 0.0118 -0.91 0.3618
is_co 0.0419 0.0125 3.35 0.0008
is_ga 0.0292 0.0121 2.42 0.0155
is_va -0.0130 0.0117 -1.11 0.2665
is_az -0.0042 0.0118 -0.36 0.7205
is_ca 0.0124 0.0125 0.99 0.3205
>
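Each p-value in the table above comes from a Wald test: the coefficient is divided by its standard error, and the resulting Z statistic is compared against a standard normal distribution. As a rough check, the sketch below recomputes the annual_inc row from the rounded figures printed in the table, so the p-value only approximately matches the printed 0.0141.

```r
# Values copied from the annual_inc row of the printed coefficient table.
coef_est <- 0.0633
std_err  <- 0.0258

# Wald Z statistic: estimate divided by its standard error.
z <- coef_est / std_err

# Two-tailed p-value under the standard normal distribution.
p <- 2 * pnorm(-abs(z))

z
p
```

The Z statistic comes out close to the 2.45 shown in the table, and the p-value lies above 0.01.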
We set our two-tailed p-value cutoff at 0.01 and discard every feature whose p-value exceeds this threshold.
(Required library: dplyr)
> tmp = as.data.frame(anova(model))
> tmp$feature = rownames(tmp)
> tmp = tmp %>% filter(P > 0.01) %>% select(feature,P)
> tmp
feature P
1 emp_length10+ years 0.24294486
2 emp_length2 years 0.97590140
3 emp_length3 years 0.84747663
4 emp_length4 years 0.16323613
5 emp_length5 years 0.92929075
6 emp_length6 years 0.67016558
7 emp_length7 years 0.72718053
8 emp_length8 years 0.98379691
9 emp_length9 years 0.77913329
10 emp_length< 1 year 0.98481178
11 home_ownershipOWN 0.02242193
12 annual_inc 0.01409361
13 purposecredit_card 0.01339868
14 purposedebt_consolidation 0.02785530
15 purposehouse 0.30682639
16 purposerenewable_energy 0.86272358
17 inq_last_6mths 0.07665570
18 pub_rec 0.84689291
19 revol_bal 0.01821084
20 revol_util 0.12604079
21 initial_list_statusw 0.17105737
22 collections_12_mths_ex_med 0.04272762
23 acc_now_delinq 0.13421249
24 tot_coll_amt 0.21928677
25 open_act_il 0.32432270
26 open_il_12m 0.06133265
27 mths_since_rcnt_il 0.12838908
28 total_bal_il 0.04248475
29 il_util 0.72592433
30 open_rv_12m 0.95403016
31 inq_last_12m 0.18102960
32 chargeoff_within_12_mths 0.15879401
33 delinq_amnt 0.20247434
34 mo_sin_old_il_acct 0.10452072
35 mo_sin_rcnt_rev_tl_op 0.23123190
36 mo_sin_rcnt_tl 0.69681695
37 mths_since_recent_inq 0.10561320
38 num_accts_ever_120_pd 0.52975942
39 num_il_tl 0.03809880
40 pct_tl_nvr_dlq 0.05360825
41 pub_rec_bankruptcies 0.69105180
42 tax_liens 0.82384179
43 is_ny 0.01850732
44 is_nj 0.02686627
45 is_oh 0.04582818
46 is_fl 0.36182627
47 is_ga 0.01551415
48 is_va 0.26648933
49 is_az 0.72047895
50 is_ca 0.32050010
>
> data_train = (data_train[,!(names(data_train) %in% tmp$feature)])
> rm(model,tmp)
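The mechanics of this filtering step can be tried without the full data set. The sketch below is only an illustration: the four p-values are copied from the coefficient table above, while the toy data frame and its values are made up, and base R subsetting replaces the dplyr pipeline.

```r
# P-values copied from four rows of the coefficient table above.
pvals <- data.frame(
  feature = c("annual_inc", "pub_rec", "is_pa", "is_co"),
  P       = c(0.0141, 0.8469, 0.0056, 0.0008),
  stringsAsFactors = FALSE
)

# Toy data frame with one column per feature plus the target; values are made up.
toy <- data.frame(
  loan_status = c("Default", "Fully.Paid", "Fully.Paid"),
  annual_inc  = c(0.2, -1.1, 0.9),
  pub_rec     = c(0, 1, 0),
  is_pa       = c(1, 0, 0),
  is_co       = c(0, 0, 1)
)

# Features whose p-value exceeds the cutoff are dropped.
cutoff  <- 0.01
to_drop <- pvals$feature[pvals$P > cutoff]

# drop = FALSE keeps a data frame even if only one column survives.
toy_reduced <- toy[, !(names(toy) %in% to_drop), drop = FALSE]
names(toy_reduced)
```

Here annual_inc and pub_rec are removed, while is_pa and is_co stay together with the target column.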
Some of the remaining feature names are not valid R names because they contain spaces, "<" or "/". We will replace these characters with "_" (required library: stringr).
colnames(data_train) = str_replace_all(colnames(data_train)," ","_")
colnames(data_train) = str_replace_all(colnames(data_train),"<","_")
colnames(data_train) = str_replace_all(colnames(data_train),"/","_")
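As an alternative sketch, the three replacements can be collapsed into one regular expression with a character class. The example uses base R's gsub so that it runs without stringr, and the sample names are taken from the coefficient table above; make.names is used only to check that the cleaned names are syntactically valid.

```r
# Column names as they appear in the coefficient table above.
old_names <- c("term 60 months", "emp_lengthn/a", "application_typeJoint App")

# One character class covers space, "<" and "/" in a single pass.
new_names <- gsub("[ </]", "_", old_names)
new_names

# make.names() leaves a name unchanged only if it is already valid.
new_names == make.names(new_names)
```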
Our data is now ready for building the predictive model.