Remove Dimensions By Fitting Logistic Regression

We will use the preProcess function from the caret package to center and scale (standardize) the data. The center transform calculates the mean of an attribute and subtracts it from each value. The scale transform calculates the standard deviation of an attribute and divides each value by that standard deviation.
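To make the two transforms concrete, here is a minimal base-R sketch on a made-up vector. The values and the names `x`, `x_centered`, `x_scaled`, and `x_check` are illustrative and do not come from the loan data; the sketch assumes the sample standard deviation is the one used for scaling.

```r
# Toy numeric attribute (illustrative values only)
x <- c(10, 20, 30, 40, 50)

# Center: subtract the attribute's mean from each value
x_centered <- x - mean(x)

# Scale: divide by the attribute's (sample) standard deviation
x_scaled <- x_centered / sd(x)

# base::scale() performs the same two steps in one call
x_check <- as.numeric(scale(x, center = TRUE, scale = TRUE))

print(x_scaled)
print(all.equal(x_scaled, x_check))
```

After both steps the attribute has mean 0 and standard deviation 1, which puts coefficients of different features on a comparable scale.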

We can try to reduce the number of dimensions further by fitting a logistic regression and inspecting the p-values of the coefficients. The null hypothesis is that a feature makes no contribution to the predictive model (its coefficient is zero). We then discard each feature for which we fail to reject the null hypothesis.

library(caret)  # provides preProcess
library(rms)    # provides lrm
trans_model = preProcess(data_train, method = c("center", "scale"))
data_train = predict(trans_model, data_train)
model = lrm(loan_status ~ ., data_train)

(Required libraries: caret, rms)

Let's take a look at our model.

> model
Logistic Regression Model
 lrm(formula = loan_status ~ ., data = data_train)
                       Model Likelihood     Discrimination    Rank Discrim.
                          Ratio Test           Indexes           Indexes
 Obs         41909    LR chi2    6127.08    R2       0.193    C       0.731
  Default    12630    d.f.           116    g        1.093    Dxy     0.463
  Fully.Paid 29279    Pr(> chi2) <0.0001    gr       2.983    gamma   0.463
 max |deriv| 3e-12                          gp       0.194    tau-a   0.195
                                            Brier    0.181
                                    Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept                           1.0124 0.0127  79.92 <0.0001
 term 60 months                     -0.1358 0.0137  -9.93 <0.0001
 sub_gradeA2                        -0.0921 0.0232  -3.97 <0.0001
 sub_gradeA3                        -0.0820 0.0245  -3.35 0.0008
 sub_gradeA4                        -0.0910 0.0260  -3.50 0.0005
 sub_gradeA5                        -0.1751 0.0298  -5.88 <0.0001
 sub_gradeB1                        -0.2126 0.0347  -6.12 <0.0001
 sub_gradeB2                        -0.2266 0.0404  -5.61 <0.0001
 sub_gradeB3                        -0.2881 0.0452  -6.37 <0.0001
 sub_gradeB4                        -0.3081 0.0538  -5.73 <0.0001
 sub_gradeB5                        -0.3517 0.0593  -5.93 <0.0001
 sub_gradeC1                        -0.3983 0.0652  -6.11 <0.0001
 sub_gradeC2                        -0.4362 0.0696  -6.27 <0.0001
 sub_gradeC3                        -0.4365 0.0736  -5.93 <0.0001
 sub_gradeC4                        -0.4644 0.0782  -5.94 <0.0001
 sub_gradeC5                        -0.4586 0.0788  -5.82 <0.0001
 sub_gradeD1                        -0.4120 0.0782  -5.27 <0.0001
 sub_gradeD2                        -0.4141 0.0766  -5.41 <0.0001
 sub_gradeD3                        -0.3961 0.0731  -5.42 <0.0001
 sub_gradeD4                        -0.3982 0.0795  -5.01 <0.0001
 sub_gradeD5                        -0.3800 0.0765  -4.97 <0.0001
 sub_gradeE1                        -0.4097 0.0789  -5.19 <0.0001
 sub_gradeE2                        -0.3699 0.0753  -4.91 <0.0001
 sub_gradeE3                        -0.3582 0.0750  -4.78 <0.0001
 sub_gradeE4                        -0.3420 0.0747  -4.58 <0.0001
 sub_gradeE5                        -0.3449 0.0713  -4.84 <0.0001
 sub_gradeF1                        -0.3061 0.0668  -4.58 <0.0001
 sub_gradeF2                        -0.2922 0.0622  -4.69 <0.0001
 sub_gradeF3                        -0.2441 0.0544  -4.49 <0.0001
 sub_gradeF4                        -0.2390 0.0542  -4.41 <0.0001
 sub_gradeF5                        -0.2593 0.0553  -4.69 <0.0001
 sub_gradeG1                        -0.1844 0.0452  -4.08 <0.0001
 sub_gradeG2                        -0.1563 0.0391  -4.00 <0.0001
 sub_gradeG3                        -0.1486 0.0368  -4.04 <0.0001
 sub_gradeG4                        -0.1447 0.0339  -4.27 <0.0001
 sub_gradeG5                        -0.1375 0.0355  -3.88 0.0001
 emp_length10+ years                 0.0273 0.0234   1.17 0.2429
 emp_length2 years                  -0.0005 0.0167  -0.03 0.9759
 emp_length3 years                   0.0031 0.0163   0.19 0.8475
 emp_length4 years                   0.0212 0.0152   1.39 0.1632
 emp_length5 years                  -0.0013 0.0152  -0.09 0.9293
 emp_length6 years                   0.0061 0.0142   0.43 0.6702
 emp_length7 years                   0.0048 0.0138   0.35 0.7272
 emp_length8 years                   0.0003 0.0146   0.02 0.9838
 emp_length9 years                   0.0040 0.0143   0.28 0.7791
 emp_length< 1 year                  0.0003 0.0160   0.02 0.9848
 emp_lengthn/a                      -0.1120 0.0159  -7.05 <0.0001
 home_ownershipOWN                  -0.0288 0.0126  -2.28 0.0224
 home_ownershipRENT                 -0.1624 0.0147 -11.04 <0.0001
 annual_inc                          0.0633 0.0258   2.45 0.0141
 verification_statusSource Verified -0.0386 0.0144  -2.68 0.0073
 verification_statusVerified        -0.0457 0.0144  -3.17 0.0015
 purposecredit_card                 -0.1365 0.0552  -2.47 0.0134
 purposedebt_consolidation          -0.1427 0.0649  -2.20 0.0279
 purposehome_improvement            -0.1006 0.0347  -2.90 0.0037
 purposehouse                       -0.0145 0.0142  -1.02 0.3068
 purposemajor_purchase              -0.0729 0.0226  -3.23 0.0012
 purposemedical                     -0.0627 0.0178  -3.53 0.0004
 purposemoving                      -0.0400 0.0153  -2.62 0.0088
 purposeother                       -0.0868 0.0320  -2.71 0.0066
 purposerenewable_energy            -0.0020 0.0116  -0.17 0.8627
 purposesmall_business              -0.0780 0.0166  -4.70 <0.0001
 purposevacation                    -0.0498 0.0153  -3.26 0.0011
 dti                                -0.1527 0.0144 -10.61 <0.0001
 delinq_2yrs                        -0.0756 0.0187  -4.05 <0.0001
 earliest_cr_line                   -0.0635 0.0139  -4.58 <0.0001
 inq_last_6mths                     -0.0262 0.0148  -1.77 0.0767
 mths_since_last_delinq              0.0379 0.0136   2.78 0.0054
 pub_rec                            -0.0063 0.0326  -0.19 0.8469
 revol_bal                          -0.0349 0.0148  -2.36 0.0182
 revol_util                          0.2766 0.1808   1.53 0.1260
 initial_list_statusw                0.0163 0.0119   1.37 0.1711
 collections_12_mths_ex_med         -0.0230 0.0114  -2.03 0.0427
 application_typeJoint App           0.0452 0.0116   3.90 <0.0001
 acc_now_delinq                      0.0180 0.0120   1.50 0.1342
 tot_coll_amt                        0.0154 0.0125   1.23 0.2193
 open_acc_6m                        -0.0630 0.0174  -3.61 0.0003
 open_act_il                         0.0165 0.0167   0.99 0.3243
 open_il_12m                        -0.0291 0.0156  -1.87 0.0613
 mths_since_rcnt_il                 -0.0208 0.0137  -1.52 0.1284
 total_bal_il                        0.0334 0.0165   2.03 0.0425
 il_util                            -0.0053 0.0151  -0.35 0.7259
 open_rv_12m                        -0.0010 0.0167  -0.06 0.9540
 max_bal_bc                          0.1018 0.0152   6.69 <0.0001
 all_util                           -0.0949 0.0188  -5.06 <0.0001
 inq_fi                             -0.0578 0.0137  -4.20 <0.0001
 total_cu_tl                         0.0550 0.0130   4.23 <0.0001
 inq_last_12m                        0.0222 0.0166   1.34 0.1810
 avg_cur_bal                         0.1261 0.0229   5.50 <0.0001
 bc_open_to_buy                      0.0915 0.0206   4.44 <0.0001
 chargeoff_within_12_mths           -0.0159 0.0113  -1.41 0.1588
 delinq_amnt                        -0.0136 0.0106  -1.27 0.2025
 mo_sin_old_il_acct                 -0.0214 0.0132  -1.62 0.1045
 mo_sin_rcnt_rev_tl_op               0.0223 0.0186   1.20 0.2312
 mo_sin_rcnt_tl                      0.0067 0.0172   0.39 0.6968
 mort_acc                            0.1478 0.0157   9.39 <0.0001
 mths_since_recent_bc                0.0425 0.0161   2.65 0.0081
 mths_since_recent_inq               0.0233 0.0144   1.62 0.1056
 num_accts_ever_120_pd               0.0096 0.0153   0.63 0.5298
 num_actv_bc_tl                     -0.1563 0.0157  -9.96 <0.0001
 num_bc_tl                           0.0832 0.0169   4.92 <0.0001
 num_il_tl                           0.0373 0.0180   2.07 0.0381
 num_tl_90g_dpd_24m                  0.0505 0.0162   3.12 0.0018
 pct_tl_nvr_dlq                      0.0310 0.0160   1.93 0.0536
 percent_bc_gt_75                   -0.0761 0.0150  -5.07 <0.0001
 pub_rec_bankruptcies                0.0089 0.0225   0.40 0.6911
 tax_liens                          -0.0056 0.0252  -0.22 0.8238
 is_ny                              -0.0276 0.0117  -2.36 0.0185
 is_pa                              -0.0312 0.0113  -2.77 0.0056
 is_nj                              -0.0254 0.0115  -2.21 0.0269
 is_oh                              -0.0229 0.0115  -2.00 0.0458
 is_fl                              -0.0108 0.0118  -0.91 0.3618
 is_co                               0.0419 0.0125   3.35 0.0008
 is_ga                               0.0292 0.0121   2.42 0.0155
 is_va                              -0.0130 0.0117  -1.11 0.2665
 is_az                              -0.0042 0.0118  -0.36 0.7205
 is_ca                               0.0124 0.0125   0.99 0.3205
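As a rough check on how the last two columns relate to the first two, the Wald Z statistic is the coefficient divided by its standard error, and the two-sided p-value follows from the standard normal distribution. The sketch below recomputes both for the annual_inc row using the printed (rounded) values, so small differences from the table are expected. The variable names are illustrative.

```r
# Values copied from the annual_inc row of the printed output (rounded)
coef_hat <- 0.0633
std_err  <- 0.0258

# Wald Z: estimate divided by its standard error
wald_z <- coef_hat / std_err

# Two-sided p-value under the standard normal distribution
p_value <- 2 * pnorm(-abs(wald_z))

print(round(wald_z, 2))
print(round(p_value, 4))
```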

We set our two-tailed p-value cutoff at 0.01 and discard every feature whose p-value exceeds this threshold.

> library(dplyr)  # provides %>%, filter and select
> tmp = as.data.frame(anova(model))
> tmp$feature = rownames(tmp)
> tmp = tmp %>% filter(P > 0.01) %>% select(feature,P)
> tmp
                      feature          P
1         emp_length10+ years 0.24294486
2           emp_length2 years 0.97590140
3           emp_length3 years 0.84747663
4           emp_length4 years 0.16323613
5           emp_length5 years 0.92929075
6           emp_length6 years 0.67016558
7           emp_length7 years 0.72718053
8           emp_length8 years 0.98379691
9           emp_length9 years 0.77913329
10         emp_length< 1 year 0.98481178
11          home_ownershipOWN 0.02242193
12                 annual_inc 0.01409361
13         purposecredit_card 0.01339868
14  purposedebt_consolidation 0.02785530
15               purposehouse 0.30682639
16    purposerenewable_energy 0.86272358
17             inq_last_6mths 0.07665570
18                    pub_rec 0.84689291
19                  revol_bal 0.01821084
20                 revol_util 0.12604079
21       initial_list_statusw 0.17105737
22 collections_12_mths_ex_med 0.04272762
23             acc_now_delinq 0.13421249
24               tot_coll_amt 0.21928677
25                open_act_il 0.32432270
26                open_il_12m 0.06133265
27         mths_since_rcnt_il 0.12838908
28               total_bal_il 0.04248475
29                    il_util 0.72592433
30                open_rv_12m 0.95403016
31               inq_last_12m 0.18102960
32   chargeoff_within_12_mths 0.15879401
33                delinq_amnt 0.20247434
34         mo_sin_old_il_acct 0.10452072
35      mo_sin_rcnt_rev_tl_op 0.23123190
36             mo_sin_rcnt_tl 0.69681695
37      mths_since_recent_inq 0.10561320
38      num_accts_ever_120_pd 0.52975942
39                  num_il_tl 0.03809880
40             pct_tl_nvr_dlq 0.05360825
41       pub_rec_bankruptcies 0.69105180
42                  tax_liens 0.82384179
43                      is_ny 0.01850732
44                      is_nj 0.02686627
45                      is_oh 0.04582818
46                      is_fl 0.36182627
47                      is_ga 0.01551415
48                      is_va 0.26648933
49                      is_az 0.72047895
50                      is_ca 0.32050010

> data_train = (data_train[,!(names(data_train) %in% tmp$feature)])
> rm(model,tmp)
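The same filter-and-drop pattern can be sketched with base R alone. The example below swaps in glm for lrm and runs on simulated data rather than the loan data; the column names `signal` and `noise`, the sample size, and the effect size are all made up for illustration.

```r
set.seed(42)

# Simulated data: `signal` drives the outcome, `noise` does not
n <- 2000
sim <- data.frame(signal = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(1.5 * sim$signal))

# Fit a logistic regression with base R's glm (a stand-in for rms::lrm)
fit <- glm(y ~ ., data = sim, family = binomial)

# Wald p-values for every coefficient except the intercept
p_values <- summary(fit)$coefficients[-1, "Pr(>|z|)"]

# Features whose p-value exceeds the cutoff are candidates for removal
cutoff <- 0.01
to_drop <- names(p_values)[p_values > cutoff]

# Drop them, mirroring the column filter used above
sim_reduced <- sim[, !(names(sim) %in% to_drop), drop = FALSE]
print(names(sim_reduced))
```

Note that a p-value filter is not a guarantee: at a 0.01 cutoff, a feature with no real effect can still pass roughly 1% of the time.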

Some feature names contain characters that are not valid in R variable names (spaces, "<" and "/"). We will replace these characters with "_".

library(stringr)  # provides str_replace_all
colnames(data_train) = str_replace_all(colnames(data_train), " ", "_")
colnames(data_train) = str_replace_all(colnames(data_train), "<", "_")
colnames(data_train) = str_replace_all(colnames(data_train), "/", "_")
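If stringr is not available, the three replacements can be collapsed into a single base-R call with a character class. This is only a sketch: the sample names are taken from the model output earlier in this section, and `old_names` and `new_names` are illustrative variable names.

```r
# Example column names containing a space, "<", or "/"
old_names <- c("term 60 months", "emp_lengthn/a", "emp_length< 1 year")

# Replace every space, "<", or "/" with an underscore in one pass
new_names <- gsub("[ </]", "_", old_names)

print(new_names)
```

Applied to the real data, the same call would be `colnames(data_train) = gsub("[ </]", "_", colnames(data_train))`.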

Our data is now ready for building the predictive model.