Remove Dimensions By Fitting Logistic Regression
We will use the preProcess function from the caret package to center and scale (standardize) the data. The center transform calculates the mean of an attribute and subtracts it from each value. The scale transform calculates the standard deviation of an attribute and divides each value by that standard deviation.
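As a minimal sketch of what the two transforms do, the base R snippet below standardizes a small made-up vector by hand and compares the result with base R's `scale()`. The vector `x` is an assumption for illustration and is not part of the loan data.

```r
# Toy attribute (hypothetical values, not taken from the loan data)
x <- c(10, 20, 30, 40, 50)

# Center: subtract the mean from each value
x_centered <- x - mean(x)

# Scale: divide each centered value by the standard deviation
x_scaled <- x_centered / sd(x)

# Base R's scale() applies the same two steps to each column
x_ref <- as.numeric(scale(x))

print(round(x_scaled, 4))
print(all.equal(x_scaled, x_ref))
```

After both steps the attribute has mean 0 and standard deviation 1, which puts the numeric features on a comparable scale before the model is fitted.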
We can try to reduce the number of dimensions further by fitting a logistic regression and inspecting the p-values of its coefficients. For each feature, the null hypothesis is that it makes no contribution to the predictive model (its coefficient is zero). We then discard every feature for which we fail to reject the null hypothesis.
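library(caret)  # provides preProcess
library(rms)    # provides lrm
trans_model = preProcess(data_train, method = c("center", "scale"))  # learn means and standard deviations
data_train = predict(trans_model, data_train)  # apply the transform to the training data
model = lrm(loan_status ~ ., data_train)  # logistic regression on all features
(Required Library - caret, rms)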
Let's take a look at our model.
> model
Logistic Regression Model

 lrm(formula = loan_status ~ ., data = data_train)

                       Model Likelihood    Discrimination    Rank Discrim.
                             Ratio Test           Indexes          Indexes
Obs         41909    LR chi2    6127.08    R2       0.193    C       0.731
 Default    12630    d.f.           116    g        1.093    Dxy     0.463
 Fully.Paid 29279    Pr(> chi2) <0.0001    gr       2.983    gamma   0.463
max |deriv| 3e-12                          gp       0.194    tau-a   0.195
                                           Brier    0.181

                                       Coef    S.E.  Wald Z  Pr(>|Z|)
Intercept                            1.0124  0.0127   79.92   <0.0001
term 60 months                      -0.1358  0.0137   -9.93   <0.0001
sub_gradeA2                         -0.0921  0.0232   -3.97   <0.0001
sub_gradeA3                         -0.0820  0.0245   -3.35    0.0008
sub_gradeA4                         -0.0910  0.0260   -3.50    0.0005
sub_gradeA5                         -0.1751  0.0298   -5.88   <0.0001
sub_gradeB1                         -0.2126  0.0347   -6.12   <0.0001
sub_gradeB2                         -0.2266  0.0404   -5.61   <0.0001
sub_gradeB3                         -0.2881  0.0452   -6.37   <0.0001
sub_gradeB4                         -0.3081  0.0538   -5.73   <0.0001
sub_gradeB5                         -0.3517  0.0593   -5.93   <0.0001
sub_gradeC1                         -0.3983  0.0652   -6.11   <0.0001
sub_gradeC2                         -0.4362  0.0696   -6.27   <0.0001
sub_gradeC3                         -0.4365  0.0736   -5.93   <0.0001
sub_gradeC4                         -0.4644  0.0782   -5.94   <0.0001
sub_gradeC5                         -0.4586  0.0788   -5.82   <0.0001
sub_gradeD1                         -0.4120  0.0782   -5.27   <0.0001
sub_gradeD2                         -0.4141  0.0766   -5.41   <0.0001
sub_gradeD3                         -0.3961  0.0731   -5.42   <0.0001
sub_gradeD4                         -0.3982  0.0795   -5.01   <0.0001
sub_gradeD5                         -0.3800  0.0765   -4.97   <0.0001
sub_gradeE1                         -0.4097  0.0789   -5.19   <0.0001
sub_gradeE2                         -0.3699  0.0753   -4.91   <0.0001
sub_gradeE3                         -0.3582  0.0750   -4.78   <0.0001
sub_gradeE4                         -0.3420  0.0747   -4.58   <0.0001
sub_gradeE5                         -0.3449  0.0713   -4.84   <0.0001
sub_gradeF1                         -0.3061  0.0668   -4.58   <0.0001
sub_gradeF2                         -0.2922  0.0622   -4.69   <0.0001
sub_gradeF3                         -0.2441  0.0544   -4.49   <0.0001
sub_gradeF4                         -0.2390  0.0542   -4.41   <0.0001
sub_gradeF5                         -0.2593  0.0553   -4.69   <0.0001
sub_gradeG1                         -0.1844  0.0452   -4.08   <0.0001
sub_gradeG2                         -0.1563  0.0391   -4.00   <0.0001
sub_gradeG3                         -0.1486  0.0368   -4.04   <0.0001
sub_gradeG4                         -0.1447  0.0339   -4.27   <0.0001
sub_gradeG5                         -0.1375  0.0355   -3.88    0.0001
emp_length10+ years                  0.0273  0.0234    1.17    0.2429
emp_length2 years                   -0.0005  0.0167   -0.03    0.9759
emp_length3 years                    0.0031  0.0163    0.19    0.8475
emp_length4 years                    0.0212  0.0152    1.39    0.1632
emp_length5 years                   -0.0013  0.0152   -0.09    0.9293
emp_length6 years                    0.0061  0.0142    0.43    0.6702
emp_length7 years                    0.0048  0.0138    0.35    0.7272
emp_length8 years                    0.0003  0.0146    0.02    0.9838
emp_length9 years                    0.0040  0.0143    0.28    0.7791
emp_length< 1 year                   0.0003  0.0160    0.02    0.9848
emp_lengthn/a                       -0.1120  0.0159   -7.05   <0.0001
home_ownershipOWN                   -0.0288  0.0126   -2.28    0.0224
home_ownershipRENT                  -0.1624  0.0147  -11.04   <0.0001
annual_inc                           0.0633  0.0258    2.45    0.0141
verification_statusSource Verified  -0.0386  0.0144   -2.68    0.0073
verification_statusVerified         -0.0457  0.0144   -3.17    0.0015
purposecredit_card                  -0.1365  0.0552   -2.47    0.0134
purposedebt_consolidation           -0.1427  0.0649   -2.20    0.0279
purposehome_improvement             -0.1006  0.0347   -2.90    0.0037
purposehouse                        -0.0145  0.0142   -1.02    0.3068
purposemajor_purchase               -0.0729  0.0226   -3.23    0.0012
purposemedical                      -0.0627  0.0178   -3.53    0.0004
purposemoving                       -0.0400  0.0153   -2.62    0.0088
purposeother                        -0.0868  0.0320   -2.71    0.0066
purposerenewable_energy             -0.0020  0.0116   -0.17    0.8627
purposesmall_business               -0.0780  0.0166   -4.70   <0.0001
purposevacation                     -0.0498  0.0153   -3.26    0.0011
dti                                 -0.1527  0.0144  -10.61   <0.0001
delinq_2yrs                         -0.0756  0.0187   -4.05   <0.0001
earliest_cr_line                    -0.0635  0.0139   -4.58   <0.0001
inq_last_6mths                      -0.0262  0.0148   -1.77    0.0767
mths_since_last_delinq               0.0379  0.0136    2.78    0.0054
pub_rec                             -0.0063  0.0326   -0.19    0.8469
revol_bal                           -0.0349  0.0148   -2.36    0.0182
revol_util                           0.2766  0.1808    1.53    0.1260
initial_list_statusw                 0.0163  0.0119    1.37    0.1711
collections_12_mths_ex_med          -0.0230  0.0114   -2.03    0.0427
application_typeJoint App            0.0452  0.0116    3.90   <0.0001
acc_now_delinq                       0.0180  0.0120    1.50    0.1342
tot_coll_amt                         0.0154  0.0125    1.23    0.2193
open_acc_6m                         -0.0630  0.0174   -3.61    0.0003
open_act_il                          0.0165  0.0167    0.99    0.3243
open_il_12m                         -0.0291  0.0156   -1.87    0.0613
mths_since_rcnt_il                  -0.0208  0.0137   -1.52    0.1284
total_bal_il                         0.0334  0.0165    2.03    0.0425
il_util                             -0.0053  0.0151   -0.35    0.7259
open_rv_12m                         -0.0010  0.0167   -0.06    0.9540
max_bal_bc                           0.1018  0.0152    6.69   <0.0001
all_util                            -0.0949  0.0188   -5.06   <0.0001
inq_fi                              -0.0578  0.0137   -4.20   <0.0001
total_cu_tl                          0.0550  0.0130    4.23   <0.0001
inq_last_12m                         0.0222  0.0166    1.34    0.1810
avg_cur_bal                          0.1261  0.0229    5.50   <0.0001
bc_open_to_buy                       0.0915  0.0206    4.44   <0.0001
chargeoff_within_12_mths            -0.0159  0.0113   -1.41    0.1588
delinq_amnt                         -0.0136  0.0106   -1.27    0.2025
mo_sin_old_il_acct                  -0.0214  0.0132   -1.62    0.1045
mo_sin_rcnt_rev_tl_op                0.0223  0.0186    1.20    0.2312
mo_sin_rcnt_tl                       0.0067  0.0172    0.39    0.6968
mort_acc                             0.1478  0.0157    9.39   <0.0001
mths_since_recent_bc                 0.0425  0.0161    2.65    0.0081
mths_since_recent_inq                0.0233  0.0144    1.62    0.1056
num_accts_ever_120_pd                0.0096  0.0153    0.63    0.5298
num_actv_bc_tl                      -0.1563  0.0157   -9.96   <0.0001
num_bc_tl                            0.0832  0.0169    4.92   <0.0001
num_il_tl                            0.0373  0.0180    2.07    0.0381
num_tl_90g_dpd_24m                   0.0505  0.0162    3.12    0.0018
pct_tl_nvr_dlq                       0.0310  0.0160    1.93    0.0536
percent_bc_gt_75                    -0.0761  0.0150   -5.07   <0.0001
pub_rec_bankruptcies                 0.0089  0.0225    0.40    0.6911
tax_liens                           -0.0056  0.0252   -0.22    0.8238
is_ny                               -0.0276  0.0117   -2.36    0.0185
is_pa                               -0.0312  0.0113   -2.77    0.0056
is_nj                               -0.0254  0.0115   -2.21    0.0269
is_oh                               -0.0229  0.0115   -2.00    0.0458
is_fl                               -0.0108  0.0118   -0.91    0.3618
is_co                                0.0419  0.0125    3.35    0.0008
is_ga                                0.0292  0.0121    2.42    0.0155
is_va                               -0.0130  0.0117   -1.11    0.2665
is_az                               -0.0042  0.0118   -0.36    0.7205
is_ca                                0.0124  0.0125    0.99    0.3205
>
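To make the Wald Z and Pr(>|Z|) columns concrete, here is a small sketch on simulated data. It uses base R's `glm` instead of `lrm`, and the data, the seed, and the variable names `x1` and `noise` are assumptions made up for illustration. Each Wald statistic is the coefficient divided by its standard error, and the two-tailed p-value is read from the standard normal distribution.

```r
set.seed(42)
n <- 500
x1    <- rnorm(n)  # hypothetical feature that does influence the outcome
noise <- rnorm(n)  # hypothetical feature with no real effect
y <- rbinom(n, 1, plogis(0.5 + 1.2 * x1))

# Logistic regression with base R (stand-in for lrm in this sketch)
fit <- glm(y ~ x1 + noise, family = binomial)
est <- coef(summary(fit))

# Wald Z = coefficient / standard error
z <- est[, "Estimate"] / est[, "Std. Error"]

# Two-tailed p-value under the null hypothesis that the coefficient is zero
p <- 2 * pnorm(-abs(z))

print(cbind(wald_z = z, p_value = p))
```

A small p-value means the data are unlikely under a zero coefficient, so the feature is kept; a large p-value means we cannot distinguish the coefficient from zero.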
We set our two-tailed p-value cutoff at 0.01 and discard every feature whose p-value exceeds this threshold.
> tmp = as.data.frame(anova(model))
> tmp$feature = rownames(tmp)
> tmp = tmp %>% filter(P > 0.01) %>% select(feature, P)
> tmp
                      feature          P
1         emp_length10+ years 0.24294486
2           emp_length2 years 0.97590140
3           emp_length3 years 0.84747663
4           emp_length4 years 0.16323613
5           emp_length5 years 0.92929075
6           emp_length6 years 0.67016558
7           emp_length7 years 0.72718053
8           emp_length8 years 0.98379691
9           emp_length9 years 0.77913329
10         emp_length< 1 year 0.98481178
11          home_ownershipOWN 0.02242193
12                 annual_inc 0.01409361
13         purposecredit_card 0.01339868
14  purposedebt_consolidation 0.02785530
15               purposehouse 0.30682639
16    purposerenewable_energy 0.86272358
17             inq_last_6mths 0.07665570
18                    pub_rec 0.84689291
19                  revol_bal 0.01821084
20                 revol_util 0.12604079
21       initial_list_statusw 0.17105737
22 collections_12_mths_ex_med 0.04272762
23             acc_now_delinq 0.13421249
24               tot_coll_amt 0.21928677
25                open_act_il 0.32432270
26                open_il_12m 0.06133265
27         mths_since_rcnt_il 0.12838908
28               total_bal_il 0.04248475
29                    il_util 0.72592433
30                open_rv_12m 0.95403016
31               inq_last_12m 0.18102960
32   chargeoff_within_12_mths 0.15879401
33                delinq_amnt 0.20247434
34         mo_sin_old_il_acct 0.10452072
35      mo_sin_rcnt_rev_tl_op 0.23123190
36             mo_sin_rcnt_tl 0.69681695
37      mths_since_recent_inq 0.10561320
38      num_accts_ever_120_pd 0.52975942
39                  num_il_tl 0.03809880
40             pct_tl_nvr_dlq 0.05360825
41       pub_rec_bankruptcies 0.69105180
42                  tax_liens 0.82384179
43                      is_ny 0.01850732
44                      is_nj 0.02686627
45                      is_oh 0.04582818
46                      is_fl 0.36182627
47                      is_ga 0.01551415
48                      is_va 0.26648933
49                      is_az 0.72047895
50                      is_ca 0.32050010
>
(Required Library - dplyr)
> data_train = data_train[, !(names(data_train) %in% tmp$feature)]
> rm(model, tmp)
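The filter-and-drop step can be sketched in base R on a toy example. The four p-values below are copied from the model output for the state indicators; the data frame `toy` and its contents are hypothetical and exist only to show the column selection.

```r
# p-values copied from the model output for four state indicators
pvals <- data.frame(
  feature = c("is_ny", "is_pa", "is_fl", "is_co"),
  P       = c(0.0185, 0.0056, 0.3618, 0.0008),
  stringsAsFactors = FALSE
)

cutoff <- 0.01

# Features whose p-value exceeds the cutoff are marked for removal
to_drop <- pvals$feature[pvals$P > cutoff]

# Hypothetical training data with one column per indicator
toy <- data.frame(is_ny = c(0, 1), is_pa = c(1, 0),
                  is_fl = c(0, 0), is_co = c(0, 1))

# Keep only the columns that are not marked for removal
toy_kept <- toy[, !(names(toy) %in% to_drop), drop = FALSE]

print(to_drop)
print(names(toy_kept))
```

The `drop = FALSE` argument is a defensive choice in this sketch: it keeps the result a data frame even if only one column survives the filter.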
Some of the remaining feature names contain characters that are not valid in R variable names (spaces, "<", and "/"). We will replace each of these characters with "_".
library(stringr)  # provides str_replace_all
colnames(data_train) = str_replace_all(colnames(data_train), " ", "_")
colnames(data_train) = str_replace_all(colnames(data_train), "<", "_")
colnames(data_train) = str_replace_all(colnames(data_train), "/", "_")
(Required Library - stringr)
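As a rough alternative sketch, the three replacements can be collapsed into one regular-expression character class with base R's `gsub`. The example names are taken from the model output; the vector `cols` is an assumption standing in for `colnames(data_train)`.

```r
# Example column names as they appear in the model output
cols <- c("term 60 months", "emp_lengthn/a", "emp_length< 1 year", "dti")

# Replace every space, "<", or "/" with an underscore in a single pass
clean <- gsub("[ </]", "_", cols)

print(clean)
```

Names without any of the three characters, such as `dti`, pass through unchanged.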
Our data is now ready for building the predictive model.