Remove Dimensions By Fitting Logistic Regression

We will use the preProcess function from the caret package to center and scale (normalize) the data. The center transform calculates the mean of an attribute and subtracts it from each value. The scale transform calculates the standard deviation of an attribute and divides each value by that standard deviation.
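As a minimal sketch of what these two transforms do to a single attribute, the snippet below uses base R only. The vector x is made-up illustration data, not part of the loan dataset, and base R's scale() is used here as a stand-in for the caret transform.

```r
# Toy attribute (values are made up for illustration)
x <- c(10, 20, 30, 40, 50)

# Center: subtract the mean; scale: divide by the standard deviation
x_manual <- (x - mean(x)) / sd(x)

# Base R's scale() performs the same two steps
x_scaled <- as.numeric(scale(x, center = TRUE, scale = TRUE))

print(round(x_manual, 4))
print(all.equal(x_manual, x_scaled))
```

After the transformation the attribute has mean 0 and standard deviation 1, which puts the coefficients of different features on a comparable footing.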

We can try to reduce the number of dimensions further by fitting a logistic regression and examining the p-values of the coefficients. The null hypothesis is that a feature makes no contribution to the predictive model (its coefficient is zero). We then discard each feature for which we fail to reject this hypothesis.

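library(caret)  # provides preProcess
library(rms)    # provides lrm
trans_model = preProcess(data_train,method=c("center","scale"))  # learn means and standard deviations
data_train = predict(trans_model, data_train)                    # apply centering and scaling
model = lrm(loan_status ~ .,data_train)                          # fit the logistic regression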

Let's take a look at our model.

> model
Logistic Regression Model
 lrm(formula = loan_status ~ ., data = data_train)
                       Model Likelihood     Discrimination    Rank Discrim.    
                          Ratio Test           Indexes           Indexes       
 Obs         41909    LR chi2    6127.08    R2       0.193    C       0.731    
  Default    12630    d.f.           116    g        1.093    Dxy     0.463    
  Fully.Paid 29279    Pr(> chi2) <0.0001    gr       2.983    gamma   0.463    
 max |deriv| 3e-12                          gp       0.194    tau-a   0.195    
                                            Brier    0.181                     
                                    Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept                           1.0124 0.0127  79.92 <0.0001 
 term 60 months                     -0.1358 0.0137  -9.93 <0.0001 
 sub_gradeA2                        -0.0921 0.0232  -3.97 <0.0001 
 sub_gradeA3                        -0.0820 0.0245  -3.35 0.0008  
 sub_gradeA4                        -0.0910 0.0260  -3.50 0.0005  
 sub_gradeA5                        -0.1751 0.0298  -5.88 <0.0001 
 sub_gradeB1                        -0.2126 0.0347  -6.12 <0.0001 
 sub_gradeB2                        -0.2266 0.0404  -5.61 <0.0001 
 sub_gradeB3                        -0.2881 0.0452  -6.37 <0.0001 
 sub_gradeB4                        -0.3081 0.0538  -5.73 <0.0001 
 sub_gradeB5                        -0.3517 0.0593  -5.93 <0.0001 
 sub_gradeC1                        -0.3983 0.0652  -6.11 <0.0001 
 sub_gradeC2                        -0.4362 0.0696  -6.27 <0.0001 
 sub_gradeC3                        -0.4365 0.0736  -5.93 <0.0001 
 sub_gradeC4                        -0.4644 0.0782  -5.94 <0.0001 
 sub_gradeC5                        -0.4586 0.0788  -5.82 <0.0001 
 sub_gradeD1                        -0.4120 0.0782  -5.27 <0.0001 
 sub_gradeD2                        -0.4141 0.0766  -5.41 <0.0001 
 sub_gradeD3                        -0.3961 0.0731  -5.42 <0.0001 
 sub_gradeD4                        -0.3982 0.0795  -5.01 <0.0001 
 sub_gradeD5                        -0.3800 0.0765  -4.97 <0.0001 
 sub_gradeE1                        -0.4097 0.0789  -5.19 <0.0001 
 sub_gradeE2                        -0.3699 0.0753  -4.91 <0.0001 
 sub_gradeE3                        -0.3582 0.0750  -4.78 <0.0001 
 sub_gradeE4                        -0.3420 0.0747  -4.58 <0.0001 
 sub_gradeE5                        -0.3449 0.0713  -4.84 <0.0001 
 sub_gradeF1                        -0.3061 0.0668  -4.58 <0.0001 
 sub_gradeF2                        -0.2922 0.0622  -4.69 <0.0001 
 sub_gradeF3                        -0.2441 0.0544  -4.49 <0.0001 
 sub_gradeF4                        -0.2390 0.0542  -4.41 <0.0001 
 sub_gradeF5                        -0.2593 0.0553  -4.69 <0.0001 
 sub_gradeG1                        -0.1844 0.0452  -4.08 <0.0001 
 sub_gradeG2                        -0.1563 0.0391  -4.00 <0.0001 
 sub_gradeG3                        -0.1486 0.0368  -4.04 <0.0001 
 sub_gradeG4                        -0.1447 0.0339  -4.27 <0.0001 
 sub_gradeG5                        -0.1375 0.0355  -3.88 0.0001  
 emp_length10+ years                 0.0273 0.0234   1.17 0.2429  
 emp_length2 years                  -0.0005 0.0167  -0.03 0.9759  
 emp_length3 years                   0.0031 0.0163   0.19 0.8475  
 emp_length4 years                   0.0212 0.0152   1.39 0.1632  
 emp_length5 years                  -0.0013 0.0152  -0.09 0.9293  
 emp_length6 years                   0.0061 0.0142   0.43 0.6702  
 emp_length7 years                   0.0048 0.0138   0.35 0.7272  
 emp_length8 years                   0.0003 0.0146   0.02 0.9838  
 emp_length9 years                   0.0040 0.0143   0.28 0.7791  
 emp_length< 1 year                  0.0003 0.0160   0.02 0.9848  
 emp_lengthn/a                      -0.1120 0.0159  -7.05 <0.0001 
 home_ownershipOWN                  -0.0288 0.0126  -2.28 0.0224  
 home_ownershipRENT                 -0.1624 0.0147 -11.04 <0.0001 
 annual_inc                          0.0633 0.0258   2.45 0.0141  
 verification_statusSource Verified -0.0386 0.0144  -2.68 0.0073  
 verification_statusVerified        -0.0457 0.0144  -3.17 0.0015  
 purposecredit_card                 -0.1365 0.0552  -2.47 0.0134  
 purposedebt_consolidation          -0.1427 0.0649  -2.20 0.0279  
 purposehome_improvement            -0.1006 0.0347  -2.90 0.0037  
 purposehouse                       -0.0145 0.0142  -1.02 0.3068  
 purposemajor_purchase              -0.0729 0.0226  -3.23 0.0012  
 purposemedical                     -0.0627 0.0178  -3.53 0.0004  
 purposemoving                      -0.0400 0.0153  -2.62 0.0088  
 purposeother                       -0.0868 0.0320  -2.71 0.0066  
 purposerenewable_energy            -0.0020 0.0116  -0.17 0.8627  
 purposesmall_business              -0.0780 0.0166  -4.70 <0.0001 
 purposevacation                    -0.0498 0.0153  -3.26 0.0011  
 dti                                -0.1527 0.0144 -10.61 <0.0001 
 delinq_2yrs                        -0.0756 0.0187  -4.05 <0.0001 
 earliest_cr_line                   -0.0635 0.0139  -4.58 <0.0001 
 inq_last_6mths                     -0.0262 0.0148  -1.77 0.0767  
 mths_since_last_delinq              0.0379 0.0136   2.78 0.0054  
 pub_rec                            -0.0063 0.0326  -0.19 0.8469  
 revol_bal                          -0.0349 0.0148  -2.36 0.0182  
 revol_util                          0.2766 0.1808   1.53 0.1260  
 initial_list_statusw                0.0163 0.0119   1.37 0.1711  
 collections_12_mths_ex_med         -0.0230 0.0114  -2.03 0.0427  
 application_typeJoint App           0.0452 0.0116   3.90 <0.0001 
 acc_now_delinq                      0.0180 0.0120   1.50 0.1342  
 tot_coll_amt                        0.0154 0.0125   1.23 0.2193  
 open_acc_6m                        -0.0630 0.0174  -3.61 0.0003  
 open_act_il                         0.0165 0.0167   0.99 0.3243  
 open_il_12m                        -0.0291 0.0156  -1.87 0.0613  
 mths_since_rcnt_il                 -0.0208 0.0137  -1.52 0.1284  
 total_bal_il                        0.0334 0.0165   2.03 0.0425  
 il_util                            -0.0053 0.0151  -0.35 0.7259  
 open_rv_12m                        -0.0010 0.0167  -0.06 0.9540  
 max_bal_bc                          0.1018 0.0152   6.69 <0.0001 
 all_util                           -0.0949 0.0188  -5.06 <0.0001 
 inq_fi                             -0.0578 0.0137  -4.20 <0.0001 
 total_cu_tl                         0.0550 0.0130   4.23 <0.0001 
 inq_last_12m                        0.0222 0.0166   1.34 0.1810  
 avg_cur_bal                         0.1261 0.0229   5.50 <0.0001 
 bc_open_to_buy                      0.0915 0.0206   4.44 <0.0001 
 chargeoff_within_12_mths           -0.0159 0.0113  -1.41 0.1588  
 delinq_amnt                        -0.0136 0.0106  -1.27 0.2025  
 mo_sin_old_il_acct                 -0.0214 0.0132  -1.62 0.1045  
 mo_sin_rcnt_rev_tl_op               0.0223 0.0186   1.20 0.2312  
 mo_sin_rcnt_tl                      0.0067 0.0172   0.39 0.6968  
 mort_acc                            0.1478 0.0157   9.39 <0.0001 
 mths_since_recent_bc                0.0425 0.0161   2.65 0.0081  
 mths_since_recent_inq               0.0233 0.0144   1.62 0.1056  
 num_accts_ever_120_pd               0.0096 0.0153   0.63 0.5298  
 num_actv_bc_tl                     -0.1563 0.0157  -9.96 <0.0001 
 num_bc_tl                           0.0832 0.0169   4.92 <0.0001 
 num_il_tl                           0.0373 0.0180   2.07 0.0381  
 num_tl_90g_dpd_24m                  0.0505 0.0162   3.12 0.0018  
 pct_tl_nvr_dlq                      0.0310 0.0160   1.93 0.0536  
 percent_bc_gt_75                   -0.0761 0.0150  -5.07 <0.0001 
 pub_rec_bankruptcies                0.0089 0.0225   0.40 0.6911  
 tax_liens                          -0.0056 0.0252  -0.22 0.8238  
 is_ny                              -0.0276 0.0117  -2.36 0.0185  
 is_pa                              -0.0312 0.0113  -2.77 0.0056  
 is_nj                              -0.0254 0.0115  -2.21 0.0269  
 is_oh                              -0.0229 0.0115  -2.00 0.0458  
 is_fl                              -0.0108 0.0118  -0.91 0.3618  
 is_co                               0.0419 0.0125   3.35 0.0008  
 is_ga                               0.0292 0.0121   2.42 0.0155  
 is_va                              -0.0130 0.0117  -1.11 0.2665  
 is_az                              -0.0042 0.0118  -0.36 0.7205  
 is_ca                               0.0124 0.0125   0.99 0.3205  
>
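To see where the last two columns of the printout come from, here is a small worked sketch for a single row. It assumes the usual Wald test, in which the coefficient divided by its standard error is compared against a standard normal distribution. The two inputs are the rounded values printed for annual_inc, so the result only approximates the reported p-value.

```r
# Values copied from the annual_inc row of the printed model
coef_est <- 0.0633
std_err  <- 0.0258

# Wald Z statistic: the estimate divided by its standard error
wald_z <- coef_est / std_err

# Two-tailed p-value under the standard normal distribution
p_value <- 2 * pnorm(-abs(wald_z))

print(round(wald_z, 2))
print(round(p_value, 4))
```

The result is close to the Wald Z of 2.45 and the p-value of 0.0141 shown for annual_inc. It is above 0.01, so the test would not reject the null hypothesis for this feature at the 1% level.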

We set our two-tailed p-value cutoff at 0.01 and discard every feature whose p-value exceeds this threshold.

> library(dplyr)  # provides %>%, filter and select
> tmp = as.data.frame(anova(model))
> tmp$feature = rownames(tmp)
> tmp = tmp %>% filter(P > 0.01) %>% select(feature,P)
> tmp
                      feature          P
1         emp_length10+ years 0.24294486
2           emp_length2 years 0.97590140
3           emp_length3 years 0.84747663
4           emp_length4 years 0.16323613
5           emp_length5 years 0.92929075
6           emp_length6 years 0.67016558
7           emp_length7 years 0.72718053
8           emp_length8 years 0.98379691
9           emp_length9 years 0.77913329
10         emp_length< 1 year 0.98481178
11          home_ownershipOWN 0.02242193
12                 annual_inc 0.01409361
13         purposecredit_card 0.01339868
14  purposedebt_consolidation 0.02785530
15               purposehouse 0.30682639
16    purposerenewable_energy 0.86272358
17             inq_last_6mths 0.07665570
18                    pub_rec 0.84689291
19                  revol_bal 0.01821084
20                 revol_util 0.12604079
21       initial_list_statusw 0.17105737
22 collections_12_mths_ex_med 0.04272762
23             acc_now_delinq 0.13421249
24               tot_coll_amt 0.21928677
25                open_act_il 0.32432270
26                open_il_12m 0.06133265
27         mths_since_rcnt_il 0.12838908
28               total_bal_il 0.04248475
29                    il_util 0.72592433
30                open_rv_12m 0.95403016
31               inq_last_12m 0.18102960
32   chargeoff_within_12_mths 0.15879401
33                delinq_amnt 0.20247434
34         mo_sin_old_il_acct 0.10452072
35      mo_sin_rcnt_rev_tl_op 0.23123190
36             mo_sin_rcnt_tl 0.69681695
37      mths_since_recent_inq 0.10561320
38      num_accts_ever_120_pd 0.52975942
39                  num_il_tl 0.03809880
40             pct_tl_nvr_dlq 0.05360825
41       pub_rec_bankruptcies 0.69105180
42                  tax_liens 0.82384179
43                      is_ny 0.01850732
44                      is_nj 0.02686627
45                      is_oh 0.04582818
46                      is_fl 0.36182627
47                      is_ga 0.01551415
48                      is_va 0.26648933
49                      is_az 0.72047895
50                      is_ca 0.32050010
>
> data_train = (data_train[,!(names(data_train) %in% tmp$feature)])
> rm(model,tmp)
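The column-dropping step can be sketched on a toy example in base R. Everything here is hypothetical: toy_train and p_table are stand-ins for data_train and tmp, the feature values are made up, and the p-value for dti is a placeholder for the "<0.0001" in the printout.

```r
# Hypothetical training data with three feature columns and the target
toy_train <- data.frame(
  dti         = c(0.5, -1.2, 0.3),
  pub_rec     = c(0.0, 1.1, -0.4),
  annual_inc  = c(1.3, -0.2, 0.8),
  loan_status = c("Fully.Paid", "Default", "Fully.Paid")
)

# Illustrative p-value table; dti's value stands in for "<0.0001"
p_table <- data.frame(
  feature = c("dti", "pub_rec", "annual_inc"),
  P       = c(0.00001, 0.84689291, 0.01409361)
)

# Features whose p-value exceeds the 0.01 cutoff
to_drop <- p_table$feature[p_table$P > 0.01]

# Keep every column whose name is not in the drop list
toy_train <- toy_train[, !(names(toy_train) %in% to_drop), drop = FALSE]

print(names(toy_train))
```

The target column is kept because it never appears in the p-value table. Passing drop = FALSE keeps the result a data frame even if only one column remains.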

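Some of the remaining feature names contain characters (spaces, "<", "/") that are not valid in R variable names. We will replace these characters with "_" using str_replace_all from the stringr package.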

colnames(data_train) = str_replace_all(colnames(data_train)," ","_")
colnames(data_train) = str_replace_all(colnames(data_train),"<","_")
colnames(data_train) = str_replace_all(colnames(data_train),"/","_")
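As an alternative sketch, the three replacements can be collapsed into one call with base R's gsub and a regular-expression character class. The names in old_names are a few of the columns that pass the p-value filter above, and the vector itself is a hypothetical stand-in for colnames(data_train).

```r
# Column names that survive the p-value filter in the text above
old_names <- c("term 60 months", "emp_lengthn/a",
               "verification_statusSource Verified", "dti")

# A character class matches a space, "<", or "/" in one pass
new_names <- gsub("[ </]", "_", old_names)

print(new_names)
```

Names without any of the three characters, such as dti, pass through unchanged.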

Our data is now ready for building the predictive model.
