# Remove Dimensions By Fitting Logistic Regression

We will use the preProcess function from the caret package to center and scale (standardize) the data. The center transform calculates the mean of an attribute and subtracts it from each value. The scale transform calculates the standard deviation of an attribute and divides each value by that standard deviation.
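To make the two transforms concrete, here is a minimal base R sketch on a toy vector. The values in `x` are made up for illustration and are not taken from the loan data. The sketch assumes the sample standard deviation is used, which is what base R's `sd()` and `scale()` compute.

```r
# Toy numeric attribute (hypothetical values, not taken from the loan data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Center: subtract the attribute's mean from each value
centered <- x - mean(x)

# Scale: divide each value by the attribute's sample standard deviation
standardized <- centered / sd(x)

# Base R's scale() applies the same two steps, so the results should agree
reference <- as.numeric(scale(x, center = TRUE, scale = TRUE))
print(all.equal(standardized, reference))

# After the transform the attribute has mean ~0 and standard deviation 1
print(c(mean = mean(standardized), sd = sd(standardized)))
```

Putting every attribute on the same scale makes the fitted coefficients easier to compare with each other.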

We can try to reduce the number of dimensions further by fitting a logistic regression and inspecting the p-values of its coefficients. The null hypothesis is that a feature makes no contribution to the predictive model (its coefficient is zero). We then discard each feature for which we fail to reject this hypothesis.

library(caret)  # provides preProcess
library(rms)    # provides lrm
trans_model = preProcess(data_train,method=c("center","scale"))
data_train = predict(trans_model, data_train)
model = lrm(loan_status ~ .,data_train)


(Required libraries: caret, rms)
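The test behind each p-value can be sketched on simulated data. This is only an illustration under stated assumptions: it uses base R's `glm()` rather than `lrm()`, and the variables `signal`, `noise` and `y` are hypothetical, not part of the loan data. The Wald statistic is the coefficient estimate divided by its standard error, and the two-tailed p-value comes from the standard normal distribution.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

# Simulated data (hypothetical): 'signal' drives the outcome, 'noise' does not
n <- 2000
signal <- rnorm(n)
noise <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * signal))

# Fit a logistic regression with base R's glm() instead of rms::lrm()
fit <- glm(y ~ signal + noise, family = binomial)
coefs <- summary(fit)$coefficients

# Wald statistic: estimate divided by its standard error
z <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-tailed p-value under the null hypothesis that the coefficient is zero
p <- 2 * pnorm(-abs(z))

print(cbind(z = z, p = p))

# The hand-computed p-values should match the ones reported by summary()
print(all.equal(p, coefs[, "Pr(>|z|)"]))
```

A feature that truly drives the outcome gets a large |z| and a tiny p-value, while an unrelated feature usually does not.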

Let's take a look at our model.

> model
Logistic Regression Model
lrm(formula = loan_status ~ ., data = data_train)
Model Likelihood     Discrimination    Rank Discrim.
Ratio Test           Indexes           Indexes
Obs         41909    LR chi2    6127.08    R2       0.193    C       0.731
Default    12630    d.f.           116    g        1.093    Dxy     0.463
Fully.Paid 29279    Pr(> chi2) <0.0001    gr       2.983    gamma   0.463
max |deriv| 3e-12                          gp       0.194    tau-a   0.195
Brier    0.181
Coef    S.E.   Wald Z Pr(>|Z|)
Intercept                           1.0124 0.0127  79.92 <0.0001
term 60 months                     -0.1358 0.0137  -9.93 <0.0001
emp_length10+ years                 0.0273 0.0234   1.17 0.2429
emp_length2 years                  -0.0005 0.0167  -0.03 0.9759
emp_length3 years                   0.0031 0.0163   0.19 0.8475
emp_length4 years                   0.0212 0.0152   1.39 0.1632
emp_length5 years                  -0.0013 0.0152  -0.09 0.9293
emp_length6 years                   0.0061 0.0142   0.43 0.6702
emp_length7 years                   0.0048 0.0138   0.35 0.7272
emp_length8 years                   0.0003 0.0146   0.02 0.9838
emp_length9 years                   0.0040 0.0143   0.28 0.7791
emp_length< 1 year                  0.0003 0.0160   0.02 0.9848
emp_lengthn/a                      -0.1120 0.0159  -7.05 <0.0001
home_ownershipOWN                  -0.0288 0.0126  -2.28 0.0224
home_ownershipRENT                 -0.1624 0.0147 -11.04 <0.0001
annual_inc                          0.0633 0.0258   2.45 0.0141
verification_statusSource Verified -0.0386 0.0144  -2.68 0.0073
verification_statusVerified        -0.0457 0.0144  -3.17 0.0015
purposecredit_card                 -0.1365 0.0552  -2.47 0.0134
purposedebt_consolidation          -0.1427 0.0649  -2.20 0.0279
purposehome_improvement            -0.1006 0.0347  -2.90 0.0037
purposehouse                       -0.0145 0.0142  -1.02 0.3068
purposemajor_purchase              -0.0729 0.0226  -3.23 0.0012
purposemedical                     -0.0627 0.0178  -3.53 0.0004
purposemoving                      -0.0400 0.0153  -2.62 0.0088
purposeother                       -0.0868 0.0320  -2.71 0.0066
purposerenewable_energy            -0.0020 0.0116  -0.17 0.8627
purposevacation                    -0.0498 0.0153  -3.26 0.0011
dti                                -0.1527 0.0144 -10.61 <0.0001
delinq_2yrs                        -0.0756 0.0187  -4.05 <0.0001
earliest_cr_line                   -0.0635 0.0139  -4.58 <0.0001
inq_last_6mths                     -0.0262 0.0148  -1.77 0.0767
mths_since_last_delinq              0.0379 0.0136   2.78 0.0054
pub_rec                            -0.0063 0.0326  -0.19 0.8469
revol_bal                          -0.0349 0.0148  -2.36 0.0182
revol_util                          0.2766 0.1808   1.53 0.1260
initial_list_statusw                0.0163 0.0119   1.37 0.1711
collections_12_mths_ex_med         -0.0230 0.0114  -2.03 0.0427
application_typeJoint App           0.0452 0.0116   3.90 <0.0001
acc_now_delinq                      0.0180 0.0120   1.50 0.1342
tot_coll_amt                        0.0154 0.0125   1.23 0.2193
open_acc_6m                        -0.0630 0.0174  -3.61 0.0003
open_act_il                         0.0165 0.0167   0.99 0.3243
open_il_12m                        -0.0291 0.0156  -1.87 0.0613
mths_since_rcnt_il                 -0.0208 0.0137  -1.52 0.1284
total_bal_il                        0.0334 0.0165   2.03 0.0425
il_util                            -0.0053 0.0151  -0.35 0.7259
open_rv_12m                        -0.0010 0.0167  -0.06 0.9540
max_bal_bc                          0.1018 0.0152   6.69 <0.0001
all_util                           -0.0949 0.0188  -5.06 <0.0001
inq_fi                             -0.0578 0.0137  -4.20 <0.0001
total_cu_tl                         0.0550 0.0130   4.23 <0.0001
inq_last_12m                        0.0222 0.0166   1.34 0.1810
avg_cur_bal                         0.1261 0.0229   5.50 <0.0001
chargeoff_within_12_mths           -0.0159 0.0113  -1.41 0.1588
delinq_amnt                        -0.0136 0.0106  -1.27 0.2025
mo_sin_old_il_acct                 -0.0214 0.0132  -1.62 0.1045
mo_sin_rcnt_rev_tl_op               0.0223 0.0186   1.20 0.2312
mo_sin_rcnt_tl                      0.0067 0.0172   0.39 0.6968
mort_acc                            0.1478 0.0157   9.39 <0.0001
mths_since_recent_bc                0.0425 0.0161   2.65 0.0081
mths_since_recent_inq               0.0233 0.0144   1.62 0.1056
num_accts_ever_120_pd               0.0096 0.0153   0.63 0.5298
num_actv_bc_tl                     -0.1563 0.0157  -9.96 <0.0001
num_bc_tl                           0.0832 0.0169   4.92 <0.0001
num_il_tl                           0.0373 0.0180   2.07 0.0381
num_tl_90g_dpd_24m                  0.0505 0.0162   3.12 0.0018
pct_tl_nvr_dlq                      0.0310 0.0160   1.93 0.0536
percent_bc_gt_75                   -0.0761 0.0150  -5.07 <0.0001
pub_rec_bankruptcies                0.0089 0.0225   0.40 0.6911
tax_liens                          -0.0056 0.0252  -0.22 0.8238
is_ny                              -0.0276 0.0117  -2.36 0.0185
is_pa                              -0.0312 0.0113  -2.77 0.0056
is_nj                              -0.0254 0.0115  -2.21 0.0269
is_oh                              -0.0229 0.0115  -2.00 0.0458
is_fl                              -0.0108 0.0118  -0.91 0.3618
is_co                               0.0419 0.0125   3.35 0.0008
is_ga                               0.0292 0.0121   2.42 0.0155
is_va                              -0.0130 0.0117  -1.11 0.2665
is_az                              -0.0042 0.0118  -0.36 0.7205
is_ca                               0.0124 0.0125   0.99 0.3205
>


We set our two-tailed p-value cutoff at 0.01 and discard every feature whose p-value exceeds this threshold. The filtering code uses the dplyr package (filter, select and the %>% pipe).

> tmp = as.data.frame(anova(model))
> tmp$feature = rownames(tmp)
> tmp = tmp %>% filter(P > 0.01) %>% select(feature,P)
> tmp
feature                             P
1  emp_length10+ years        0.24294486
2  emp_length2 years          0.97590140
3  emp_length3 years          0.84747663
4  emp_length4 years          0.16323613
5  emp_length5 years          0.92929075
6  emp_length6 years          0.67016558
7  emp_length7 years          0.72718053
8  emp_length8 years          0.98379691
9  emp_length9 years          0.77913329
10 emp_length< 1 year         0.98481178
11 home_ownershipOWN          0.02242193
12 annual_inc                 0.01409361
13 purposecredit_card         0.01339868
14 purposedebt_consolidation  0.02785530
15 purposehouse               0.30682639
16 purposerenewable_energy    0.86272358
17 inq_last_6mths             0.07665570
18 pub_rec                    0.84689291
19 revol_bal                  0.01821084
20 revol_util                 0.12604079
21 initial_list_statusw       0.17105737
22 collections_12_mths_ex_med 0.04272762
23 acc_now_delinq             0.13421249
24 tot_coll_amt               0.21928677
25 open_act_il                0.32432270
26 open_il_12m                0.06133265
27 mths_since_rcnt_il         0.12838908
28 total_bal_il               0.04248475
29 il_util                    0.72592433
30 open_rv_12m                0.95403016
31 inq_last_12m               0.18102960
32 chargeoff_within_12_mths   0.15879401
33 delinq_amnt                0.20247434
34 mo_sin_old_il_acct         0.10452072
35 mo_sin_rcnt_rev_tl_op      0.23123190
36 mo_sin_rcnt_tl             0.69681695
37 mths_since_recent_inq      0.10561320
38 num_accts_ever_120_pd      0.52975942
39 num_il_tl                  0.03809880
40 pct_tl_nvr_dlq             0.05360825
41 pub_rec_bankruptcies       0.69105180
42 tax_liens                  0.82384179
43 is_ny                      0.01850732
44 is_nj                      0.02686627
45 is_oh                      0.04582818
46 is_fl                      0.36182627
47 is_ga                      0.01551415
48 is_va                      0.26648933
49 is_az                      0.72047895
50 is_ca                      0.32050010
>
> data_train = (data_train[,!(names(data_train) %in% tmp$feature)])
> rm(model,tmp)
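The discard step above can also be sketched in base R without dplyr. The p-values below are copied (rounded) from the model output for five features. The data frame `toy` is a hypothetical stand-in for data_train, filled with placeholder zeros.

```r
# A few p-values copied (rounded) from the model output
p_values <- c(
  annual_inc             = 0.0141,
  mths_since_last_delinq = 0.0054,
  pub_rec                = 0.8469,
  is_pa                  = 0.0056,
  is_ny                  = 0.0185
)

cutoff <- 0.01

# Features whose p-value exceeds the cutoff are discarded
discard <- names(p_values)[p_values > cutoff]
print(discard)

# Hypothetical stand-in for data_train: one placeholder column per feature
toy <- as.data.frame(matrix(0, nrow = 2, ncol = length(p_values),
                            dimnames = list(NULL, names(p_values))))

# Keep only the columns that are not in the discard list
toy <- toy[, !(names(toy) %in% discard), drop = FALSE]
print(names(toy))
```

Using `drop = FALSE` keeps the result a data frame even if only one column survives the filter.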


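Some feature names contain characters that are not valid in R variable names (spaces, "<" and "/"). We will replace these characters with "_" using str_replace_all from the stringr package.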

colnames(data_train) = str_replace_all(colnames(data_train)," ","_")
colnames(data_train) = str_replace_all(colnames(data_train),"<","_")
colnames(data_train) = str_replace_all(colnames(data_train),"/","_")
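As an alternative sketch, the three replacements can be done in one pass with base R's `gsub()` and a regular-expression character class. The example names are taken from the model output. The vectors `old_names` and `new_names` are illustrative only.

```r
# Column names as they appear in the model output
old_names <- c("term 60 months", "emp_lengthn/a", "emp_length< 1 year")

# The character class [ </] matches a space, "<" or "/" in a single pass
new_names <- gsub("[ </]", "_", old_names)
print(new_names)
# "term_60_months" "emp_lengthn_a" "emp_length__1_year"
```

To apply this to the data itself, the same call can be used on colnames(data_train).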


Our data is now ready for building the predictive model.
