Data Cleaning in R - Part 5

Numeric Features

Let’s look at all numeric features we have left.

> str(data_train[getNumericColumns(data_train)])
'data.frame':    41909 obs. of  54 variables:
 $ funded_amnt               : int  10000 35000 14400 7250 10000 10000 25000 8400 6950 16000 ...
 $ annual_inc                : num  52000 85000 85000 72000 45000 ...
 $ dti                       : num  15 24.98 28.11 23.93 8.03 ...
 $ delinq_2yrs               : int  0 0 0 1 0 0 0 0 0 0 ...
 $ earliest_cr_line          : num  5630 2647 5873 11382 9436 ...
 $ inq_last_6mths            : int  1 1 0 0 0 1 1 1 0 1 ...
 $ mths_since_last_delinq    : num  44 31 72 20 31 31 31 60 55 25 ...
 $ pub_rec                   : int  2 0 0 1 1 0 0 0 4 0 ...
 $ revol_bal                 : int  1077 10167 37582 12220 471 10139 47954 11059 7096 9891 ...
 $ revol_util                : num  19.53 20.75 10.75 13.67 5.32 ...
 $ collections_12_mths_ex_med: int  0 0 0 0 0 0 0 1 0 0 ...
 $ acc_now_delinq            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ tot_coll_amt              : int  622 0 0 0 0 0 0 0 0 2017 ...
 $ open_acc_6m               : num  2 0 0 1 0 1 2 1 0 0 ...
 $ open_act_il               : num  1 3 3 2 1 2 2 1 0 2 ...
 $ open_il_12m               : num  4 0 1 1 0 1 2 1 0 2 ...
 $ mths_since_rcnt_il        : num  2 14 12 3 23 6 6 10 91 9 ...
 $ total_bal_il              : num  14809 73863 22387 40343 11499 ...
 $ il_util                   : num  99 83 47 92 72 73 61 91 76 67 ...
 $ open_rv_12m               : num  0 0 0 1 1 1 1 2 0 3 ...
 $ max_bal_bc                : num  1007 5109 12211 3694 325 ...
 $ all_util                  : num  88 71 66 84 54 55 60 86 59 50 ...
 $ inq_fi                    : num  3 5 0 0 2 2 0 1 0 1 ...
 $ total_cu_tl               : num  0 1 0 0 2 0 1 1 0 1 ...
 $ inq_last_12m              : num  2 2 0 1 0 3 4 3 1 4 ...
 $ avg_cur_bal               : int  3972 17960 11885 30540 1710 4752 47914 22436 1419 2506 ...
 $ bc_open_to_buy            : num  1623 4833 3393 997 4329 ...
 $ chargeoff_within_12_mths  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ delinq_amnt               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mo_sin_old_il_acct        : num  101 87 145 132 135 65 258 129 153 113 ...
 $ mo_sin_rcnt_rev_tl_op     : int  25 22 26 8 10 10 4 2 17 7 ...
 $ mo_sin_rcnt_tl            : int  2 14 12 3 10 6 4 2 17 7 ...
 $ mort_acc                  : int  0 1 6 4 2 0 4 2 8 0 ...
 $ mths_since_recent_bc      : num  25 22 32 59 10 10 4 89 31 7 ...
 $ mths_since_recent_inq     : num  4 5 20 9 23 6 4 2 11 0 ...
 $ num_accts_ever_120_pd     : int  2 0 0 3 0 0 0 2 1 2 ...
 $ num_actv_bc_tl            : int  2 3 5 3 2 4 3 2 3 4 ...
 $ num_bc_tl                 : int  3 4 9 4 5 4 7 7 7 5 ...
 $ num_il_tl                 : int  7 9 7 8 4 4 8 2 3 7 ...
 $ num_tl_90g_dpd_24m        : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pct_tl_nvr_dlq            : num  83.3 100 93.9 83.3 100 100 100 86.4 91.3 93.3 ...
 $ percent_bc_gt_75          : num  50 33.3 100 100 0 0 0 100 33.3 25 ...
 $ pub_rec_bankruptcies      : int  0 0 0 0 1 0 0 0 3 0 ...
 $ tax_liens                 : int  0 0 0 1 0 0 0 0 1 0 ...
 $ is_ny                     : num  0 1 0 0 0 0 0 0 0 0 ...
 $ is_pa                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_nj                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_oh                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_fl                     : num  0 0 0 0 1 0 0 0 1 0 ...
 $ is_co                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_ga                     : num  1 0 0 0 0 0 0 1 0 0 ...
 $ is_va                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_az                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_ca                     : num  0 0 0 0 0 0 0 0 0 0 ...
>
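
The getNumericColumns() helper is not defined in this part. As a rough sketch only, a function with that name could be written as below. The body is an assumption about how it behaves, not the original definition, and the toy data frame is invented for illustration.

```r
# Hypothetical stand-in for getNumericColumns(); the real definition
# is not shown in this part, so this body is an assumption.
getNumericColumns <- function(df) {
  # is.numeric() is TRUE for both integer and double columns
  names(df)[vapply(df, is.numeric, logical(1))]
}

# Toy data frame (values invented for illustration)
toy <- data.frame(
  funded_amnt = c(10000L, 35000L),
  dti         = c(15, 24.98),
  grade       = c("A", "B"),
  stringsAsFactors = FALSE
)

# Select only the numeric columns and inspect their structure
numeric_cols <- getNumericColumns(toy)
str(toy[numeric_cols])
```

Because is.numeric() accepts both int and num columns, a helper like this would return the mix of integer and double features seen in the listing, while skipping character columns such as grade.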

We will transform annual_inc, revol_bal, avg_cur_bal and bc_open_to_buy by dividing them by funded_amnt (the loan amount), so that each one is expressed relative to the size of the loan.

# Express each dollar amount as a multiple of the loan amount
data_train$annual_inc = data_train$annual_inc/data_train$funded_amnt
data_train$revol_bal = data_train$revol_bal/data_train$funded_amnt
data_train$avg_cur_bal = data_train$avg_cur_bal/data_train$funded_amnt
data_train$bc_open_to_buy = data_train$bc_open_to_buy/data_train$funded_amnt
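
To make the arithmetic concrete, here is a small sketch of the same division on a toy data frame built from the first three rows of the listing above. The names toy and ratio_cols are introduced only for this illustration, and the loop is an alternative way to write the four assignments, not the method used in this series.

```r
# Toy copy of the first three rows shown in the str() output
toy <- data.frame(
  funded_amnt    = c(10000L, 35000L, 14400L),
  annual_inc     = c(52000, 85000, 85000),
  revol_bal      = c(1077L, 10167L, 37582L),
  avg_cur_bal    = c(3972L, 17960L, 11885L),
  bc_open_to_buy = c(1623, 4833, 3393)
)

# Columns to rescale (hypothetical helper vector for this sketch)
ratio_cols <- c("annual_inc", "revol_bal", "avg_cur_bal", "bc_open_to_buy")

# Divide each column element-wise by the loan amount
for (col in ratio_cols) {
  toy[[col]] <- toy[[col]] / toy$funded_amnt
}

# First borrower: 52000 / 10000 = 5.2 times the loan amount
print(toy$annual_inc[1])
```

Note that the integer columns become doubles after the division, and that funded_amnt has to stay in the data frame until every ratio has been computed.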

We can now remove the funded amount attribute.

data_train$funded_amnt = NULL

Character Features

Let’s look at all character features we have left.
