Data Cleaning in R - Part 5
Numeric Features
Let’s look at all numeric features we have left.
1> str(data_train[getNumericColumns(data_train)])
2'data.frame': 41909 obs. of 54 variables:
3 $ funded_amnt : int 10000 35000 14400 7250 10000 10000 25000 8400 6950 16000 ...
4 $ annual_inc : num 52000 85000 85000 72000 45000 ...
5 $ dti : num 15 24.98 28.11 23.93 8.03 ...
6 $ delinq_2yrs : int 0 0 0 1 0 0 0 0 0 0 ...
7 $ earliest_cr_line : num 5630 2647 5873 11382 9436 ...
8 $ inq_last_6mths : int 1 1 0 0 0 1 1 1 0 1 ...
9 $ mths_since_last_delinq : num 44 31 72 20 31 31 31 60 55 25 ...
10 $ pub_rec : int 2 0 0 1 1 0 0 0 4 0 ...
11 $ revol_bal : int 1077 10167 37582 12220 471 10139 47954 11059 7096 9891 ...
12 $ revol_util : num 19.53 20.75 10.75 13.67 5.32 ...
13 $ collections_12_mths_ex_med: int 0 0 0 0 0 0 0 1 0 0 ...
14 $ acc_now_delinq : int 0 0 0 0 0 0 0 0 0 0 ...
15 $ tot_coll_amt : int 622 0 0 0 0 0 0 0 0 2017 ...
16 $ open_acc_6m : num 2 0 0 1 0 1 2 1 0 0 ...
17 $ open_act_il : num 1 3 3 2 1 2 2 1 0 2 ...
18 $ open_il_12m : num 4 0 1 1 0 1 2 1 0 2 ...
19 $ mths_since_rcnt_il : num 2 14 12 3 23 6 6 10 91 9 ...
20 $ total_bal_il : num 14809 73863 22387 40343 11499 ...
21 $ il_util : num 99 83 47 92 72 73 61 91 76 67 ...
22 $ open_rv_12m : num 0 0 0 1 1 1 1 2 0 3 ...
23 $ max_bal_bc : num 1007 5109 12211 3694 325 ...
24 $ all_util : num 88 71 66 84 54 55 60 86 59 50 ...
25 $ inq_fi : num 3 5 0 0 2 2 0 1 0 1 ...
26 $ total_cu_tl : num 0 1 0 0 2 0 1 1 0 1 ...
27 $ inq_last_12m : num 2 2 0 1 0 3 4 3 1 4 ...
28 $ avg_cur_bal : int 3972 17960 11885 30540 1710 4752 47914 22436 1419 2506 ...
29 $ bc_open_to_buy : num 1623 4833 3393 997 4329 ...
30 $ chargeoff_within_12_mths : int 0 0 0 0 0 0 0 0 0 0 ...
31 $ delinq_amnt : int 0 0 0 0 0 0 0 0 0 0 ...
32 $ mo_sin_old_il_acct : num 101 87 145 132 135 65 258 129 153 113 ...
33 $ mo_sin_rcnt_rev_tl_op : int 25 22 26 8 10 10 4 2 17 7 ...
34 $ mo_sin_rcnt_tl : int 2 14 12 3 10 6 4 2 17 7 ...
35 $ mort_acc : int 0 1 6 4 2 0 4 2 8 0 ...
36 $ mths_since_recent_bc : num 25 22 32 59 10 10 4 89 31 7 ...
37 $ mths_since_recent_inq : num 4 5 20 9 23 6 4 2 11 0 ...
38 $ num_accts_ever_120_pd : int 2 0 0 3 0 0 0 2 1 2 ...
39 $ num_actv_bc_tl : int 2 3 5 3 2 4 3 2 3 4 ...
40 $ num_bc_tl : int 3 4 9 4 5 4 7 7 7 5 ...
41 $ num_il_tl : int 7 9 7 8 4 4 8 2 3 7 ...
42 $ num_tl_90g_dpd_24m : int 0 0 0 1 0 0 0 0 0 0 ...
43 $ pct_tl_nvr_dlq : num 83.3 100 93.9 83.3 100 100 100 86.4 91.3 93.3 ...
44 $ percent_bc_gt_75 : num 50 33.3 100 100 0 0 0 100 33.3 25 ...
45 $ pub_rec_bankruptcies : int 0 0 0 0 1 0 0 0 3 0 ...
46 $ tax_liens : int 0 0 0 1 0 0 0 0 1 0 ...
47 $ is_ny : num 0 1 0 0 0 0 0 0 0 0 ...
48 $ is_pa : num 0 0 0 0 0 0 0 0 0 0 ...
49 $ is_nj : num 0 0 0 0 0 0 0 0 0 0 ...
50 $ is_oh : num 0 0 0 0 0 0 0 0 0 0 ...
51 $ is_fl : num 0 0 0 0 1 0 0 0 1 0 ...
52 $ is_co : num 0 0 0 0 0 0 0 0 0 0 ...
53 $ is_ga : num 1 0 0 0 0 0 0 1 0 0 ...
54 $ is_va : num 0 0 0 0 0 0 0 0 0 0 ...
55 $ is_az : num 0 0 0 0 0 0 0 0 0 0 ...
56 $ is_ca : num 0 0 0 0 0 0 0 0 0 0 ...
57>
58
We will transform annual_inc
, revol_bal
, avg_cur_bal
, bc_open_to_buy
by dividing them by funded_amnt
(amount of loan).
1data_train$annual_inc = data_train$annual_inc/data_train$funded_amnt
2data_train$revol_bal = data_train$revol_bal/data_train$funded_amnt
3data_train$avg_cur_bal = data_train$avg_cur_bal/data_train$funded_amnt
4data_train$bc_open_to_buy = data_train$bc_open_to_buy/data_train$funded_amnt
5
We can now remove the funded amount attribute.
data_train$funded_amnt = NULL
Character Features
Let’s look at all character features we have left.
1> str(data_train[getCharColumns(data_train)])
2'data.frame': 41909 obs. of 11 variables:
3 $ term : chr " 60 months" " 60 months" " 36 months" " 36 months" ...
4 $ sub_grade : chr "D5" "E2" "B4" "C3" ...
5 $ emp_length : chr "4 years" "3 years" "10+ years" "3 years" ...
6 $ home_ownership : chr "OWN" "MORTGAGE" "MORTGAGE" "MORTGAGE" ...
7 $ verification_status : chr "Source Verified" "Source Verified" "Not Verified" "Source Verified" ...
8 $ loan_status : chr "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
9 $ pymnt_plan : chr "n" "n" "n" "n" ...
10 $ purpose : chr "other" "debt_consolidation" "small_business" "debt_consolidation" ...
11 $ initial_list_status : chr "w" "w" "w" "w" ...
12 $ application_type : chr "Individual" "Individual" "Individual" "Individual" ...
13 $ verification_status_joint: chr "" "" "" "" ...
14>
15
We will remove verification_status_joint
.
data_train$verification_status_joint = NULL
Let us look at home_ownership
data.
1> table(data_train$home_ownership)
2MORTGAGE OWN RENT
3 21105 5091 15713
4
There are only three options, MORTGAGE, OWN and RENT.
Even though there are no other values, we will still create a filter to only take observations that have value “MORTGAGE”,“OWN”, or “RENT” as these are only values that appear.
We see there are 64 loans with pymnt_plan="y"
, all ended in being default.
1> data_train %>% filter(pymnt_plan=="y") %>% select(pymnt_plan, loan_status)
2 pymnt_plan loan_status
31 y Default
42 y Default
53 y Default
64 y Default
75 y Default
86 y Default
97 y Default
108 y Default
119 y Default
1210 y Default
1311 y Default
14
However, in a dataset of 41909 records, 64 is too small a number of records to make a conclusion. For our purpose, we will remove payment_plan
data_train$pymnt_plan = NULL
We will transform all character predictor features to binary dummy variables.
1> loan_status = data_train$loan_status
2> table(loan_status)
3loan_status
4 Default Fully Paid
5 12678 29231
6>
7
As the name implies, the dummyVars
function allows you to create dummy variables - in other words it translates text data into numerical data for modeling purposes.
If you are planning on doing predictive analytics or machine learning and want to use regression or any other modeling technique that requires numerical data, you will need to transform your text data into numbers otherwise you run the risk of leaving a lot of information on the table.
1> dummy_model = dummyVars(loan_status ~ .,data_train,fullRank = TRUE)
2> data_train = as.data.frame(predict(dummy_model,data_train))
3> data_train$loan_status = loan_status
4> rm(loan_status)
5#set loan with status 'Fully Paid' as a positive sample
6> data_train$loan_status = ifelse(data_train$loan_status == "Fully Paid","Fully.Paid",data_train$loan_status)
7> data_train$loan_status = factor(data_train$loan_status,levels = c("Default",”Fully.Paid"))
8