Data Cleaning in R - Part 1

Discarding Attributes

LendingClub also provides a data dictionary that contains details of all attributes of out dataset. We can use that dictionary to understand more about the data columns we have and remove columns that may not impact the loan default.

Downloads

Discard Attributes

We can use the data dictionary to identify and discard some attributes which we think are irrelevant or will have little impact on loan default.

discard_column = c("collection_recovery_fee","emp_title",
                   "funded_amnt_inv","id",
                   "installment","last_credit_pull_d",
                   "last_fico_range_high","last_fico_range_low",
                   "last_pymnt_amnt","last_pymnt_d",
                   "loan_amnt","member_id",
                   "next_pymnt_d","num_tl_120dpd_2m",
                   "num_tl_30dpd","out_prncp",
                   "out_prncp_inv","recoveries",
                   "total_pymnt","total_pymnt_inv",
                   "total_rec_int","total_rec_late_fee",
                   "total_rec_prncp","url",
                   "zip_code"
                   )
> data_train = (data_train[,!(names(data_train) %in% discard_column)])
> dim(data_train)
[1] 41909   122
>

Discard Grade Attribute

We will also drop the grade attribute as the grade information is also available in sub_grade.

> data_train$grade = NULL

Discard Columns with Too Many NAs

We will also discard columns with too many NAs. We will discard columns where the percentage is higher than 50% NAs.

> data_train <- data_train[, -which(colMeans(is.na(data_train)) > 0.5)]

For the rest of the columns in NAs, it is reasonable to impute missing values with the median value. We will store median_impute_model for further use.

To do so, we will first store the percentage of NA values in each column in tmp variable.

> tmp = sort(sapply(data_train, function(x) sum(length(which(is.na(x)))))/nrow(data_train),decreasing = TRUE)

> median_impute_model = preProcess(data_train[names(tmp)],method="medianImpute")
> data_train = predict(median_impute_model,data_train)

Before proceeding, we will check our data once more to ensure that ensure that there are no attributes with NA values.

> sort(sapply(data_train, function(x) sum(length(which(is.na(x)))))/nrow(data_train),decreasing = TRUE)
               funded_amnt                       term                   int_rate 
                         0                          0                          0 
                 sub_grade                 emp_length             home_ownership 
                         0                          0                          0 
                annual_inc        verification_status                    issue_d 
                         0                          0                          0 
               loan_status                 pymnt_plan                       desc 
                         0                          0                          0 
                   purpose                      title                 addr_state 
                         0                          0                          0 
                       dti                delinq_2yrs           earliest_cr_line 
                         0                          0                          0 
            inq_last_6mths     mths_since_last_delinq                   open_acc 
                         0                          0                          0 
                   pub_rec                  revol_bal                 revol_util 
                         0                          0                          0 
                 total_acc        initial_list_status collections_12_mths_ex_med 
                         0                          0                          0 
               policy_code           application_type  verification_status_joint 
                         0                          0                          0 
            acc_now_delinq               tot_coll_amt                tot_cur_bal 
                         0                          0                          0 
               open_acc_6m                open_act_il                open_il_12m 
                         0                          0                          0 
               open_il_24m         mths_since_rcnt_il               total_bal_il 
                         0                          0                          0 
                   il_util                open_rv_12m                open_rv_24m 
                         0                          0                          0 
                max_bal_bc                   all_util           total_rev_hi_lim 
                         0                          0                          0 
                    inq_fi                total_cu_tl               inq_last_12m 
                         0                          0                          0 
      acc_open_past_24mths                avg_cur_bal             bc_open_to_buy 
                         0                          0                          0 
                   bc_util   chargeoff_within_12_mths                delinq_amnt 
                         0                          0                          0 
        mo_sin_old_il_acct       mo_sin_old_rev_tl_op      mo_sin_rcnt_rev_tl_op 
                         0                          0                          0 
            mo_sin_rcnt_tl                   mort_acc       mths_since_recent_bc 
                         0                          0                          0 
     mths_since_recent_inq      num_accts_ever_120_pd             num_actv_bc_tl 
                         0                          0                          0 
           num_actv_rev_tl                num_bc_sats                  num_bc_tl 
                         0                          0                          0 
                 num_il_tl              num_op_rev_tl              num_rev_accts 
                         0                          0                          0 
       num_rev_tl_bal_gt_0                   num_sats         num_tl_90g_dpd_24m 
                         0                          0                          0 
        num_tl_op_past_12m             pct_tl_nvr_dlq           percent_bc_gt_75 
                         0                          0                          0 
      pub_rec_bankruptcies                  tax_liens            tot_hi_cred_lim 
                         0                          0                          0 
         total_bal_ex_mort             total_bc_limit total_il_high_credit_limit 
                         0                          0                          0 
             hardship_flag              hardship_type            hardship_reason 
                         0                          0                          0 
           hardship_status        hardship_start_date          hardship_end_date 
                         0                          0                          0 
   payment_plan_start_date       hardship_loan_status        disbursement_method 
                         0                          0                          0 
      debt_settlement_flag  debt_settlement_flag_date          settlement_status 
                         0                          0                          0 
           settlement_date 
                         0 
> 

Transformations

Let’s take a look at the remaining features:

str(data_train)
> str(data_train)
'data.frame':    41909 obs. of  94 variables:
 $ funded_amnt               : int  10000 35000 14400 7250 10000 10000 25000 8400 6950 16000 ...
 $ term                      : chr  " 60 months" " 60 months" " 36 months" " 36 months" ...
 $ int_rate                  : num  19.53 20.75 10.75 13.67 5.32 ...
 $ sub_grade                 : chr  "D5" "E2" "B4" "C3" ...
 $ emp_length                : chr  "4 years" "3 years" "10+ years" "3 years" ...
 $ home_ownership            : chr  "OWN" "MORTGAGE" "MORTGAGE" "MORTGAGE" ...
 $ annual_inc                : num  52000 85000 85000 72000 45000 ...
 $ verification_status       : chr  "Source Verified" "Source Verified" "Not Verified" "Source Verified" ...
 $ issue_d                   : chr  "Mar-2016" "Mar-2016" "Mar-2016" "Mar-2016" ...
 $ loan_status               : chr  "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
 $ pymnt_plan                : chr  "n" "n" "n" "n" ...
 $ desc                      : chr  "" "" "" "" ...
 $ purpose                   : chr  "other" "debt_consolidation" "small_business" "debt_consolidation" ...
 $ title                     : chr  "Other" "Debt consolidation" "Business" "Debt consolidation" ...
 $ addr_state                : chr  "GA" "NY" "NH" "IN" ...
 $ dti                       : num  15 24.98 28.11 23.93 8.03 ...
 $ delinq_2yrs               : int  0 0 0 1 0 0 0 0 0 0 ...
 $ earliest_cr_line          : chr  "Oct-2000" "Dec-2008" "Feb-2000" "Jan-1985" ...
 $ inq_last_6mths            : int  1 1 0 0 0 1 1 1 0 1 ...
 $ mths_since_last_delinq    : num  44 31 72 20 31 31 31 60 55 25 ...
 $ open_acc                  : int  4 9 16 12 7 9 9 7 5 10 ...
 $ pub_rec                   : int  2 0 0 1 1 0 0 0 4 0 ...
 $ revol_bal                 : int  1077 10167 37582 12220 471 10139 47954 11059 7096 9891 ...
 $ revol_util                : chr  "35.9%" "46.6%" "85.8%" "65%" ...
 $ total_acc                 : int  12 19 33 24 15 14 22 22 26 16 ...
 $ initial_list_status       : chr  "w" "w" "w" "w" ...
 $ collections_12_mths_ex_med: int  0 0 0 0 0 0 0 1 0 0 ...
 $ policy_code               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ application_type          : chr  "Individual" "Individual" "Individual" "Individual" ...
 $ verification_status_joint : chr  "" "" "" "" ...
 $ acc_now_delinq            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ tot_coll_amt              : int  622 0 0 0 0 0 0 0 0 2017 ...
 $ tot_cur_bal               : int  15886 161643 190166 366483 11970 42767 431227 157051 7096 22551 ...
 $ open_acc_6m               : num  2 0 0 1 0 1 2 1 0 0 ...
 $ open_act_il               : num  1 3 3 2 1 2 2 1 0 2 ...
 $ open_il_12m               : num  4 0 1 1 0 1 2 1 0 2 ...
 $ open_il_24m               : num  5 3 1 3 1 1 2 1 0 2 ...
 $ mths_since_rcnt_il        : num  2 14 12 3 23 6 6 10 91 9 ...
 $ total_bal_il              : num  14809 73863 22387 40343 11499 ...
 $ il_util                   : num  99 83 47 92 72 73 61 91 76 67 ...
 $ open_rv_12m               : num  0 0 0 1 1 1 1 2 0 3 ...
 $ open_rv_24m               : num  0 1 0 1 2 1 1 2 1 4 ...
 $ max_bal_bc                : num  1007 5109 12211 3694 325 ...
 $ all_util                  : num  88 71 66 84 54 55 60 86 59 50 ...
 $ total_rev_hi_lim          : int  3000 21800 43800 18800 6400 33000 81500 14100 12000 26700 ...
 $ inq_fi                    : num  3 5 0 0 2 2 0 1 0 1 ...
 $ total_cu_tl               : num  0 1 0 0 2 0 1 1 0 1 ...
 $ inq_last_12m              : num  2 2 0 1 0 3 4 3 1 4 ...
 $ acc_open_past_24mths      : int  5 4 1 4 3 2 3 4 1 6 ...
 $ avg_cur_bal               : int  3972 17960 11885 30540 1710 4752 47914 22436 1419 2506 ...
 $ bc_open_to_buy            : num  1623 4833 3393 997 4329 ...
 $ bc_util                   : num  39.9 67.8 87.4 90.1 9.8 50.7 35.8 90.6 69.8 49.5 ...
 $ chargeoff_within_12_mths  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ delinq_amnt               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mo_sin_old_il_acct        : num  101 87 145 132 135 65 258 129 153 113 ...
 $ mo_sin_old_rev_tl_op      : int  185 65 193 374 310 65 246 129 205 184 ...
 $ mo_sin_rcnt_rev_tl_op     : int  25 22 26 8 10 10 4 2 17 7 ...
 $ mo_sin_rcnt_tl            : int  2 14 12 3 10 6 4 2 17 7 ...
 $ mort_acc                  : int  0 1 6 4 2 0 4 2 8 0 ...
 $ mths_since_recent_bc      : num  25 22 32 59 10 10 4 89 31 7 ...
 $ mths_since_recent_inq     : num  4 5 20 9 23 6 4 2 11 0 ...
 $ num_accts_ever_120_pd     : int  2 0 0 3 0 0 0 2 1 2 ...
 $ num_actv_bc_tl            : int  2 3 5 3 2 4 3 2 3 4 ...
 $ num_actv_rev_tl           : int  2 3 11 5 2 4 4 4 4 6 ...
 $ num_bc_sats               : int  2 3 5 3 4 4 5 2 3 5 ...
 $ num_bc_tl                 : int  3 4 9 4 5 4 7 7 7 5 ...
 $ num_il_tl                 : int  7 9 7 8 4 4 8 2 3 7 ...
 $ num_op_rev_tl             : int  3 5 11 9 6 7 6 5 5 8 ...
 $ num_rev_accts             : int  5 9 20 12 9 10 10 12 15 9 ...
 $ num_rev_tl_bal_gt_0       : int  2 3 11 5 2 4 4 4 4 6 ...
 $ num_sats                  : int  4 9 16 12 7 9 9 7 5 10 ...
 $ num_tl_90g_dpd_24m        : int  0 0 0 1 0 0 0 0 0 0 ...
 $ num_tl_op_past_12m        : int  4 0 1 2 1 2 3 3 0 5 ...
 $ pct_tl_nvr_dlq            : num  83.3 100 93.9 83.3 100 100 100 86.4 91.3 93.3 ...
 $ percent_bc_gt_75          : num  50 33.3 100 100 0 0 0 100 33.3 25 ...
 $ pub_rec_bankruptcies      : int  0 0 0 0 1 0 0 0 3 0 ...
 $ tax_liens                 : int  0 0 0 1 0 0 0 0 1 0 ...
 $ tot_hi_cred_lim           : int  18000 193751 232277 374306 22363 77555 499524 166260 12000 45520 ...
 $ total_bal_ex_mort         : int  15886 84030 59969 52563 11970 42767 67575 30290 7096 22551 ...
 $ total_bc_limit            : int  2700 15000 27000 10100 4800 20000 52500 6200 9700 18800 ...
 $ total_il_high_credit_limit: int  15000 87951 47352 43666 15963 44555 32024 21079 0 18820 ...
 $ hardship_flag             : chr  "N" "N" "N" "N" ...
 $ hardship_type             : chr  "" "" "" "" ...
 $ hardship_reason           : chr  "" "" "" "" ...
 $ hardship_status           : chr  "" "" "" "" ...
 $ hardship_start_date       : chr  "" "" "" "" ...
 $ hardship_end_date         : chr  "" "" "" "" ...
 $ payment_plan_start_date   : chr  "" "" "" "" ...
 $ hardship_loan_status      : chr  "" "" "" "" ...
 $ disbursement_method       : chr  "Cash" "Cash" "Cash" "Cash" ...
 $ debt_settlement_flag      : chr  "N" "N" "N" "N" ...
 $ debt_settlement_flag_date : chr  "" "" "" "" ...
 $ settlement_status         : chr  "" "" "" "" ...
 $ settlement_date           : chr  "" "" "" "" ...
>

Remove More Columns

discard_column = c("hardship_flag","hardship_type",
                   "hardship_reason","hardship_status",
                   "hardship_start_date","hardship_end_date",
                   "payment_plan_start_date","hardship_loan_status",
                   "disbursement_method","debt_settlement_flag",
                   "debt_settlement_flag_date","settlement_status",
                   "settlement_date"
                   )
> data_train = (data_train[,!(names(data_train) %in% discard_column)])

revol_util

We will convert revol_util to numeric.

data_train$revol_util = (as.numeric(gsub(pattern = "%",replacement = "",x = data_train$int_rate)))

earliest_cr_line

earliest_cr_line is transformed to the number of days before the loan is issued.

data_train$earliest_cr_line = parse_date_time(str_c("01",data_train$issue_d),"dmy" ) - parse_date_time(str_c("01",data_train$earliest_cr_line),"dmy" )
data_train$earliest_cr_line = as.numeric(data_train$earliest_cr_line,units = "days")

Extract Issued Month

We can see that the default rate doesn’t vary much by the month it is issued. We will drop issue_d

data_train$issue_m = sapply( data_train$issue_d ,function(x){str_split(x,"-")[[1]][1]})
tmp = data_train %>% filter(loan_status=="Default") %>% group_by(issue_m) %>% summarise(default_count = n())
tmp2 = data_train %>% group_by(issue_m) %>% summarise(count = n())
tmp2 %>% left_join(tmp) %>% mutate(default_rate = default_count/count)
Joining, by = "issue_m"
# A tibble: 3 x 4
  issue_m count default_count default_rate
                      
1 Feb     12470          3779        0.303
2 Jan     10664          3090        0.290
3 Mar     18775          5761        0.307
> 
data_train$issue_m = NULL
data_train$issue_d = NULL
rm(tmp,tmp2)

Lesson Resources

All Users

Related Downloads

Membership
Learn the skills required to excel in data science and data analytics covering R, Python, machine learning, and AI.
I WANT TO JOIN
JOIN 30,000 DATA PROFESSIONALS

Free Guides - Getting Started with R and Python

Enter your name and email address below and we will email you the guides for R programming and Python.

Saylient AI Logo

Take the Next Step in Your Data Career

Join our membership for lifetime unlimited access to all our data analytics and data science learning content and resources.