Data Cleaning in R - Part 1

Discarding Attributes

LendingClub also provides a data dictionary that describes every attribute in our dataset. We can use it to understand the columns we have and to remove those that are unlikely to affect loan default.

Discard Attributes

Using the data dictionary, we identify the attributes that we judge to be irrelevant to loan default, or likely to have little impact on it, and discard them.

# Names of the columns to discard
discard_column <- c("collection_recovery_fee", "emp_title",
                    "funded_amnt_inv", "id",
                    "installment", "last_credit_pull_d",
                    "last_fico_range_high", "last_fico_range_low",
                    "last_pymnt_amnt", "last_pymnt_d",
                    "loan_amnt", "member_id",
                    "next_pymnt_d", "num_tl_120dpd_2m",
                    "num_tl_30dpd", "out_prncp",
                    "out_prncp_inv", "recoveries",
                    "total_pymnt", "total_pymnt_inv",
                    "total_rec_int", "total_rec_late_fee",
                    "total_rec_prncp", "url",
                    "zip_code")
# Keep only the columns whose names are not in discard_column
data_train <- data_train[, !(names(data_train) %in% discard_column)]
dim(data_train)
# [1] 41909   122
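
Because %in% silently ignores names that do not match any column, it can be worth checking the discard list against the data frame before dropping anything. The following is a minimal sketch on a toy data frame; the toy values, the variable names toy, discard_toy and toy_clean, and the deliberately misspelled "zipcode" entry are all made up for illustration and are not part of the LendingClub data.

```r
# Toy data frame standing in for data_train (hypothetical values)
toy <- data.frame(id = 1:3,
                  url = c("a", "b", "c"),
                  zip_code = c("940xx", "100xx", "606xx"),
                  int_rate = c(10.5, 13.2, 7.9))

# "zipcode" is deliberately misspelled to show what the check catches
discard_toy <- c("id", "url", "zipcode")

# Names in the discard list that are not columns of the data frame
missing_cols <- setdiff(discard_toy, names(toy))
print(missing_cols)

# Drop the matching columns; drop = FALSE keeps a data frame
# even if only one column remains
toy_clean <- toy[, !(names(toy) %in% discard_toy), drop = FALSE]
print(names(toy_clean))
```

Here the misspelled name is reported by setdiff, and zip_code survives in toy_clean because it was never matched. An empty result from setdiff would indicate that every listed column exists.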

Discard Grade Attribute

We will also drop the grade attribute, since the same information is already contained in sub_grade.
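
One possible way to do this, not necessarily the approach used in the lesson, is sketched below on a toy data frame. It assumes that each sub_grade value is the grade letter followed by a digit (for example "B1"); the toy values and the variable name toy are hypothetical.

```r
# Toy data frame standing in for data_train (hypothetical values)
toy <- data.frame(grade = c("A", "B", "C"),
                  sub_grade = c("A4", "B1", "C5"),
                  int_rate = c(7.9, 10.5, 13.2),
                  stringsAsFactors = FALSE)

# The first character of sub_grade reproduces grade, so grade is redundant
stopifnot(all(substr(toy$sub_grade, 1, 1) == toy$grade))

# Remove the grade column by assigning NULL to it
toy$grade <- NULL
print(names(toy))
```

Running the same redundancy check on the real data before dropping grade would confirm that no information is lost.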
