# Data Cleaning in R - Part 1

### Discarding Attributes

LendingClub also provides a data dictionary that describes every attribute in our dataset. We can use it to understand the columns we have and to remove those that are unlikely to affect loan default.

Using the data dictionary, we identify the attributes that we consider irrelevant or unlikely to affect loan default, collect their names in a vector, and then drop those columns from the training data.

```
# Names of the columns to drop from the training data
discard_column <- c(
  "collection_recovery_fee", "emp_title",
  "funded_amnt_inv", "id",
  "installment", "last_credit_pull_d",
  "last_fico_range_high", "last_fico_range_low",
  "last_pymnt_amnt", "last_pymnt_d",
  "loan_amnt", "member_id",
  "next_pymnt_d", "num_tl_120dpd_2m",
  "num_tl_30dpd", "out_prncp",
  "out_prncp_inv", "recoveries",
  "total_pymnt", "total_pymnt_inv",
  "total_rec_int", "total_rec_late_fee",
  "total_rec_prncp", "url",
  "zip_code"
)
```

```
> data_train <- data_train[, !(names(data_train) %in% discard_column)]
> dim(data_train)
[1] 41909 122
```
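To see what the `%in%` filter does, here is a minimal sketch on a toy data frame. The names `toy_train`, `toy_discard`, `missing_cols` and `toy_clean` are hypothetical and the values are invented for illustration; they are not part of the LendingClub data. The sketch also checks for discard names that do not exist in the data, since a misspelled name would otherwise be ignored silently.

```r
# Toy stand-in for data_train; values are invented for illustration only
toy_train <- data.frame(
  id        = 1:3,
  loan_amnt = c(5000, 12000, 8000),
  int_rate  = c(10.5, 13.2, 7.9),
  sub_grade = c("B2", "C1", "A4"),
  stringsAsFactors = FALSE
)

# zip_code is deliberately absent from toy_train to show the check below
toy_discard <- c("id", "loan_amnt", "zip_code")

# Discard names that do not match any column (often a typo in the list)
missing_cols <- setdiff(toy_discard, names(toy_train))

# Keep only the columns whose names are not in the discard list;
# drop = FALSE keeps the result a data frame even if one column remains
toy_clean <- toy_train[, !(names(toy_train) %in% toy_discard), drop = FALSE]

print(missing_cols)
print(names(toy_clean))
```

Because the filter is built from column names rather than positions, it keeps working if the column order changes between the training and test files.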

### Discard Grade Attribute

We will also drop the `grade` attribute, as the grade information is also available in `sub_grade`.
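One possible way to drop a single column is to assign `NULL` to it. The sketch below uses a hypothetical toy frame, `toy_train`, with invented values rather than the real `data_train`. It first checks the assumption that each `sub_grade` starts with the letter of its `grade`, which is what makes `grade` redundant.

```r
# Toy stand-in for data_train; values are invented for illustration only
toy_train <- data.frame(
  grade     = c("B", "C", "A"),
  sub_grade = c("B2", "C1", "A4"),
  int_rate  = c(10.5, 13.2, 7.9),
  stringsAsFactors = FALSE
)

# Assumption being checked: the first character of sub_grade is the grade
grade_is_redundant <- all(substr(toy_train$sub_grade, 1, 1) == toy_train$grade)

# Assigning NULL removes the column from the data frame
if (grade_is_redundant) {
  toy_train$grade <- NULL
}

print(names(toy_train))
```

The same assignment applied to `data_train` would remove its `grade` column; running the redundancy check on the real data first is a cheap safeguard.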
