Lessons

- Credit Risk Modelling - Case Studies
- Classification vs. Regression Models
- Case Study - German Credit - Steps to Build a Predictive Model
- Import Credit Data Set in R
- German Credit Data : Data Preprocessing and Feature Selection in R
- Credit Modelling: Training and Test Data Sets
- Build the Predictive Model
- Logistic Regression Model in R
- Measure Model Performance in R Using ROCR Package
- Create a Confusion Matrix in R
- Credit Risk Modelling - Case Study- Lending Club Data
- Explore Loan Data in R - Loan Grade and Interest Rate
- Credit Risk Modelling - Required R Packages
- Loan Data - Training and Test Data Sets
- Data Cleaning in R - Part 1
- Data Cleaning in R - Part 2
- Data Cleaning in R - Part 3
- Data Cleaning in R - Part 5
- Remove Dimensions By Fitting Logistic Regression
- Create a Function and Prepare Test Data in R
- Building Credit Risk Model
- Credit Risk - Logistic Regression Model in R
- Support Vector Machine (SVM) Model in R
- Random Forest Model in R
- Extreme Gradient Boosting in R
- Predictive Modelling: Averaging Results from Multiple Models
- Predictive Modelling: Comparing Model Results
- How Insurance Companies Calculate Risk

# Data Cleaning in R - Part 3

### Default by States

We take a look at default rate for each state. We filter out states that have too small number of loans(less than 1000):

```
tmp = data_train %>% filter(loan_status=="Default") %>% group_by(addr_state) %>% summarise(default_count = n())
tmp2 = data_train %>% group_by(addr_state) %>% summarise(count = n())
tmp3 = tmp2 %>% left_join(tmp) %>% mutate(default_rate = default_count/count)
```

```
> tmp3
# A tibble: 50 x 4
addr_state count default_count default_rate
1 AK 112 42 0.375
2 AL 557 194 0.348
3 AR 315 122 0.387
4 AZ 1090 316 0.290
5 CA 6173 1837 0.298
6 CO 952 222 0.233
7 CT 563 147 0.261
8 DC 86 14 0.163
9 DE 118 31 0.263
10 FL 3014 933 0.310
# ... with 40 more rows
>
```

### Order States by Default Rate

We can order states by default rate to identify states with highest and lowest default rates.

```
#order by highest default rate
high_default = (tmp3 %>% filter(count > 1000) %>% arrange(desc(default_rate)))[1:10,"addr_state"]$addr_state
high_default
[1] "NY" "PA" "NJ" "OH" "FL" "IL" "NC" "MI" "TX" "CA"
```

```
# order by lowest default rate
low_default = (tmp3 %>% filter(count > 1000) %>% arrange((default_rate)))[1:10,"addr_state"]$addr_state
> low_default
[1] "CO" "GA" "VA" "AZ" "CA" "TX" "MI" "NC" "IL" "FL"
>
```

We then create binary variable for 5 highest states and 5 lowest states discard the rest.

# This content is for paid members only.

Join our membership for lifelong unlimited access to all our data science learning content and resources.