Data Cleaning in R - Part 3

Default by States

We take a look at the default rate for each state. States with too few loans (fewer than 1000) are filtered out in the next step, when we rank the states. The default rate per state is computed as follows:

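# Number of defaulted loans per state
tmp = data_train %>% filter(loan_status=="Default") %>% group_by(addr_state) %>% summarise(default_count = n())
# Total number of loans per state
tmp2 = data_train %>% group_by(addr_state) %>% summarise(count = n())
# Default rate = defaulted loans / total loans (join key stated explicitly)
tmp3 = tmp2 %>% left_join(tmp, by = "addr_state") %>% mutate(default_rate = default_count/count)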
> tmp3
# A tibble: 50 x 4
   addr_state count default_count default_rate
   <chr>      <int>         <int>        <dbl>
 1 AK           112            42        0.375
 2 AL           557           194        0.348
 3 AR           315           122        0.387
 4 AZ          1090           316        0.290
 5 CA          6173          1837        0.298
 6 CO           952           222        0.233
 7 CT           563           147        0.261
 8 DC            86            14        0.163
 9 DE           118            31        0.263
10 FL          3014           933        0.310
# ... with 40 more rows
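
The same table can also be built in a single grouped summary. The sketch below is only an illustration: toy_train is a made-up stand-in for data_train, and its values are hypothetical. One difference from the join above is that a state with no defaulted loans gets a count of 0 and a rate of 0 instead of NA.

```r
library(dplyr)

# Toy stand-in for data_train (hypothetical values, for illustration only)
toy_train <- data.frame(
  addr_state  = c("CA", "CA", "CA", "CA", "NY", "NY", "DC", "DC"),
  loan_status = c("Default", "Fully Paid", "Fully Paid", "Default",
                  "Default", "Fully Paid", "Fully Paid", "Fully Paid"),
  stringsAsFactors = FALSE
)

# One grouped summary instead of two summaries plus a join
state_rates <- toy_train %>%
  group_by(addr_state) %>%
  summarise(
    count         = n(),                             # all loans in the state
    default_count = sum(loan_status == "Default"),   # defaulted loans (0 if none)
    default_rate  = mean(loan_status == "Default")   # share of defaulted loans
  )

print(state_rates)
```

Here mean() of a logical vector gives the fraction of TRUE values, so default_rate equals default_count/count without a separate division.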

Order States by Default Rate

We can order the states by default rate to identify those with the highest and lowest default rates.

# order by highest default rate

high_default = tmp3 %>% filter(count > 1000) %>% arrange(desc(default_rate)) %>% head(10) %>% pull(addr_state)

> high_default
 [1] "NY" "PA" "NJ" "OH" "FL" "IL" "NC" "MI" "TX" "CA"

# order by lowest default rate

low_default = tmp3 %>% filter(count > 1000) %>% arrange(default_rate) %>% head(10) %>% pull(addr_state)

> low_default
 [1] "CO" "GA" "VA" "AZ" "CA" "TX" "MI" "NC" "IL" "FL"

We then create binary variables for the 5 states with the highest default rates and the 5 states with the lowest, and discard the rest.
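
One possible way to do this is sketched below. It is not necessarily the approach used in this series, and it rests on two assumptions: that one indicator column is created per group of states (rather than one per state), and that "discard the rest" means dropping the raw addr_state column afterwards. The names toy_loans, top5, bottom5, high_default_state and low_default_state are hypothetical.

```r
library(dplyr)

# State rankings copied from the output shown earlier
high_default <- c("NY", "PA", "NJ", "OH", "FL", "IL", "NC", "MI", "TX", "CA")
low_default  <- c("CO", "GA", "VA", "AZ", "CA", "TX", "MI", "NC", "IL", "FL")

# Keep only the first 5 states of each ranking
top5    <- head(high_default, 5)
bottom5 <- head(low_default, 5)

# Toy stand-in for the loan data (hypothetical rows, for illustration only)
toy_loans <- data.frame(
  addr_state = c("NY", "CO", "WA", "FL", "VA"),
  stringsAsFactors = FALSE
)

# Assumption: one 0/1 flag per group of states, then drop the raw state column
toy_flags <- toy_loans %>%
  mutate(
    high_default_state = as.integer(addr_state %in% top5),
    low_default_state  = as.integer(addr_state %in% bottom5)
  ) %>%
  select(-addr_state)

print(toy_flags)
```

A state that is in neither group, such as WA in the toy data, gets 0 in both columns, so the remaining states are treated as a single baseline category.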
