• Skip to primary navigation
  • Skip to main content
  • Skip to primary sidebar
  • Skip to footer
Finance Train

Finance Train

High Quality tutorials for finance, risk, data science

  • Home
  • Data Science
  • CFA® Exam
  • PRM Exam
  • Tutorials
  • Careers
  • Products
  • Login

Data Cleaning in R – Part 5

Data Science, Risk Management

This lesson is part 18 of 28 in the course Credit Risk Modelling in R

Numeric Features

Let’s look at all numeric features we have left.

> str(data_train[getNumericColumns(data_train)])
'data.frame':    41909 obs. of  54 variables:
 $ funded_amnt               : int  10000 35000 14400 7250 10000 10000 25000 8400 6950 16000 ...
 $ annual_inc                : num  52000 85000 85000 72000 45000 ...
 $ dti                       : num  15 24.98 28.11 23.93 8.03 ...
 $ delinq_2yrs               : int  0 0 0 1 0 0 0 0 0 0 ...
 $ earliest_cr_line          : num  5630 2647 5873 11382 9436 ...
 $ inq_last_6mths            : int  1 1 0 0 0 1 1 1 0 1 ...
 $ mths_since_last_delinq    : num  44 31 72 20 31 31 31 60 55 25 ...
 $ pub_rec                   : int  2 0 0 1 1 0 0 0 4 0 ...
 $ revol_bal                 : int  1077 10167 37582 12220 471 10139 47954 11059 7096 9891 ...
 $ revol_util                : num  19.53 20.75 10.75 13.67 5.32 ...
 $ collections_12_mths_ex_med: int  0 0 0 0 0 0 0 1 0 0 ...
 $ acc_now_delinq            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ tot_coll_amt              : int  622 0 0 0 0 0 0 0 0 2017 ...
 $ open_acc_6m               : num  2 0 0 1 0 1 2 1 0 0 ...
 $ open_act_il               : num  1 3 3 2 1 2 2 1 0 2 ...
 $ open_il_12m               : num  4 0 1 1 0 1 2 1 0 2 ...
 $ mths_since_rcnt_il        : num  2 14 12 3 23 6 6 10 91 9 ...
 $ total_bal_il              : num  14809 73863 22387 40343 11499 ...
 $ il_util                   : num  99 83 47 92 72 73 61 91 76 67 ...
 $ open_rv_12m               : num  0 0 0 1 1 1 1 2 0 3 ...
 $ max_bal_bc                : num  1007 5109 12211 3694 325 ...
 $ all_util                  : num  88 71 66 84 54 55 60 86 59 50 ...
 $ inq_fi                    : num  3 5 0 0 2 2 0 1 0 1 ...
 $ total_cu_tl               : num  0 1 0 0 2 0 1 1 0 1 ...
 $ inq_last_12m              : num  2 2 0 1 0 3 4 3 1 4 ...
 $ avg_cur_bal               : int  3972 17960 11885 30540 1710 4752 47914 22436 1419 2506 ...
 $ bc_open_to_buy            : num  1623 4833 3393 997 4329 ...
 $ chargeoff_within_12_mths  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ delinq_amnt               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mo_sin_old_il_acct        : num  101 87 145 132 135 65 258 129 153 113 ...
 $ mo_sin_rcnt_rev_tl_op     : int  25 22 26 8 10 10 4 2 17 7 ...
 $ mo_sin_rcnt_tl            : int  2 14 12 3 10 6 4 2 17 7 ...
 $ mort_acc                  : int  0 1 6 4 2 0 4 2 8 0 ...
 $ mths_since_recent_bc      : num  25 22 32 59 10 10 4 89 31 7 ...
 $ mths_since_recent_inq     : num  4 5 20 9 23 6 4 2 11 0 ...
 $ num_accts_ever_120_pd     : int  2 0 0 3 0 0 0 2 1 2 ...
 $ num_actv_bc_tl            : int  2 3 5 3 2 4 3 2 3 4 ...
 $ num_bc_tl                 : int  3 4 9 4 5 4 7 7 7 5 ...
 $ num_il_tl                 : int  7 9 7 8 4 4 8 2 3 7 ...
 $ num_tl_90g_dpd_24m        : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pct_tl_nvr_dlq            : num  83.3 100 93.9 83.3 100 100 100 86.4 91.3 93.3 ...
 $ percent_bc_gt_75          : num  50 33.3 100 100 0 0 0 100 33.3 25 ...
 $ pub_rec_bankruptcies      : int  0 0 0 0 1 0 0 0 3 0 ...
 $ tax_liens                 : int  0 0 0 1 0 0 0 0 1 0 ...
 $ is_ny                     : num  0 1 0 0 0 0 0 0 0 0 ...
 $ is_pa                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_nj                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_oh                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_fl                     : num  0 0 0 0 1 0 0 0 1 0 ...
 $ is_co                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_ga                     : num  1 0 0 0 0 0 0 1 0 0 ...
 $ is_va                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_az                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ is_ca                     : num  0 0 0 0 0 0 0 0 0 0 ...
>

We will transform annual_inc, revol_bal, avg_cur_bal, bc_open_to_buy by dividing them by funded_amnt (amount of loan).

data_train$annual_inc = data_train$annual_inc/data_train$funded_amnt
data_train$revol_bal = data_train$revol_bal/data_train$funded_amnt
data_train$avg_cur_bal = data_train$avg_cur_bal/data_train$funded_amnt
data_train$bc_open_to_buy = data_train$bc_open_to_buy/data_train$funded_amnt

We can now remove the funded amount attribute.

data_train$funded_amnt = NULL

Character Features

Let’s look at all character features we have left.

> str(data_train[getCharColumns(data_train)])
'data.frame':    41909 obs. of  11 variables:
 $ term                     : chr  " 60 months" " 60 months" " 36 months" " 36 months" ...
 $ sub_grade                : chr  "D5" "E2" "B4" "C3" ...
 $ emp_length               : chr  "4 years" "3 years" "10+ years" "3 years" ...
 $ home_ownership           : chr  "OWN" "MORTGAGE" "MORTGAGE" "MORTGAGE" ...
 $ verification_status      : chr  "Source Verified" "Source Verified" "Not Verified" "Source Verified" ...
 $ loan_status              : chr  "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
 $ pymnt_plan               : chr  "n" "n" "n" "n" ...
 $ purpose                  : chr  "other" "debt_consolidation" "small_business" "debt_consolidation" ...
 $ initial_list_status      : chr  "w" "w" "w" "w" ...
 $ application_type         : chr  "Individual" "Individual" "Individual" "Individual" ...
 $ verification_status_joint: chr  "" "" "" "" ...
>

We will remove verification_status_joint.

data_train$verification_status_joint = NULL

Let us look at home_ownership data.

> table(data_train$home_ownership)
MORTGAGE      OWN     RENT 
   21105     5091    15713

There are only three options, MORTGAGE, OWN and RENT.

Even though there are no other values, we will still create a filter to only take observations that have value “MORTGAGE”,“OWN”, or “RENT” as these are only values that appear.

We see there are 64 loans with pymnt_plan="y", all ended in being default.

> data_train %>% filter(pymnt_plan=="y") %>% select(pymnt_plan, loan_status)
   pymnt_plan loan_status
1           y     Default
2           y     Default
3           y     Default
4           y     Default
5           y     Default
6           y     Default
7           y     Default
8           y     Default
9           y     Default
10          y     Default
11          y     Default

However, in a dataset of 41909 records, 64 is too small a number of records to make a conclusion. For our purpose, we will remove payment_plan

data_train$pymnt_plan = NULL

We will transform all character predictor features to binary dummy variables.

> loan_status = data_train$loan_status
> table(loan_status)
loan_status
   Default Fully Paid 
     12678      29231 
>

As the name implies, the dummyVars function allows you to create dummy variables – in other words it translates text data into numerical data for modeling purposes.

If you are planning on doing predictive analytics or machine learning and want to use regression or any other modeling technique that requires numerical data, you will need to transform your text data into numbers otherwise you run the risk of leaving a lot of information on the table.

> dummy_model = dummyVars(loan_status ~ .,data_train,fullRank = TRUE)
> data_train = as.data.frame(predict(dummy_model,data_train))
> data_train$loan_status = loan_status
> rm(loan_status)
#set loan with status 'Fully Paid' as a positive sample
> data_train$loan_status = ifelse(data_train$loan_status == "Fully Paid","Fully.Paid",data_train$loan_status)
> data_train$loan_status = factor(data_train$loan_status,levels = c("Default",”Fully.Paid"))
Previous Lesson

‹ Data Cleaning in R – Part 3

Next Lesson

Remove Dimensions By Fitting Logistic Regression ›

Join Our Facebook Group - Finance, Risk and Data Science

Posts You May Like

How to Improve your Financial Health

CFA® Exam Overview and Guidelines (Updated for 2021)

Changing Themes (Look and Feel) in ggplot2 in R

Coordinates in ggplot2 in R

Facets for ggplot2 Charts in R (Faceting Layer)

Reader Interactions

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Primary Sidebar

In this Course

  • Credit Risk Modelling – Case Studies
  • Classification vs. Regression Models
  • Case Study – German Credit – Steps to Build a Predictive Model
  • Import Credit Data Set in R
  • German Credit Data : Data Preprocessing and Feature Selection in R
  • Credit Modelling: Training and Test Data Sets
  • Build the Predictive Model
  • Logistic Regression Model in R
  • Measure Model Performance in R Using ROCR Package
  • Create a Confusion Matrix in R
  • Credit Risk Modelling – Case Study- Lending Club Data
  • Explore Loan Data in R – Loan Grade and Interest Rate
  • Credit Risk Modelling – Required R Packages
  • Loan Data – Training and Test Data Sets
  • Data Cleaning in R – Part 1
  • Data Cleaning in R – Part 2
  • Data Cleaning in R – Part 3
  • Data Cleaning in R – Part 5
  • Remove Dimensions By Fitting Logistic Regression
  • Create a Function and Prepare Test Data in R
  • Building Credit Risk Model
  • Credit Risk – Logistic Regression Model in R
  • Support Vector Machine (SVM) Model in R
  • Random Forest Model in R
  • Extreme Gradient Boosting in R
  • Predictive Modelling: Averaging Results from Multiple Models
  • Predictive Modelling: Comparing Model Results
  • How Insurance Companies Calculate Risk

Latest Tutorials

    • Data Visualization with R
    • Derivatives with R
    • Machine Learning in Finance Using Python
    • Credit Risk Modelling in R
    • Quantitative Trading Strategies in R
    • Financial Time Series Analysis in R
    • VaR Mapping
    • Option Valuation
    • Financial Reporting Standards
    • Fraud
Facebook Group

Membership

Unlock full access to Finance Train and see the entire library of member-only content and resources.

Subscribe

Footer

Recent Posts

  • How to Improve your Financial Health
  • CFA® Exam Overview and Guidelines (Updated for 2021)
  • Changing Themes (Look and Feel) in ggplot2 in R
  • Coordinates in ggplot2 in R
  • Facets for ggplot2 Charts in R (Faceting Layer)

Products

  • Level I Authority for CFA® Exam
  • CFA Level I Practice Questions
  • CFA Level I Mock Exam
  • Level II Question Bank for CFA® Exam
  • PRM Exam 1 Practice Question Bank
  • All Products

Quick Links

  • Privacy Policy
  • Contact Us

CFA Institute does not endorse, promote or warrant the accuracy or quality of Finance Train. CFA® and Chartered Financial Analyst® are registered trademarks owned by CFA Institute.

Copyright © 2021 Finance Train. All rights reserved.

  • About Us
  • Privacy Policy
  • Contact Us