We will use the preProcess function from the caret package to center and scale (normalize) the data. The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation. The center transform calculates the mean for an attribute and subtracts it from each value. We can try to reduce the number of dimensions […]
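As a rough illustration of the center and scale transforms described in this excerpt, here is a minimal sketch in base R. It is not the post's own code: the toy data frame and its column names are made up for the example, and base R's scale() stands in for caret's preProcess so that the snippet runs without extra packages.

```r
# Hypothetical toy data; these are not the LendingClub columns.
df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(10, 20, 30, 40, 50))

# Center: subtract each column's mean.
# Scale: divide each value by the column's standard deviation.
normalized <- as.data.frame(
  lapply(df, function(col) (col - mean(col)) / sd(col))
)

# Base R's scale() applies the same two steps column by column.
reference <- as.data.frame(scale(df))

colMeans(normalized)    # approximately 0 for every column
sapply(normalized, sd)  # 1 for every column
```

After the transform every column has mean 0 and standard deviation 1, so attributes measured on very different scales become comparable.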
Data Science
Data Cleaning in R – Part 5
Numeric Features Let’s look at all numeric features we have left. We will transform annual_inc, revol_bal, avg_cur_bal, bc_open_to_buy by dividing them by funded_amnt (amount of loan). We can now remove the funded amount attribute. Character Features Let’s look at all character features we have left. We will remove verification_status_joint. Let us look at home_ownership data. There are only three options, MORTGAGE, OWN and RENT. Even […]
Data Cleaning in R – Part 3
Default by States We take a look at the default rate for each state. We filter out states that have too few loans (fewer than 1000): Order States by Default Rate We can order states by default rate to identify states with the highest and lowest default rates. We then create a binary variable for 5 highest […]
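The per-state steps in this excerpt (compute the default rate, drop states with fewer than 1000 loans, order by rate) could be sketched in base R as follows. The data is simulated and the column names addr_state and default are assumptions for the example, not taken from the post.

```r
# Simulated loans: one row per loan with a state code and a 0/1 default flag.
# Column names and values are hypothetical.
set.seed(0)
loans <- data.frame(
  addr_state = sample(c("CA", "NY", "TX", "WY"), 5000, replace = TRUE,
                      prob = c(0.40, 0.30, 0.25, 0.05)),
  default = rbinom(5000, 1, 0.2)
)

# Number of loans and default rate per state.
n_loans <- tapply(loans$default, loans$addr_state, length)
default_rate <- tapply(loans$default, loans$addr_state, mean)
by_state <- data.frame(
  state = names(n_loans),
  n_loans = as.vector(n_loans),
  default_rate = as.vector(default_rate)
)

# Keep only states with at least 1000 loans, then order by default rate.
by_state <- by_state[by_state$n_loans >= 1000, ]
by_state <- by_state[order(by_state$default_rate, decreasing = TRUE), ]
by_state
```

The first rows of by_state are then the states with the highest default rates, and the last rows those with the lowest.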
Data Cleaning in R – Part 2
Attributes with Zero Variance Datasets can sometimes contain attributes (predictors) that have near-zero variance, or may have just one value. Such variables are considered to have little predictive power. Apart from being uninformative, these predictors may also sometimes break the model that you are trying to fit to your data. This can occur, for example, […]
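A minimal base R sketch of the simplest case, attributes that hold just one value, is shown here. The toy data frame and its column names are hypothetical. Detecting near-zero (rather than exactly zero) variance needs a frequency-based rule; the caret package offers nearZeroVar for that, which this sketch does not reproduce.

```r
# Hypothetical toy data with one constant column.
df <- data.frame(
  loan_amnt = c(5000, 12000, 8000, 20000),
  policy_code = c(1, 1, 1, 1),
  term = c("36 months", "60 months", "36 months", "36 months")
)

# Count the distinct values in each column.
n_distinct_values <- sapply(df, function(col) length(unique(col)))

# Columns with a single distinct value carry no information.
constant_cols <- names(n_distinct_values)[n_distinct_values <= 1]

# Drop them, keeping the result a data frame.
cleaned <- df[, setdiff(names(df), constant_cols), drop = FALSE]
names(cleaned)
```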
Data Cleaning in R – Part 1
Discarding Attributes LendingClub also provides a data dictionary that contains details of all attributes of our dataset. We can use that dictionary to understand more about the data columns we have and remove columns that may not impact the loan default. Discard Attributes We can use the data dictionary to identify and discard some attributes […]
Loan Data – Training and Test Data Sets
For building the model, we will divide our data into two different data sets, namely training and testing datasets. The model will be built using the training set and then we will test it on the testing set to evaluate how our model is performing. There are many ways in which we can split the […]
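One common way to make such a split is a random sample of row indices, sketched here in base R. The 75/25 ratio, the seed, and the toy data are assumptions for illustration; the excerpt does not say which method or proportion the post uses.

```r
# Hypothetical toy data standing in for the loan data.
set.seed(0)
loans <- data.frame(id = 1:100, default = rbinom(100, 1, 0.2))

# Randomly pick 75% of the row indices for training (assumed ratio).
train_idx <- sample(seq_len(nrow(loans)), size = floor(0.75 * nrow(loans)))

# The model is fitted on train and evaluated on the held-out test rows.
train <- loans[train_idx, ]
test <- loans[-train_idx, ]

nrow(train)  # 75
nrow(test)   # 25
```

Setting the seed before sampling makes the split reproducible from one run to the next.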
Credit Risk Modelling – Required R Packages
During our analysis, we will make use of various R packages. So, let’s look at what these packages are, then install and load them in R. dplyr ‘dplyr’ provides a set of tools for efficiently manipulating datasets in R. The problem in most data analyses is the time it takes for you to figure […]
Explore Loan Data in R – Loan Grade and Interest Rate
There is no set path to how one would go about analyzing a data set. Typically, a data scientist would spend quite some time exploring and observing the data to understand it well. Let’s look at some of the attributes in our dataset and see their relationship with the default rate. For each loan, we […]
Explore Financial Data in R
Now that we have the data file in our working directory, we can load it in our R session and start exploring it. Use the following command to load the data into R. The “stringsAsFactors = FALSE” parameter turns off the automatic conversion of character strings to factors. Let’s set the seed to 0 so […]
Credit Risk Modelling – Case Study – Lending Club Data
To build a good model, it is important to use high quality data. For the purpose of this course, we will use the loan data available from LendingClub’s website. LendingClub is a US peer-to-peer lending company which matches borrowers with investors willing to fund their loans. LendingClub provides rich datasets which are available to […]