Now we will tune the Random Forest model. As with SVM, we tune the parameter on a 5% downsample of the data; the procedure is exactly the same as for the SVM model. Below we have reproduced the code for the Random Forest model. The best parameter is mtry (the number of predictors sampled at each split) = 2. As with SVM, we fit a 10% downsample of the data with this […]
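The tuning step described above can be sketched with caret's cross-validated grid search. This is a minimal sketch, not the post's exact code: the data frame name `samp` (the 5% downsample), the outcome column `loan_status`, and the candidate mtry values are assumptions.

```r
# Sketch: tune mtry for a Random Forest on the 5% downsample (assumed
# to be in 'samp' with outcome column 'loan_status').
library(caret)
library(randomForest)

set.seed(100)
ctrl <- trainControl(method = "cv", number = 5)
rf_tune <- train(loan_status ~ ., data = samp,
                 method = "rf",
                 trControl = ctrl,
                 tuneGrid = expand.grid(mtry = c(2, 4, 8, 16)))
rf_tune$bestTune  # the text reports mtry = 2 as the winner
```

The grid values here are illustrative; any small set of candidate mtry values around sqrt(number of predictors) would be a reasonable starting point.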

## Support Vector Machine (SVM) Model in R

A support vector machine (SVM) is a supervised learning technique that analyzes data and isolates patterns applicable to both classification and regression. The classifier is useful for choosing between two or more possible outcomes that depend on continuous or categorical predictor variables. Based on training and sample classification data, the SVM algorithm assigns the target […]
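A minimal sketch of fitting such a classifier in R, assuming the same downsampled data frame `samp` and outcome column `loan_status` as in the rest of the series (the radial kernel and tuneLength are assumptions, not the post's confirmed settings):

```r
# Sketch: SVM with a radial-basis kernel via caret (requires kernlab).
library(caret)

set.seed(100)
svm_fit <- train(loan_status ~ ., data = samp,
                 method = "svmRadial",
                 preProcess = c("center", "scale"),   # SVMs are scale-sensitive
                 trControl = trainControl(method = "cv", number = 5),
                 tuneLength = 5)  # tries 5 values of the cost parameter C
```

Centering and scaling inside `train()` matters here because SVM decision boundaries depend on feature magnitudes.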

## Credit Risk – Logistic Regression Model in R

To build our first model, we will fit a Logistic Regression to our training dataset. First we set the seed (to any number; we have chosen 100) so that we can reproduce our results. Then we create a downsampled dataset called samp, which contains an equal number of Default and Fully Paid loans. We can use the table() function to check that the downsampling […]
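The downsampling-and-fit steps above can be sketched as follows; the data frame name `train_data` and outcome column `loan_status` are assumptions:

```r
# Sketch: balance the classes, verify, then fit a logistic regression.
library(caret)

set.seed(100)
# downSample() keeps all minority-class rows and randomly drops
# majority-class rows until the two classes are equal in size.
samp <- downSample(x = train_data[, setdiff(names(train_data), "loan_status")],
                   y = train_data$loan_status,
                   yname = "loan_status")
table(samp$loan_status)  # both classes should now have equal counts

log_fit <- glm(loan_status ~ ., data = samp, family = binomial)
```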

## Building Credit Risk Model

Loan data typically contain a much higher proportion of good loans than defaults. We could achieve high accuracy just by labeling every loan as Fully Paid: on our test data, this naive strategy alone yields 70.3% accuracy. Recall that we have yet to include the outcome of ‘Current’ loans. In a real situation, the ratio […]
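The naive baseline is worth computing explicitly, since any model must beat it to be useful. A one-line sketch (assuming `test_data` with outcome column `loan_status`):

```r
# Baseline: predict "Fully Paid" for every loan and measure accuracy.
baseline_acc <- mean(test_data$loan_status == "Fully Paid")
baseline_acc  # the text reports roughly 0.703 on this test set
```

This is why accuracy alone is a poor metric for imbalanced data; sensitivity on the Default class is the number to watch.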

## Create a Function and Prepare Test Data in R

When we build the model, the test data must contain the same set of columns as the training data, and every transformation we applied to the training data must also be applied to the test data. We will now take our test data and apply our data transformations to it. […]
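One way to keep the two datasets in sync is to wrap the transformations in a single function and call it on both. This is a hypothetical helper, not the post's actual function; the body's transformation step is left as a placeholder:

```r
# Hypothetical helper: restrict to the training columns and repeat the
# same feature transformations on any new data frame.
prepare_data <- function(df, kept_cols) {
  df <- df[, intersect(kept_cols, names(df)), drop = FALSE]
  # ... repeat here the exact transformations used on the training data ...
  df
}

test_data <- prepare_data(test_data, kept_cols = names(train_data))
```

Defining the pipeline once avoids the common bug where a transformation is applied to the training set but silently skipped on the test set.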

## Remove Dimensions By Fitting Logistic Regression

We will use the preProcess function from the caret package to center and scale (normalize) the data. The scale transform calculates the standard deviation of an attribute and divides each value by it; the center transform calculates the mean of an attribute and subtracts it from each value. We can then try to reduce the number of dimensions […]
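The center-and-scale step looks like this in caret; the key point is that the means and standard deviations are estimated on the training data only and then reused on the test data (data frame names are assumptions):

```r
# Sketch: fit the preprocessing on training data, apply to both sets.
library(caret)

pp <- preProcess(train_data, method = c("center", "scale"))
train_scaled <- predict(pp, train_data)
test_scaled  <- predict(pp, test_data)  # reuses training means/sds
```

Fitting preProcess on the combined data would leak test-set information into training, so the split above is deliberate.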

## Data Cleaning in R – Part 5

Numeric Features: Let’s look at all the numeric features we have left. We will transform annual_inc, revol_bal, avg_cur_bal and bc_open_to_buy by dividing them by funded_amnt (the loan amount); we can then remove the funded amount attribute. Character Features: Let’s look at all the character features we have left. We will remove verification_status_joint. Let us look at the home_ownership data. There are only three options: MORTGAGE, OWN and RENT. Even […]
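The ratio transformation described above can be sketched in a few lines (the data frame name `loan_data` is an assumption; the column names come from the text):

```r
# Sketch: express balance-type features relative to the loan amount,
# then drop the columns the text says to remove.
ratio_cols <- c("annual_inc", "revol_bal", "avg_cur_bal", "bc_open_to_buy")
loan_data[ratio_cols] <- loan_data[ratio_cols] / loan_data$funded_amnt

loan_data$funded_amnt <- NULL               # no longer needed as a raw feature
loan_data$verification_status_joint <- NULL # removed per the text
```

Dividing by funded_amnt turns absolute dollar amounts into ratios, so a $50k income means something comparable for a $5k loan and a $35k loan.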

## Data Cleaning in R – Part 3

Default by States: We take a look at the default rate for each state, filtering out states with too few loans (fewer than 1,000). Order States by Default Rate: We can order states by default rate to identify the states with the highest and lowest default rates. We then create binary variables for the 5 highest […]
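A dplyr sketch of the state-level computation; `loan_data`, the state column `addr_state`, and the outcome labels are assumptions:

```r
# Sketch: per-state default rate, filtered to states with >= 1000 loans,
# then a binary flag for the 5 worst states.
library(dplyr)

state_default <- loan_data %>%
  group_by(addr_state) %>%
  summarise(n = n(),
            default_rate = mean(loan_status == "Default")) %>%
  filter(n >= 1000) %>%
  arrange(desc(default_rate))

top5 <- head(state_default$addr_state, 5)
loan_data$high_default_state <- as.integer(loan_data$addr_state %in% top5)
```

The 1,000-loan cutoff keeps small-sample states from dominating the ranking by chance.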

## Data Cleaning in R – Part 2

Attributes with Zero Variance: Datasets can sometimes contain attributes (predictors) that have near-zero variance, or may have just one value. Such variables are considered to have little predictive power. Apart from being uninformative, these predictors may also sometimes break the model you are trying to fit to your data. This can occur, for example, […]
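caret provides nearZeroVar() for exactly this check; a minimal sketch (data frame name assumed):

```r
# Sketch: find and drop near-zero-variance predictors.
library(caret)

nzv <- nearZeroVar(loan_data)  # column indices of near-zero-variance predictors
if (length(nzv) > 0) {
  loan_data <- loan_data[, -nzv]
}
```

A constant column is a concrete example of the breakage the text mentions: it contributes no information, and some fitting routines fail when asked to invert a matrix containing it.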

## Data Cleaning in R – Part 1

Discarding Attributes: LendingClub also provides a data dictionary that contains details of all attributes of our dataset. We can use that dictionary to understand more about the data columns we have and to identify and discard columns that are unlikely to affect loan default […]
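Dropping a hand-picked list of columns is a one-liner in base R. The column names below are illustrative examples only; the actual list comes from reading the data dictionary:

```r
# Sketch: drop columns judged irrelevant to default (names are examples).
drop_cols <- c("id", "member_id", "url", "desc")
loan_data <- loan_data[, setdiff(names(loan_data), drop_cols)]
```

Using setdiff() rather than negative indexing keeps the line safe even if some listed columns are already absent.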