Loan Data - Training and Test Data Sets

To build the model, we will divide our data into two datasets: a training set and a testing set. The model will be built using the training set, and we will then run it on the testing set to evaluate how well it performs.

There are many ways to split the data. If we had multi-year data, we could have used some years as training data and other years as testing data. However, our data covers a single period (2016 Q1), so we will take the simple approach of randomly dividing the dataset into a training set and a test set.
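As a rough illustration of the time-based alternative mentioned above, the sketch below splits a small made-up data frame by year. The data frame `loans_multi` and its `issue_year` column are hypothetical names invented for this example. Our actual loandata covers a single quarter, so this approach does not apply to it.

```r
# Toy data frame standing in for a multi-year loan dataset.
# 'loans_multi' and 'issue_year' are hypothetical, used only for illustration.
loans_multi = data.frame(
  loan_id    = 1:8,
  issue_year = c(2013, 2013, 2014, 2014, 2015, 2015, 2016, 2016)
)

# Train on the earlier years, test on the most recent year.
train_multi = loans_multi[loans_multi$issue_year < 2016, ]
test_multi  = loans_multi[loans_multi$issue_year == 2016, ]

nrow(train_multi)  # 6 rows from 2013 to 2015
nrow(test_multi)   # 2 rows from 2016
```

Splitting by time in this way means the model is tested on loans issued after the ones it was trained on, which is closer to how it would be used in practice.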

We can use the "sample" function to randomly select a set of row indexes and then use those indexes to divide the dataset into training and testing sets. The code for this is given below. We use 30% of the data for testing and the remaining 70% for training.

# Randomly sample 30% of the row indexes
> indexes = sample(1:nrow(loandata), size=0.3*nrow(loandata))

# Test set: the sampled rows
> data_test = loandata[indexes,]
> dim(data_test)
[1] 17960   145

# Training set: all remaining rows
> data_train = loandata[-indexes,]
> dim(data_train)
[1] 41909   145
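Because "sample" draws different rows on every run, the split above will not be identical from one session to the next. The sketch below shows one way to make such a split repeatable with "set.seed" and to check that the two sets do not overlap. It runs on a small made-up data frame; the names `toy` and `test_idx` and the seed value are assumptions for illustration, not part of the lesson's code.

```r
# Toy data frame standing in for loandata (hypothetical, 10 rows).
toy = data.frame(id = 1:10, amount = seq(1000, 10000, by = 1000))

# Fix the random seed so the same rows are drawn on every run.
# The seed value 123 is arbitrary.
set.seed(123)

# Draw 30% of the row numbers, rounding down to a whole number of rows.
test_idx = sample(1:nrow(toy), size = floor(0.3 * nrow(toy)))

toy_test  = toy[test_idx, ]
toy_train = toy[-test_idx, ]

# Sanity checks: the two sets cover every row and share none.
nrow(toy_test) + nrow(toy_train) == nrow(toy)      # TRUE
length(intersect(toy_test$id, toy_train$id)) == 0  # TRUE
```

The same pattern should carry over to loandata by calling "set.seed" once before the "sample" call.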

We can now remove the original loandata dataset from R to free up memory.

> rm(loandata)

While building the model, our emphasis will be on avoiding loans that are likely to default, since defaults hurt our ROI. At the same time, we do not want to end up picking only a small number of loans, because we want to make a sizeable investment.

In the next few lessons we will focus on cleaning the dataset and then training the model.
