Credit Modelling: Training and Test Data Sets

To build the model, we divide our data into two sets: a training set and a testing set. The model is built on the training set and then evaluated on the testing set to see how well it performs on data it has not seen.

There are many ways in which we can split the data.

We can use the “sample” function to randomly select row index numbers and then use the selected index numbers to divide the dataset into training and testing sets. The code below does this, using 30% of the data for testing and the remaining 70% for training.

# Sample Indexes
> indexes = sample(1:nrow(creditdata_new), size=0.3*nrow(creditdata_new))
# Split data
> credit_test = creditdata_new[indexes,]
> credit_train = creditdata_new[-indexes,]
> dim(credit_test)
[1] 300  18
> dim(credit_train)
[1] 700  18
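
Because sample draws rows at random, the split changes on every run unless a seed is fixed first. The following is a minimal sketch of a reproducible 30/70 split on a stand-in data frame; toy_credit and its columns are hypothetical placeholders rather than the lesson's data, and the seed value is arbitrary.

```r
# Hypothetical stand-in for creditdata_new: 1000 rows, 3 columns
set.seed(42)  # arbitrary seed so the same rows are drawn on every run
toy_credit <- data.frame(
  amount   = runif(1000, 250, 20000),
  duration = sample(6:72, 1000, replace = TRUE),
  default  = factor(sample(c("no", "yes"), 1000, replace = TRUE))
)

# Draw 30% of the row indexes without replacement
n_test  <- round(0.3 * nrow(toy_credit))
indexes <- sample(seq_len(nrow(toy_credit)), size = n_test)

# Rows at the sampled indexes form the test set; the rest form the training set
toy_test  <- toy_credit[indexes, ]
toy_train <- toy_credit[-indexes, ]

dim(toy_test)   # 300 rows, 3 columns
dim(toy_train)  # 700 rows, 3 columns
```

Negative indexing (-indexes) drops the sampled rows, so no row can appear in both sets.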

Other Ways to Split Data

  1. We can use the rpart function of the rpart package to partition the data. RPART stands for Recursive Partitioning And Regression Trees. The rpart algorithm splits the dataset recursively: the subsets that arise from a split are themselves split until a predetermined termination criterion is reached. It allows you to construct splitting rules in many different ways. Note that rpart partitions the data while fitting a tree model, which is a different kind of split from a random training/testing split.
  2. We can also use the createDataPartition function of the caret package to split the data set. It samples within the levels of the outcome variable, so the class proportions stay similar in the training and testing sets.
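
The createDataPartition approach can be sketched as follows. This is only an illustration under stated assumptions: the caret package must be installed, toy_credit and its default outcome column are hypothetical stand-ins for the lesson's data, and the seed is arbitrary.

```r
library(caret)  # assumes the caret package is installed

set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical stand-in data; `default` plays the role of the outcome
toy_credit <- data.frame(
  amount  = runif(1000, 250, 20000),
  default = factor(rep(c("no", "yes"), times = c(700, 300)))
)

# p = 0.7 asks for about 70% of the rows, sampled within each level of `default`
train_idx <- createDataPartition(toy_credit$default, p = 0.7, list = FALSE)

toy_train <- toy_credit[train_idx, ]
toy_test  <- toy_credit[-train_idx, ]

# Class proportions should be close in both subsets
prop.table(table(toy_train$default))
prop.table(table(toy_test$default))
```

Unlike a plain call to sample, this keeps a rare outcome class represented in both subsets, which matters when defaults are much less frequent than non-defaults.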
