Predictive Modelling: Splitting Data into Training and Test Sets
One of the most important jobs of a data scientist is to build predictive models for specific business problems. For example, what is the probability that a consumer will default on their loan payment or credit card payment in the next month? This is a typical predictive modelling problem: the data scientist takes a large amount of loan or credit card transaction data and, based on the patterns in that data, tries to fit a model that can be applied to the future behavior of consumers.
Let’s say a data scientist working with a bank is asked to build this loan default prediction model. They have access to the borrowers' data. The data set contains a lot of important information about borrowers, such as their employment status, years of employment, annual income, marital status, age, amount of loan taken, whether they have defaulted on their loan, and many other variables. The data scientist’s job in this case is to analyze this data and build a model that can accurately predict the loan default status. Of course, this is past data, so we already have the loan default status in our data set. Let’s say we have 1 million records. For each of these records, we have all the personal and professional information about the borrower as well as their loan default status. So, we are trying to find the relationship between all these variables and this one variable, i.e., the loan default status.
The data scientist can take this entire 1 million record set and try to fit a predictive model (for example, a regression model such as logistic regression) with all the other variables as explanatory variables and the loan default status as the response variable (the one that we are trying to predict). Once the model is ready, it can be used to predict default for a new borrower: we feed in the explanatory variables (personal and professional information in our example), and the model outputs the predicted probability of default for that borrower.
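To make the idea of "feeding in the explanatory variables" concrete, here is a minimal sketch of how a fitted model could turn a borrower's details into a default probability. It assumes a logistic regression, one common kind of regression model for yes/no outcomes. The feature names (years_employed, loan_to_income), the coefficients, and the intercept are all made up for illustration; a real model would learn them from the data.

```python
import math

def default_probability(features, weights, intercept):
    """Logistic model: squash a weighted sum of features into a 0-1 probability."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Made-up coefficients, purely for illustration (assumption, not learned values).
weights = {"years_employed": -0.15, "loan_to_income": 2.0}
intercept = -2.0

# A hypothetical new borrower described by the same explanatory variables.
new_borrower = {"years_employed": 4, "loan_to_income": 0.8}

p = default_probability(new_borrower, weights, intercept)
print(f"estimated default probability: {p:.2f}")
```

With these illustrative numbers, a longer employment history pushes the probability down and a larger loan relative to income pushes it up.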
This is how the process works. However, there is one challenge if the entire 1 million record set is used to build and train the model: once the model is ready, there is no data left to test it on before applying it to real-life future borrowers. Testing it on the same records it was trained on would give an overly optimistic picture, because the model has already seen those outcomes. To solve this problem, data scientists don’t usually use the entire available dataset to train the model. Instead, they divide the dataset into two sets: 1) a training set and 2) a testing set.
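The division into two sets can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name train_test_split, the 20% test fraction, the fixed seed, and the stand-in borrower records are all hypothetical choices, not part of any particular library.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for borrower records: (borrower_id, defaulted)
records = [(i, i % 10 == 0) for i in range(1000)]

train, test = train_test_split(records, test_fraction=0.2)
print(len(train), len(test))  # 800 200
```

Shuffling before cutting matters: if the records were sorted (say, by loan date), taking the last 20% without shuffling would give a testing set that looks different from the training set.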
Once the data scientist has the two data sets, they use the training set to build and train the model. Once the model is ready, they test it on the testing set to measure its accuracy on records it has not seen before. The objective is a model that predicts accurately on new data, not just on the data it was trained on. Only when the model has been trained well and tested thoroughly is it used for actual prediction applications.
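The train-then-test workflow can be illustrated with a deliberately tiny model. The sketch below is an assumption-laden toy, not a realistic credit model: the data is synthetic, the single feature is a hypothetical debt-to-income ratio, and the "model" is just a cut-off value chosen on the training rows. The point is only that the cut-off is fitted on one set and scored on the other.

```python
import random

def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(1 for x, y in rows if predict(x) == y) / len(rows)

def fit_threshold(train_rows):
    """Pick the cut-off on x that gives the best accuracy on the training rows."""
    candidates = sorted({x for x, _ in train_rows})
    return max(candidates, key=lambda t: accuracy(lambda x: x >= t, train_rows))

# Synthetic, made-up data: x is a hypothetical debt-to-income ratio in [0, 1),
# y is True if the borrower defaulted. Higher ratios default more often.
rng = random.Random(0)
rows = []
for _ in range(500):
    x = rng.random()
    y = rng.random() < x
    rows.append((x, y))

rng.shuffle(rows)
train_rows, test_rows = rows[:400], rows[400:]  # 80/20 split

threshold = fit_threshold(train_rows)  # the test rows are never used here
model = lambda x: x >= threshold       # predict "default" above the cut-off

train_acc = accuracy(model, train_rows)
test_acc = accuracy(model, test_rows)
print("train accuracy:", round(train_acc, 3))
print("test accuracy: ", round(test_acc, 3))
```

The test accuracy is the number to trust when judging how the model might behave on future borrowers, since the training accuracy was the very quantity the fitting step optimized.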
There are various methods that can be used to split the data into training and testing sets. Generally, the records are assigned to the training and testing sets randomly so that the two sets resemble each other in their properties as much as possible. Split sizes can also differ based on the scenario: it could be 50:50, 60:40, or two-thirds and one-third. While there are many empirical studies and papers on the best way to split data, an 80/20 or 70/30 split is widely used. The rule of thumb is that the more training data you have, the better your model will be, although the testing set must remain large enough to give a reliable estimate of accuracy.
Another good technique is cross-validation. In cross-validation, you split your data into n bins. You then train on all the bins except one and use the remaining bin to test. Repeat this procedure n times, holding out a different bin each time, and take the average of your accuracies. (When each bin contains a single record, this is known as leave-one-out cross-validation.)
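The cross-validation procedure can be sketched as follows. This is a minimal illustration in plain Python: the helper names (k_fold_indices, cross_validate), the majority-class baseline used as the "model", and the toy records are all hypothetical, chosen so the example stays short and self-contained.

```python
import random

def k_fold_indices(n_records, k, seed=0):
    """Shuffle record indices and deal them into k roughly equal bins."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, k, fit, score):
    """Train on k-1 bins, test on the held-out bin, and average the k scores."""
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train = [rows[i] for i in range(len(rows)) if i not in held]
        test = [rows[i] for i in held_out]
        scores.append(score(fit(train), test))
    return sum(scores) / k

def fit_majority(train):
    """Hypothetical baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def score_accuracy(predicted_label, test):
    """Accuracy of a model that predicts the same label for every record."""
    return sum(1 for _, y in test if y == predicted_label) / len(test)

# Toy records: (borrower_id, defaulted), with 25% defaults.
rows = [(i, i % 4 == 0) for i in range(100)]

mean_acc = cross_validate(rows, k=5, fit=fit_majority, score=score_accuracy)
print(round(mean_acc, 2))  # 0.75
```

Because every record is used for testing exactly once, the averaged score depends less on one lucky or unlucky split than a single training/testing division does.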