Finance Train

High Quality tutorials for finance, risk, data science
Predictive Modelling: Splitting Data into Training and Test Set


One of the most important jobs of a data scientist is to build predictive models for specific business problems. For example, what is the probability that a consumer will default on their loan or credit card payment in the next month? This is a typical predictive modelling problem: the data scientist takes a large amount of loan or credit card transaction data and, based on the patterns in that data, tries to fit a model that can predict the future behavior of consumers.

Let’s say a data scientist working with a bank is asked to build this loan default prediction model. He has access to the borrowers’ data. The data set contains a great deal of important information about the borrowers, such as their employment status, years of employment, annual income, marital status, age, amount of loan taken, whether they have defaulted on their loan, and many other variables. The data scientist’s job in this case is to analyze this data and build a model that can accurately predict the loan default status. Of course, this is past data, and the loan default status is already in the data set. Let’s say we have 1 million records. For each of these 1 million records, we have all the personal and professional information about the borrowers, and we also have their loan default status. So, we are trying to find the impact of all these variables on one variable: the loan default status.

The data scientist can take this entire one-million-record set and try to fit a predictive model (for example, a regression model) with all the variables as explanatory variables and the loan default status as the response variable (the one we are trying to predict). Once the model is ready, it can be used to predict default for a new borrower: we feed in all the explanatory variables (the personal and professional information in our example), and the model predicts the probability of default by that borrower.
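As a sketch of that final step, suppose a logistic regression has already been fitted. The coefficient values and feature names below are purely illustrative, not estimated from any real borrower data:

```python
import math

# Hypothetical coefficients for an already-fitted logistic regression
# (illustrative values only, not from real data)
COEF = {"intercept": -3.0, "years_employed": -0.15, "loan_to_income": 2.0}

def default_probability(years_employed, loan_to_income):
    """Map a borrower's features to a probability of default
    using the logistic (sigmoid) function."""
    z = (COEF["intercept"]
         + COEF["years_employed"] * years_employed
         + COEF["loan_to_income"] * loan_to_income)
    return 1.0 / (1.0 + math.exp(-z))

# A long-employed borrower with a modest loan gets a low default probability
p = default_probability(years_employed=5, loan_to_income=0.4)
```

The sigmoid guarantees the output is a probability between 0 and 1, which is why logistic regression is a common starting point for default prediction.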

This is how the process works. However, there is one challenge if the entire one-million-record set is used to build/train the model: once the model is ready, there is no dataset left on which to test it before applying it to real-life future borrowers. To solve this problem, data scientists don’t usually use the entire available dataset to train the model. Instead, they divide it into two sets: 1) a training set and 2) a testing set.
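A minimal way to perform such a split, sketched here in plain Python (scikit-learn provides a ready-made function of the same name):

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Randomly partition records into a training set and a testing set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

records = list(range(1000))   # stand-in for the 1 million borrower records
train, test = train_test_split(records)   # 800 training, 200 testing records
```

Shuffling before splitting is what makes the assignment random, so each record ends up in exactly one of the two sets.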

Once the data scientist has the two data sets, they use the training set to build and train the model. Once the model is ready, they test it on the testing set to measure its accuracy and how well it performs. The objective is for the model to perform with the highest possible accuracy on data it has not seen. Only when the model has been trained well and tested thoroughly is it used for actual prediction applications.
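Testing on the held-out set amounts to comparing the model’s predictions against the known default statuses. A toy sketch, using a made-up threshold rule in place of a trained model:

```python
def accuracy(model, labelled_data):
    """Fraction of records whose default status the model predicts correctly."""
    correct = sum(1 for features, status in labelled_data
                  if model(features) == status)
    return correct / len(labelled_data)

# Stand-in "model": predict default (1) when the loan-to-income ratio exceeds 1
model = lambda loan_to_income: 1 if loan_to_income > 1.0 else 0

# Hypothetical testing set: (loan-to-income ratio, actual default status)
test_set = [(0.4, 0), (1.5, 1), (0.9, 0), (2.0, 1), (1.2, 0)]
acc = accuracy(model, test_set)   # 4 of 5 predictions correct -> 0.8
```

In practice you would report more than plain accuracy for default prediction (defaults are rare, so precision and recall matter), but the workflow of scoring held-out data is the same.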

There are various methods for splitting the data into training and testing sets. Generally, records are assigned to the two sets randomly, so that the sets resemble each other in their properties as much as possible. Split sizes can also differ by scenario: 50:50, 60:40, or two-thirds and one-third. While there are many empirical studies and papers on the best way to split data, 80/20 and 70/30 splits are widely used. A rule of thumb is that the more training data you have, the better your model will be. Another good technique is cross-validation. In k-fold cross-validation, you split your data into k bins (folds). You then train on all the bins except one and use the remaining bin to test. Repeat this procedure k times, leaving out each bin once, and take the average of your accuracies. (Leave-one-out cross-validation is the extreme case where each bin holds a single record.)
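The cross-validation procedure can be sketched as follows, with a placeholder `evaluate` standing in for whatever train-and-score routine you use:

```python
def cross_validate(records, k, evaluate):
    """k-fold cross-validation: each fold serves once as the testing set."""
    folds = [records[i::k] for i in range(k)]     # k roughly equal bins
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(evaluate(train, test))
    return sum(scores) / k                        # average score over the folds

# Structural check with a placeholder evaluate: with 100 records and k = 5,
# every training set holds the other 4 folds, i.e. 80 records
avg = cross_validate(list(range(100)), k=5,
                     evaluate=lambda train, test: len(train))
```

In real use, `evaluate` would fit the model on `train` and return its accuracy on `test`; for randomly ordered data, the striped `[i::k]` slicing is a simple way to form the bins.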



Copyright © 2021 Finance Train. All rights reserved.
