Before building any predictive model, it is important to understand whether the problem at hand is a classification problem or a regression problem. Let’s understand the difference between the two:
In a classification problem, we are trying to predict the class of a data point, which can take only a discrete set of values. The Y variable that we are trying to predict generally comes in categorical form and has a finite number of classes. For example, we can classify a loan as Default or No Default, or we can classify an image as a cat or a dog. The credit risk problem that we are trying to solve is a classification problem. We call it binary classification when there are only two classes to predict (Default or No Default, encoded as 0 or 1). If we have more than two classes, we call it a multi-class classification problem. Models that solve such problems are commonly referred to as “classifiers”.
In a regression problem, we are trying to predict a continuous-valued output, for example, the price of a house or the price of a stock.
When we are solving a data science problem, we will first define our problem as a classification or a regression problem, depending on the output that we are trying to predict.
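As a rough illustration of deciding the problem type from the output variable, here is a minimal Python sketch. The function name infer_problem_type and the max_classes threshold are hypothetical choices made for this example, not part of any library, and the heuristic assumes the target has no missing values. In practice the decision should come from what the target means, not only from how it is stored.

```python
from typing import Sequence


def infer_problem_type(y: Sequence, max_classes: int = 20) -> str:
    """Guess the problem type from the target values (illustrative heuristic only)."""
    unique_values = set(y)
    if len(unique_values) < 2:
        raise ValueError("The target needs at least two distinct values.")

    # Strings and booleans are treated as class labels, never as quantities.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in unique_values
    )
    if all_numeric:
        # Fractional values, or a very large number of distinct values,
        # suggest a continuous output (assumed threshold: max_classes).
        has_fraction = any(v != int(v) for v in unique_values)
        if has_fraction or len(unique_values) > max_classes:
            return "regression"

    if len(unique_values) == 2:
        return "binary classification"
    return "multi-class classification"


# Made-up targets for illustration only.
loan_default = [0, 1, 1, 0, 0]                 # Default / No Default
animal = ["cat", "dog", "bird", "dog"]         # more than two classes
house_price = [250000.0, 199999.5, 310500.75]  # continuous output

print(infer_problem_type(loan_default))
print(infer_problem_type(animal))
print(infer_problem_type(house_price))
```

A heuristic like this can misjudge some targets: a rating from 1 to 5 stored as integers would be labelled multi-class classification even if we prefer to model it as a regression.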
In our case, we can conclude that predicting default is a classification problem. Let’s now start with our first case study and understand the steps involved in model building.