Logistic Regression Model in R
Logistic regression aims to model the probability of an event occurring depending on the values of independent variables.
The logistic regression model seeks to estimate the probability that an event (default) will occur for a randomly selected observation versus the probability that the event does not occur. Suppose we have data for 1000 loans, including all the predictor variables and whether or not the borrower defaulted. Here, default is the response variable (also called the dependent variable). It is binary: its value is either 0 (no default) or 1 (default).
Logistic regression can therefore be used as a classification algorithm that predicts a binary outcome (1 / 0, Default / No Default) given a set of independent variables. It is a generalized linear model, which extends linear regression to a categorical outcome variable. It predicts the probability of default by fitting the data to a logistic function, so that the log-odds (logit) of default is a linear function of the predictors.
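As a minimal sketch of the function behind the model, the snippet below maps values on the log-odds scale to probabilities and back. The helper names logistic and logit are illustrative and are not part of the case study code; base R already provides the equivalents plogis() and qlogis().

```r
# Logistic (inverse-logit) function: maps any real number to the interval (0, 1).
logistic <- function(z) 1 / (1 + exp(-z))

# Logit: the log-odds of a probability; the inverse of logistic().
logit <- function(p) log(p / (1 - p))

z <- c(-4, -2, 0, 2, 4)   # example values on the log-odds scale
p <- logistic(z)          # corresponding probabilities
print(round(p, 4))

# The hand-written helpers agree with base R's plogis() and qlogis().
print(all.equal(p, plogis(z)))
print(all.equal(logit(p), qlogis(p)))
```

A log-odds value of 0 corresponds to a probability of 0.5, and larger positive values push the probability towards 1.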
R is a versatile language, and there are many packages that we can use to perform logistic regression. In this case study we will use the glm() function, which ships with base R in the stats package. R also has a very useful package called caret (short for classification and regression training), which streamlines the process of training models using different algorithms. We will use the caret package in the second case study.
We will train the model using the training data set, credit_train. To keep the process simple, we will use only a handful of selected variables in the model. To start with, let's use five variables to predict the value of Creditability.
> set.seed(1)
> LogisticModel <- glm(Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit + Purpose + Length.of.current.employment + Sex...Marital.Status, family = binomial, data = credit_train)
Let’s look into what this model contains:
> LogisticModel
Call:  glm(formula = Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit +
    Purpose + Length.of.current.employment + Sex...Marital.Status,
    family = binomial, data = credit_train)

Coefficients:
                       (Intercept)                    Account.Balance2
                           -2.1396                              0.5810
                  Account.Balance3                    Account.Balance4
                            1.0231                              1.8129
Payment.Status.of.Previous.Credit1  Payment.Status.of.Previous.Credit2
                            0.2335                              1.1462
Payment.Status.of.Previous.Credit3  Payment.Status.of.Previous.Credit4
                            0.9977                              1.8753
                          Purpose1                            Purpose2
                            0.8587                              0.3922
                          Purpose3                            Purpose4
                            0.6738                              0.2800
                          Purpose5                            Purpose6
                            0.4248                             -0.5447
                          Purpose8                            Purpose9
                            1.9817                              0.1309
                         Purpose10       Length.of.current.employment2
                            0.3163                             -0.1731
     Length.of.current.employment3       Length.of.current.employment4
                            0.1855                              0.5960
     Length.of.current.employment5               Sex...Marital.Status2
                            0.2057                              0.2572
             Sex...Marital.Status3               Sex...Marital.Status4
                            0.4461                              0.5433

Degrees of Freedom: 699 Total (i.e. Null);  676 Residual
Null Deviance:     871.5
Residual Deviance: 723.3     AIC: 771.3
>
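Mathematically, the fitted model can be represented as follows, where each X is a 0/1 indicator for one of the factor levels listed in the coefficients:
P(Creditability = 1) = 1 / ( 1 + e^-( B0 + B1*X1 + B2*X2 + ... + Bk*Xk ) )
In our equation, B0 = -2.1396 (the intercept), B1 = 0.5810 (the coefficient of Account.Balance2), and so on.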
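Coefficient names such as Account.Balance2 appear because R expands each factor into 0/1 indicator (dummy) columns, with the first level absorbed into the intercept as the baseline. The sketch below illustrates this with model.matrix() on a small hypothetical factor named balance; it is not the German credit data.

```r
# Hypothetical factor with four categories, standing in for Account.Balance.
balance <- factor(c(1, 2, 3, 4, 4))

# model.matrix() shows the design matrix that glm() would build internally.
# With R's default treatment contrasts, level "1" is the baseline and gets
# no column of its own.
design <- model.matrix(~ balance)
print(design)

# An observation in level "4" has a 1 only in the intercept and balance4 columns.
print(design[5, ])
```

This is why the hand calculation for an applicant multiplies exactly one coefficient per variable by 1 and ignores the rest.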
We can now use this model to predict the value of Creditability for new data. Suppose we have a new applicant with the following made-up feature values. Each of the five matching coefficients is multiplied by 1, and all other indicators are 0:
P(Creditability = 1 | Account.Balance = 4, Payment.Status.of.Previous.Credit = 2, Purpose = 1, Length.of.current.employment = 4, Sex...Marital.Status = 2)
= 1 / ( 1 + e^-( -2.1396 + 1.8129*1 + 1.1462*1 + 0.8587*1 + 0.5960*1 + 0.2572*1 ) )
= 0.9263139694
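The same arithmetic can be reproduced in R. This is a sketch that simply copies the rounded coefficients from the printed model output; the variable names intercept, applicant_terms, log_odds and probability are illustrative.

```r
# Coefficients copied from the printed model output (rounded to 4 decimals).
intercept <- -2.1396
applicant_terms <- c(
  Account.Balance4                   = 1.8129,
  Payment.Status.of.Previous.Credit2 = 1.1462,
  Purpose1                           = 0.8587,
  Length.of.current.employment4      = 0.5960,
  Sex...Marital.Status2              = 0.2572
)

# Linear predictor (log-odds): every indicator equals 1 for this applicant,
# so the matching coefficients are simply added to the intercept.
log_odds <- intercept + sum(applicant_terms)

# Convert the log-odds to a probability with the logistic function.
probability <- 1 / (1 + exp(-log_odds))

print(log_odds)
print(probability)
```

The log-odds come to 2.5314, which the logistic function converts to a probability of roughly 0.926.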
We can also use the predict() function to do the same job. The general form is predict(model, newdata = new.df, type = 'response').
> newapplicant <- data.frame(Account.Balance=as.factor(4), Payment.Status.of.Previous.Credit=as.factor(2), Purpose=as.factor(1), Length.of.current.employment=as.factor(4), Sex...Marital.Status=as.factor(2))
> result <- predict(LogisticModel, type = 'response', newdata = newapplicant)
> result
1
0.9263105
>
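The type argument controls the scale of the prediction. The sketch below uses simulated toy data (the data frame toy and the model fit are hypothetical, not the credit data) to show that type = "link" returns log-odds, and that passing those through the logistic function gives the same values as type = "response".

```r
set.seed(42)

# Simulated toy data: one numeric predictor and a binary outcome.
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.2 * toy$x))

# Fit a logistic regression, as in the case study but on the toy data.
fit <- glm(y ~ x, family = binomial, data = toy)

new_obs <- data.frame(x = c(-1, 0, 1))

# "link" gives the linear predictor (log-odds); "response" gives probabilities.
link_scale <- predict(fit, newdata = new_obs, type = "link")
prob_scale <- predict(fit, newdata = new_obs, type = "response")

print(link_scale)
print(prob_scale)

# Applying the logistic function to the log-odds recovers the probabilities.
print(all.equal(unname(plogis(link_scale)), unname(prob_scale)))
```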
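With type = 'response', predict() returns the predicted probability on the 0 to 1 scale. The tiny difference from the hand-calculated value arises because the printed coefficients are rounded to four decimal places. For a glm model, predict() has no type that returns a class label directly, so we choose a threshold ourselves. For example, with a threshold of 0.6, if result > 0.6 we set credibility = 1, otherwise credibility = 0.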
> if(result>0.6) {credibility = 1} else {credibility = 0}
> credibility
[1] 1
That done, we can apply the model we just trained to the test set, credit_test, and make our first set of predictions.
> predicted_values <- predict(LogisticModel, type = 'response', newdata = credit_test)
We have now trained our first model and used it to score the test data. predicted_values is a named numeric vector holding the predicted probability for each borrower in the credit_test dataset.
> plot(predicted_values)
Class Labels
The predicted values lie between 0 and 1. We can apply the cutoff (0.6) to convert each predicted value into a label (0 or 1). We start with a vector of "0" labels, one per row of credit_test, and then overwrite the entries whose predicted probability exceeds the cutoff.
> pred_value_labels <- rep("0", nrow(credit_test))
> pred_value_labels[predicted_values > 0.6] <- "1"
> pred_value_labels <- as.factor(pred_value_labels)
> pred_value_labels
[1] 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 1 0 0 0 1
[50] 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1
[99] 1 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0
[148] 0 1 1 1 1 0 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0
[197] 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0
[246] 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 0 1 0 0
[295] 1 1 1 0 0 0
Levels: 0 1
>
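The labelling step can also be written as a small vectorised helper. The following is a sketch on hypothetical probabilities (probs stands in for predicted_values, and label_at is an illustrative name); it also shows how the number of borrowers labelled 1 depends on the chosen cutoff.

```r
# Hypothetical predicted probabilities, standing in for predicted_values.
probs <- c(0.15, 0.42, 0.58, 0.61, 0.77, 0.93)

# Label each probability "1" if it exceeds the cutoff, otherwise "0".
# Fixing the levels keeps both labels present even if one never occurs.
label_at <- function(p, cutoff) {
  factor(ifelse(p > cutoff, "1", "0"), levels = c("0", "1"))
}

labels_06 <- label_at(probs, 0.6)
labels_05 <- label_at(probs, 0.5)

# Count how many observations fall under each label at each cutoff.
print(table(labels_06))
print(table(labels_05))
```

Lowering the cutoff from 0.6 to 0.5 moves the observation with probability 0.58 from label 0 to label 1, so the cutoff is a modelling choice rather than a fixed property of the model.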
In the next lesson we will learn about how we can statistically measure the performance of the fitted model.