Logistic Regression Model in R

Logistic regression aims to model the probability of an event occurring depending on the values of independent variables.

The logistic regression model estimates the probability that an event (here, default) will occur for a randomly selected observation versus the probability that it will not. Suppose we have data for 1000 loans, including all the predictor variables and whether or not the borrower defaulted. The default indicator is the response variable, also called the dependent variable. It is binary: its value is either 0 (no default) or 1 (default).

In logistic regression, the dependent variable is binary, i.e. it only contains data marked as 1 (Default) or 0 (No default).

We can say that logistic regression is a classification algorithm used to predict a binary outcome (1 / 0, Default / No Default) given a set of independent variables. It is a generalized linear model, closely related to linear regression, that is used when the outcome variable is categorical. It predicts the probability of a default by fitting the data to a logit function.
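As a rough illustration of the logit idea, the sketch below uses base R's plogis() (the logistic function) and qlogis() (its inverse, the logit) to map a few values of a linear predictor to probabilities and back. The input values are made up for illustration and do not come from the credit data.

```r
# Illustrative linear-predictor values (made up, not from the credit data)
z <- c(-4, -1, 0, 1, 4)

# Logistic function: maps any real number to a probability in (0, 1)
p <- plogis(z)          # same as 1 / (1 + exp(-z))

# Logit function: maps a probability back to log-odds
z_back <- qlogis(p)     # same as log(p / (1 - p))

print(round(p, 4))
print(all.equal(z, z_back))
```

Whatever value the linear combination of predictors takes, the logistic function keeps the result between 0 and 1, which is why it can be read as a probability.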

R is a versatile language, and there are many packages that we can use to perform logistic regression. In this case study we will use the glm() function, which ships with R in the stats package. R also has a very useful package called caret (short for classification and regression training), which streamlines the process of training models using different algorithms. We will use the caret package in the second case study.

We will train the model using the training data set, credit_train. Just to keep the whole process simple, we will use only a handful of selected variables in the model. To start with let's just use five variables to determine the value of Creditability.

> set.seed(1)
> LogisticModel <- glm(Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit + Purpose + Length.of.current.employment + Sex...Marital.Status, family = binomial, data = credit_train)

Let’s look into what this model contains:

> LogisticModel

Call:  glm(formula = Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit + 
    Purpose + Length.of.current.employment + Sex...Marital.Status, 
    family = binomial, data = credit_train)

Coefficients:
                       (Intercept)                    Account.Balance2  
                           -2.1396                              0.5810  
                  Account.Balance3                    Account.Balance4  
                            1.0231                              1.8129  
Payment.Status.of.Previous.Credit1  Payment.Status.of.Previous.Credit2  
                            0.2335                              1.1462  
Payment.Status.of.Previous.Credit3  Payment.Status.of.Previous.Credit4  
                            0.9977                              1.8753  
                          Purpose1                            Purpose2  
                            0.8587                              0.3922  
                          Purpose3                            Purpose4  
                            0.6738                              0.2800  
                          Purpose5                            Purpose6  
                            0.4248                             -0.5447  
                          Purpose8                            Purpose9  
                            1.9817                              0.1309  
                         Purpose10       Length.of.current.employment2  
                            0.3163                             -0.1731  
     Length.of.current.employment3       Length.of.current.employment4  
                            0.1855                              0.5960  
     Length.of.current.employment5               Sex...Marital.Status2  
                            0.2057                              0.2572  
             Sex...Marital.Status3               Sex...Marital.Status4  
                            0.4461                              0.5433  

Degrees of Freedom: 699 Total (i.e. Null);  676 Residual
Null Deviance:        871.5 
Residual Deviance: 723.3     AIC: 771.3

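Mathematically this can be represented as follows:

P(Creditability = 1) = 1 / ( 1 + e^-( B0 + B1*X1 + B2*X2 + ... + Bn*Xn ))

Here each X is a 0/1 indicator for one of the factor levels listed in the output, and each B is the matching coefficient. In our equation, B0 = -2.1396, B1 = 0.5810, and so on.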
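One common way to read these coefficients is as odds ratios. The sketch below exponentiates two coefficients copied from the printed output. It assumes R's default treatment contrasts, so each value is relative to the baseline level of its factor (the level that does not appear in the output). The variable names are chosen here for illustration.

```r
# Coefficients copied from the printed model output
b_account_balance4 <- 1.8129
b_purpose6 <- -0.5447

# exp(coefficient) is the multiplicative change in the odds of
# Creditability = 1 relative to the factor's baseline level,
# with the other variables held fixed
or_account_balance4 <- exp(b_account_balance4)
or_purpose6 <- exp(b_purpose6)

print(round(c(or_account_balance4, or_purpose6), 3))
```

A positive coefficient gives an odds ratio above 1 (higher odds of Creditability = 1 than the baseline), and a negative coefficient gives an odds ratio below 1.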

We can now use this model to predict the value of Creditability for new data. Let’s say we have a new applicant and want to predict their Creditability. Let’s make up some feature values for this applicant:

P(Creditability = 1 | Account.Balance = 4, Payment.Status.of.Previous.Credit = 2, Purpose = 1, Length.of.current.employment = 4, Sex...Marital.Status = 2)

= 1 / ( 1 + e^-( -2.1396 + 1.8129*1 + 1.1462*1 + 0.8587*1 + 0.5960*1 + 0.2572*1 ))

= 0.9263139694
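The same arithmetic can be checked in R. This is a minimal sketch that hard-codes the rounded coefficients from the printed output. The names intercept, coefs, linear_predictor and p_creditable are chosen here for illustration.

```r
# Coefficients taken from the printed model output (rounded to 4 decimals)
intercept <- -2.1396
coefs <- c(
  Account.Balance4                   = 1.8129,
  Payment.Status.of.Previous.Credit2 = 1.1462,
  Purpose1                           = 0.8587,
  Length.of.current.employment4      = 0.5960,
  Sex...Marital.Status2              = 0.2572
)

# Each matching dummy variable equals 1 for this applicant, so the linear
# predictor is the intercept plus the sum of these coefficients
linear_predictor <- intercept + sum(coefs)

# Convert log-odds to a probability with the logistic function
p_creditable <- 1 / (1 + exp(-linear_predictor))
print(p_creditable)
```

Because the printed coefficients are rounded, a hand calculation like this can differ from the output of predict() in the last few decimal places.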

We can also use the predict() function to do the same job. Formula: predict(model, new.df)

> newapplicant <- data.frame(Account.Balance=as.factor(4), Payment.Status.of.Previous.Credit=as.factor(2), Purpose=as.factor(1), Length.of.current.employment=as.factor(4), Sex...Marital.Status=as.factor(2))
> result <- predict(LogisticModel, type = 'response', newdata = newapplicant)
> result
        1 
0.9263105 

With type = 'response', predict() returns the predicted probability rather than a class label. For a glm model, predict() does not assign class labels itself, so you choose the threshold. We can pick a threshold value, say 0.6, and say that if result > 0.6, credibility = 1, else credibility = 0.

> if(result>0.6) {credibility = 1} else {credibility = 0}
> credibility
[1] 1
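To see what type = 'response' returns compared with the default type = 'link', here is a small sketch on simulated data. The data set, the variable names and the true coefficients are all hypothetical. They stand in for the credit data only so that the example runs on its own.

```r
set.seed(42)

# Simulated data (hypothetical, not the credit data set)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
sim_data <- data.frame(x = x, y = y)

fit <- glm(y ~ x, family = binomial, data = sim_data)

new_obs <- data.frame(x = c(-1, 0, 1))

# The default type = "link" returns log-odds;
# type = "response" returns probabilities
log_odds <- predict(fit, newdata = new_obs, type = "link")
probs    <- predict(fit, newdata = new_obs, type = "response")

# The two are related through the logistic function
print(all.equal(unname(plogis(log_odds)), unname(probs)))
```

Asking for type = 'response' saves the extra step of converting log-odds to probabilities before applying a threshold.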

That done, we can move on to applying the model we just fitted to the test set, credit_test, to predict a probability for each borrower.

> predicted_values <- predict(LogisticModel, type = 'response', newdata = credit_test)

We have now fitted our first model and used it to make predictions. predicted_values is a numeric vector holding the predicted probability for each borrower in the credit_test dataset.

> plot(predicted_values)

Class Labels

The predicted values lie between 0 and 1. We can apply the cutoff (0.6) and assign labels (0 and 1) to all the predicted values.

> # Start with "0" for every row of the test set
> pred_value_labels = rep("0",nrow(credit_test))
> # Relabel as "1" wherever the predicted probability exceeds the cutoff
> pred_value_labels[predicted_values>.6] = "1"
> pred_value_labels <- as.factor(pred_value_labels)
> pred_value_labels
  [1] 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 1 0 0 0 1
 [50] 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1
 [99] 1 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0
[148] 0 1 1 1 1 0 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0
[197] 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0
[246] 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 0 1 0 0
[295] 1 1 1 0 0 0
Levels: 0 1
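A more compact way to assign the labels is a single vectorised ifelse() call. The sketch below uses a short made-up vector of probabilities in place of predicted_values, and the names probs, cutoff and labels are chosen here for illustration.

```r
# Hypothetical predicted probabilities standing in for predicted_values
probs <- c(0.91, 0.35, 0.62, 0.58, 0.77)
cutoff <- 0.6

# ifelse() labels every element in one step; the explicit levels keep both
# classes in the factor even if one of them never occurs
labels <- factor(ifelse(probs > cutoff, "1", "0"), levels = c("0", "1"))

print(labels)
print(table(labels))
```

Setting the levels explicitly is a small safeguard: if no prediction happened to cross the cutoff, as.factor() alone would produce a factor with a single level, which can cause trouble when the labels are later compared with the actual outcomes.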

In the next lesson we will learn about how we can statistically measure the performance of the fitted model.
