Logistic Regression Model in R

Logistic regression aims to model the probability of an event occurring depending on the values of independent variables.

The logistic regression model estimates the probability that an event (here, default) occurs for a randomly selected observation versus the probability that it does not occur. Suppose we have data for 1000 loans, including all the predictor variables and whether or not the borrower defaulted. Here default is the response variable (the dependent variable), and the model estimates its probability. Default is a binary variable, that is, its value is either 0 or 1 (0 is no default, and 1 is default).

In logistic regression, the dependent variable is binary, i.e. it only contains data marked as 1 (Default) or 0 (No default).

Logistic regression can therefore be used as a classification algorithm that predicts a binary outcome (1 / 0, Default / No Default) given a set of independent variables. It is a generalized linear model: it extends linear regression to a categorical outcome by modelling the log-odds (logit) of the outcome as a linear function of the predictors. The probability of default is then obtained by passing that linear combination through the logistic function.
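To make the link between a linear combination of predictors and a probability concrete, here is a minimal sketch of the logistic function and its inverse, the logit. The helper names logistic and logit are illustrative and not part of this case study; base R provides the same functions as plogis() and qlogis().

```r
# Logistic (inverse-logit) function: maps any real number into (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Logit function: maps a probability in (0, 1) back to the log-odds
logit <- function(p) log(p / (1 - p))

z <- c(-3, 0, 3)
p <- logistic(z)

# The base R function plogis() gives the same values
same_as_base <- isTRUE(all.equal(p, plogis(z)))

# Applying logit() to the probabilities recovers the original log-odds
round_trip <- logit(p)
```

A log-odds of 0 corresponds to a probability of 0.5, and large positive or negative values push the probability towards 1 or 0.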

R is a versatile language and there are many packages that we can use to perform logistic regression. In this case study we will use the glm() function in R. R also has a very useful package called caret (short for classification and regression training), which streamlines the process of training models using different algorithms. We will use the caret package in the second case study.

We will train the model using the training data set, credit_train. Just to keep the whole process simple, we will use only a handful of selected variables in the model. To start with let's just use five variables to determine the value of Creditability.

> set.seed(1)
> LogisticModel <- glm(Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit + Purpose + Length.of.current.employment + Sex...Marital.Status, family = binomial, data = credit_train)

Let’s look into what this model contains:

> LogisticModel
Call:  glm(formula = Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit + 
    Purpose + Length.of.current.employment + Sex...Marital.Status, 
    family = binomial, data = credit_train)
Coefficients:
                       (Intercept)                    Account.Balance2  
                           -2.1396                              0.5810  
                  Account.Balance3                    Account.Balance4  
                            1.0231                              1.8129  
Payment.Status.of.Previous.Credit1  Payment.Status.of.Previous.Credit2  
                            0.2335                              1.1462  
Payment.Status.of.Previous.Credit3  Payment.Status.of.Previous.Credit4  
                            0.9977                              1.8753  
                          Purpose1                            Purpose2  
                            0.8587                              0.3922  
                          Purpose3                            Purpose4  
                            0.6738                              0.2800  
                          Purpose5                            Purpose6  
                            0.4248                             -0.5447  
                          Purpose8                            Purpose9  
                            1.9817                              0.1309  
                         Purpose10       Length.of.current.employment2  
                            0.3163                             -0.1731  
     Length.of.current.employment3       Length.of.current.employment4  
                            0.1855                              0.5960  
     Length.of.current.employment5               Sex...Marital.Status2  
                            0.2057                              0.2572  
             Sex...Marital.Status3               Sex...Marital.Status4  
                            0.4461                              0.5433  
Degrees of Freedom: 699 Total (i.e. Null);  676 Residual
Null Deviance:        871.5 
Residual Deviance: 723.3     AIC: 771.3
>

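Mathematically, the fitted model can be represented as follows:

P(Creditability = 1) = 1 / (1 + e^-(B0 + B1*X1 + B2*X2 + ... + Bk*Xk))

In our equation, B0 = -2.1396 (the intercept), B1 = 0.5810 (the coefficient for Account.Balance2), and so on. Each X is an indicator that equals 1 when the applicant has that factor level and 0 otherwise.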
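Each coefficient is on the log-odds scale, so exponentiating it gives an odds ratio relative to the reference level of that factor (presumably Account.Balance = 1 here, since it is the level missing from the output). The sketch below copies a few coefficients from the printed output above into a hypothetical vector named printed_coefs; with the fitted model available, exp(coef(LogisticModel)) would do the same for every term.

```r
# A few coefficients copied from the printed model output (rounded to 4 decimals)
printed_coefs <- c(
  Account.Balance2 = 0.5810,
  Account.Balance3 = 1.0231,
  Account.Balance4 = 1.8129
)

# Exponentiate the log-odds coefficients to obtain odds ratios
# relative to the reference level of Account.Balance
odds_ratios <- exp(printed_coefs)
print(round(odds_ratios, 2))
```

An odds ratio above 1 means that the level raises the odds of Creditability = 1 compared with the reference level, holding the other variables fixed.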

We can now use this model to predict the value of Creditability for new data. Suppose we have a new applicant with the following made-up feature values:

P(Creditability = 1 | Account.Balance = 4, Payment.Status.of.Previous.Credit = 2, Purpose = 1, Length.of.current.employment = 4, Sex...Marital.Status = 2)

= 1 / (1 + e^-(-2.1396 + 1.8129*1 + 1.1462*1 + 0.8587*1 + 0.5960*1 + 0.2572*1))

= 0.9263139694
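The same arithmetic can be checked in R. This is a sketch that re-enters the rounded coefficients from the printed output by hand (the vector name terms_used is illustrative); plogis() is base R's logistic function.

```r
# Intercept plus the coefficients matching the applicant's factor levels,
# copied from the printed model output
terms_used <- c(
  Intercept                          = -2.1396,
  Account.Balance4                   =  1.8129,
  Payment.Status.of.Previous.Credit2 =  1.1462,
  Purpose1                           =  0.8587,
  Length.of.current.employment4      =  0.5960,
  Sex...Marital.Status2              =  0.2572
)

log_odds <- sum(terms_used)   # linear predictor (log-odds)
prob     <- plogis(log_odds)  # same as 1 / (1 + exp(-log_odds))
print(prob)
```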

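We can also use the predict() function to do the same job. Its general form is predict(model, newdata = new.df). The result below differs from the hand calculation in the sixth decimal place, presumably because the hand calculation uses the coefficients as printed, rounded to four decimals.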

> newapplicant <- data.frame(Account.Balance=as.factor(4), Payment.Status.of.Previous.Credit=as.factor(2), Purpose=as.factor(1), Length.of.current.employment=as.factor(4), Sex...Marital.Status=as.factor(2))
> result <- predict(LogisticModel, type = 'response', newdata = newapplicant)
> result
        1 
0.9263105 
>

With type = 'response', predict() returns the predicted probability. Note that predict() for a glm object does not offer a 'class' type that returns the label directly, so we assign the label ourselves by choosing a threshold. For example, with a threshold of 0.6, if result > 0.6 then credibility = 1, else credibility = 0.

> if(result>0.6) {credibility = 1} else {credibility = 0}
> credibility
[1] 1
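For a glm object, predict() accepts type = "link" (log-odds) as well as type = "response" (probabilities), and the two are related by the logistic function. Because credit_train is not reproduced here, the following sketch assumes R's built-in mtcars data as a stand-in, with am as the binary outcome; the names toy_model and new_car are illustrative only.

```r
# Toy binary model on a built-in dataset (stand-in for credit_train)
toy_model <- glm(am ~ wt, family = binomial, data = mtcars)

new_car <- data.frame(wt = 3)

# Log-odds (linear predictor) and probability for the same observation
log_odds <- predict(toy_model, newdata = new_car, type = "link")
prob     <- predict(toy_model, newdata = new_car, type = "response")

# The probability is the logistic transform of the log-odds
consistent <- isTRUE(all.equal(plogis(log_odds), prob))

# Apply a 0.6 cutoff to turn the probability into a 0/1 label
label <- as.integer(prob > 0.6)
```

Keeping the probability and choosing the cutoff separately makes it easy to try different thresholds without refitting the model.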

That done, we can apply the model we just fitted to the test set, credit_test, and predict the probability for each borrower in it.

> predicted_values <- predict(LogisticModel, type = 'response', newdata = credit_test)

We have now fitted our first model and used it to make predictions. predicted_values is a named numeric vector holding the predicted probability for each borrower in the credit_test dataset.

> plot(predicted_values)

Class Labels

The predicted values lie between 0 and 1. We can apply the cutoff (0.6) and assign labels (0 and 1) to all the predicted values.

> pred_value_labels = rep("0", nrow(credit_test))
> pred_value_labels[predicted_values>.6] = "1"
> pred_value_labels <- as.factor(pred_value_labels)
> pred_value_labels
  [1] 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 1 0 0 0 1
 [50] 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1
 [99] 1 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0
[148] 0 1 1 1 1 0 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0
[197] 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0
[246] 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 0 1 0 0
[295] 1 1 1 0 0 0
Levels: 0 1
> 
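The same labelling can be written in one step with ifelse(). A minimal sketch, using a small made-up probability vector (toy_probs) in place of predicted_values:

```r
# Made-up probabilities standing in for predicted_values
toy_probs <- c(0.91, 0.35, 0.60, 0.72, 0.08)

cutoff <- 0.6

# Label as "1" when the probability exceeds the cutoff, otherwise "0";
# fixing the levels keeps both classes even if one of them is absent
toy_labels <- factor(ifelse(toy_probs > cutoff, "1", "0"), levels = c("0", "1"))

print(toy_labels)
print(table(toy_labels))
```

Because the comparison is strict, a probability exactly equal to the cutoff is labelled "0", which matches the predicted_values > .6 rule used above.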

In the next lesson we will learn about how we can statistically measure the performance of the fitted model.
