Logistic regression models the probability of an event occurring as a function of one or more independent variables.

The logistic regression model estimates the probability that an event (here, a loan default) will occur for a randomly selected observation, versus the probability that it does not. Suppose we have data for 1000 loans, including all the predictor variables and whether or not the borrower defaulted. Default status is the response variable (also called the dependent variable), and it is binary: its value is either 0 or 1 (0 for no default, 1 for default).

In logistic regression, the dependent variable is binary, i.e. it only contains data labelled 1 (default) or 0 (no default).

We can say that logistic regression is a classification algorithm used to predict a binary outcome (1/0, default/no default) from a set of independent variables. It is a generalized linear model, used instead of ordinary linear regression when the outcome variable is categorical. It predicts the probability of a default by fitting the data to a logit function.
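The case study that follows uses R, but the logit idea itself is language-neutral. As a minimal sketch (in Python, purely for illustration, with a hypothetical `predict_probability` helper), a linear score is squashed through the logistic function so the output always lands in (0, 1):

```python
import math

def predict_probability(coefficients, features):
    """Logistic regression sketch: compute the linear score
    B0 + B1*x1 + ... + Bk*xk, then map it through the logistic
    (sigmoid) function 1 / (1 + e^-score) into (0, 1)."""
    intercept, slopes = coefficients[0], coefficients[1:]
    score = intercept + sum(b * x for b, x in zip(slopes, features))
    return 1.0 / (1.0 + math.exp(-score))

# A score of 0 maps to a probability of exactly 0.5; large positive
# scores approach 1, and large negative scores approach 0.
print(predict_probability([0.0], []))  # → 0.5
```

The sigmoid shape is what makes the output interpretable as a probability, regardless of how large or small the linear score gets.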

R is a versatile language, and there are many packages we can use to perform logistic regression. In this case study we will use the built-in glm() function. R also has a very useful package called caret (short for classification and regression training), which streamlines the process of training models using different algorithms. We will use the caret package in the second case study.

We will train the model using the training data set, *credit_train*. To keep the whole process simple, we will use only a handful of selected variables in the model. To start with, let's use five variables to determine the value of *Creditability*.

```r
> set.seed(1)
> LogisticModel <- glm(Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit +
+                      Purpose + Length.of.current.employment + Sex...Marital.Status,
+                      family = binomial, data = credit_train)
```

Let’s look into what this model contains:

```r
> LogisticModel

Call:  glm(formula = Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit +
    Purpose + Length.of.current.employment + Sex...Marital.Status,
    family = binomial, data = credit_train)

Coefficients:
                       (Intercept)                    Account.Balance2
                           -2.1396                              0.5810
                  Account.Balance3                    Account.Balance4
                            1.0231                              1.8129
Payment.Status.of.Previous.Credit1  Payment.Status.of.Previous.Credit2
                            0.2335                              1.1462
Payment.Status.of.Previous.Credit3  Payment.Status.of.Previous.Credit4
                            0.9977                              1.8753
                          Purpose1                            Purpose2
                            0.8587                              0.3922
                          Purpose3                            Purpose4
                            0.6738                              0.2800
                          Purpose5                            Purpose6
                            0.4248                             -0.5447
                          Purpose8                            Purpose9
                            1.9817                              0.1309
                         Purpose10       Length.of.current.employment2
                            0.3163                             -0.1731
     Length.of.current.employment3       Length.of.current.employment4
                            0.1855                              0.5960
     Length.of.current.employment5               Sex...Marital.Status2
                            0.2057                              0.2572
             Sex...Marital.Status3               Sex...Marital.Status4
                            0.4461                              0.5433

Degrees of Freedom: 699 Total (i.e. Null);  676 Residual
Null Deviance:      871.5
Residual Deviance: 723.3    AIC: 771.3
```

Mathematically, this can be represented as follows:

P(Creditability = 1) = 1 / (1 + e^-(B0 + B1X1 + B2X2 + … + BkXk))

In our equation, B0 = -2.1396 (the intercept), B1 = 0.5810 (the coefficient on Account.Balance2), and so on.

We can now use this model to predict the value of Creditability for new data. Say we have a new applicant. Let's create some feature values for this applicant and predict their Creditability:

P(Creditability = 1 | Account.Balance = 4, Payment.Status.of.Previous.Credit = 2, Purpose = 1, Length.of.current.employment = 4, Sex...Marital.Status = 2)

= 1 / (1 + e^-(-2.1396 + 1.8129\*1 + 1.1462\*1 + 0.8587\*1 + 0.5960\*1 + 0.2572\*1))

= 0.9263139694
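As a sanity check, the same arithmetic can be reproduced outside R. This Python snippet (illustration only, using the coefficients printed by the fitted model above) recovers the same probability:

```python
import math

# Coefficients from the fitted model that apply to this applicant's
# factor levels: intercept, Account.Balance4,
# Payment.Status.of.Previous.Credit2, Purpose1,
# Length.of.current.employment4, Sex...Marital.Status2.
coefs = [-2.1396, 1.8129, 1.1462, 0.8587, 0.5960, 0.2572]

# Each active dummy variable equals 1, so the linear score is just the sum.
score = sum(coefs)
probability = 1.0 / (1.0 + math.exp(-score))
print(round(probability, 4))  # → 0.9263
```

Note that R's `glm()` dummy-codes each factor, so only the coefficient matching the applicant's level of each factor enters the sum.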

We can also use the `predict()` function to do the same job: `predict(model, new.df)`.

```r
> newapplicant <- data.frame(Account.Balance = as.factor(4),
+                            Payment.Status.of.Previous.Credit = as.factor(2),
+                            Purpose = as.factor(1),
+                            Length.of.current.employment = as.factor(4),
+                            Sex...Marital.Status = as.factor(2))
> result <- predict(LogisticModel, type = 'response', newdata = newapplicant)
> result
        1
0.9263105
```

With `type = 'response'`, `predict()` returns the numerical probability rather than a class label, which lets you set your own threshold. For example, we can choose a threshold of 0.6 and say: if result > 0.6, credibility = 1, else credibility = 0.

```r
> if (result > 0.6) {credibility = 1} else {credibility = 0}
> credibility
[1] 1
```

That done, we can move on to applying the model we just created to the test set, credit_test, and make our first predictions.

```r
> predicted_values <- predict(LogisticModel, type = 'response', newdata = credit_test)
```

We have successfully created and fitted our first model. `predicted_values` is a numeric vector containing the predicted probability for each borrower in the credit_test dataset.

```r
> plot(predicted_values)
```

### Class Labels

The predicted values lie between 0 and 1. We can apply the cutoff (0.6) and assign labels (0 and 1) to all the predicted values.

```r
> pred_value_labels <- rep("0", length(credit_test[,1]))
> pred_value_labels[predicted_values > .6] <- "1"
> pred_value_labels <- as.factor(pred_value_labels)
> pred_value_labels
  [1] 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 1 0 0 0 1
 [50] 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1
 [99] 1 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0
[148] 0 1 1 1 1 0 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0
[197] 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0
[246] 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 0 1 0 0
[295] 1 1 1 0 0 0
Levels: 0 1
```
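The cutoff logic is easy to express in any language. Here is a Python sketch (illustrative only, with made-up probabilities) of mapping a vector of predicted probabilities to "0"/"1" labels at the 0.6 threshold:

```python
# Hypothetical predicted probabilities for five borrowers.
predicted_values = [0.93, 0.41, 0.72, 0.58, 0.66]

# Label "1" when the predicted probability exceeds the 0.6 cutoff, else "0".
pred_value_labels = ["1" if p > 0.6 else "0" for p in predicted_values]
print(pred_value_labels)  # → ['1', '0', '1', '0', '1']
```

The choice of 0.6 is a business decision, not a statistical one; lowering the cutoff labels more borrowers as creditworthy, raising it labels fewer.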

In the next lesson we will learn about how we can statistically measure the performance of the fitted model.
