Logistic Regression Model in R
Logistic regression aims to model the probability of an event occurring depending on the values of independent variables.
The logistic regression model seeks to estimate the probability that an event (default) will occur for a randomly selected observation versus the probability that the event does not occur. Suppose we have data for 1000 loans, including all the predictor variables and whether or not the borrower defaulted. Here the default is the response variable, or dependent variable. It is binary, that is, its value is either 0 or 1 (0 is no default, and 1 is default), and the model estimates the probability that it equals 1.
In logistic regression, the dependent variable is binary, i.e. it only contains data marked as 1 (Default) or 0 (No default).
We can say that logistic regression is a classification algorithm used to predict a binary outcome (1 / 0, Default / No Default) given a set of independent variables. It is a generalized linear model that extends linear regression to a categorical outcome variable. It predicts the probability of a default by modelling the log-odds (logit) of the outcome as a linear function of the predictors.
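To make the link between a linear score and a probability concrete, here is a minimal sketch of the logistic (inverse-logit) function. The names logistic, scores, and probs are illustrative and not part of the case study.

```r
# Logistic (inverse-logit) function: maps any real-valued score into (0, 1).
logistic <- function(z) 1 / (1 + exp(-z))

# Illustrative scores on the log-odds scale.
scores <- c(-4, -1, 0, 1, 4)
probs <- logistic(scores)
probs

# Base R's plogis() computes the same quantity.
all.equal(probs, plogis(scores))
```

A score of 0 corresponds to a probability of 0.5; large negative scores approach 0 and large positive scores approach 1.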
R is a versatile language and there are many packages that we can use to perform logistic regression. In this case study we will use the glm() function in R. R also has a very useful package called caret (short for classification and regression training), which streamlines the process of training models using different algorithms. We will use the caret package in the second case study.
We will train the model using the training data set, credit_train. Just to keep the whole process simple, we will use only a handful of selected variables in the model. To start with let's just use five variables to determine the value of Creditability.
> set.seed(1)
> LogisticModel <- glm(Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit + Purpose + Length.of.current.employment + Sex...Marital.Status, family = binomial, data = credit_train)
Let’s look into what this model contains:
> LogisticModel

Call:  glm(formula = Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit +
    Purpose + Length.of.current.employment + Sex...Marital.Status,
    family = binomial, data = credit_train)

Coefficients:
                       (Intercept)                    Account.Balance2
                           -2.1396                              0.5810
                  Account.Balance3                    Account.Balance4
                            1.0231                              1.8129
Payment.Status.of.Previous.Credit1  Payment.Status.of.Previous.Credit2
                            0.2335                              1.1462
Payment.Status.of.Previous.Credit3  Payment.Status.of.Previous.Credit4
                            0.9977                              1.8753
                          Purpose1                            Purpose2
                            0.8587                              0.3922
                          Purpose3                            Purpose4
                            0.6738                              0.2800
                          Purpose5                            Purpose6
                            0.4248                             -0.5447
                          Purpose8                            Purpose9
                            1.9817                              0.1309
                         Purpose10       Length.of.current.employment2
                            0.3163                             -0.1731
     Length.of.current.employment3       Length.of.current.employment4
                            0.1855                              0.5960
     Length.of.current.employment5               Sex...Marital.Status2
                            0.2057                              0.2572
             Sex...Marital.Status3               Sex...Marital.Status4
                            0.4461                              0.5433

Degrees of Freedom: 699 Total (i.e. Null);  676 Residual
Null Deviance:     871.5
Residual Deviance: 723.3    AIC: 771.3
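The coefficient names above (Account.Balance2, Account.Balance3, Account.Balance4, with no Account.Balance1) suggest that the predictors are factors and that glm() has expanded each one into dummy columns, using the first level as the baseline. The sketch below illustrates that expansion on a made-up four-level factor; toy and X are hypothetical names, and the levels are assumed to mirror those of Account.Balance.

```r
# Hypothetical toy data with the same four levels as Account.Balance.
toy <- data.frame(Account.Balance = factor(c(1, 2, 3, 4)))

# model.matrix() shows the dummy (treatment-contrast) columns built for a factor.
# Level 1 is the baseline, so it gets no column of its own.
X <- model.matrix(~ Account.Balance, data = toy)
colnames(X)
X
```

Under this coding, each coefficient is the change in log-odds relative to the baseline level of that variable, with the other variables held fixed.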
Mathematically this can be represented as follows:
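Assuming the standard logistic form (the same form used in the worked calculation later in this section), with X_1, ..., X_k denoting the dummy-coded predictors and B_0, ..., B_k the fitted coefficients:

```latex
P(\text{Creditability} = 1 \mid X_1, \dots, X_k)
  = \frac{1}{1 + e^{-(B_0 + B_1 X_1 + B_2 X_2 + \dots + B_k X_k)}}
```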
In our equation, B0 = -2.1396, B1 = 0.5810, and so on.
We can now use this model to predict the value of Creditability for new data. Let's say we have a new applicant, and let's make up some feature values for this applicant:
P(Creditability = 1 | Account.Balance = 4, Payment.Status.of.Previous.Credit = 2, Purpose = 1, Length.of.current.employment = 4, Sex...Marital.Status = 2)
= 1 / (1 + e^-(-2.1396 + 1.8129*1 + 1.1462*1 + 0.8587*1 + 0.5960*1 + 0.2572*1))
= 0.9263139694
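The same arithmetic can be checked in R. This is a small sketch that copies the rounded coefficients from the printed model output; intercept, active, log_odds, and p are illustrative names.

```r
# Coefficients copied (rounded to 4 decimals) from the printed model output.
intercept <- -2.1396
active <- c(Account.Balance4                   = 1.8129,
            Payment.Status.of.Previous.Credit2 = 1.1462,
            Purpose1                           = 0.8587,
            Length.of.current.employment4      = 0.5960,
            Sex...Marital.Status2              = 0.2572)

# Linear predictor (log-odds): intercept plus the coefficient of each active dummy.
log_odds <- intercept + sum(active)

# Inverse logit: 1 / (1 + exp(-log_odds)).
p <- plogis(log_odds)
round(p, 4)
```

Because the coefficients here are rounded, the result can differ from a prediction made with the unrounded model in the last few decimal places.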
We can also use the predict() function to do the same job. General form: predict(model, newdata = new.df)
> newapplicant <- data.frame(Account.Balance=as.factor(4), Payment.Status.of.Previous.Credit=as.factor(2), Purpose=as.factor(1), Length.of.current.employment=as.factor(4), Sex...Marital.Status=as.factor(2))
> result <- predict(LogisticModel, type = 'response', newdata = newapplicant)
> result
        1
0.9263105
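With type = 'response', predict() returns the predicted probability; the default, type = 'link', returns the log-odds instead. predict() for a glm model has no 'class' type, so it does not assign labels for us. Working with the probability lets us choose our own threshold. We can pick a threshold value, say 0.6, and say that if result > 0.6, credibility = 1, else credibility = 0.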
> if (result > 0.6) {credibility = 1} else {credibility = 0}
> credibility
[1] 1
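The same threshold rule extends to many predictions at once. A minimal sketch, using made-up probabilities (probs, cutoff, and labels are illustrative names, not from the case study):

```r
# Hypothetical predicted probabilities, for illustration only.
probs <- c(0.93, 0.41, 0.60, 0.75)
cutoff <- 0.6

# ifelse() applies the rule element-wise.
# A value exactly equal to the cutoff is labelled 0 because the test is strict (>).
labels <- ifelse(probs > cutoff, 1, 0)
labels
```

Whether the boundary case should fall on the 0 or the 1 side is a modelling choice; use >= instead of > if it should be labelled 1.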
That done, we can move on to applying the model we just fitted to the test set, credit_test, to predict the outcome for borrowers the model has not seen.
> predicted_values <- predict(LogisticModel, type = 'response', newdata = credit_test)
We have now fitted our first model and used it to score the test set. predicted_values is a named numeric vector holding the predicted probability for each borrower in the credit_test dataset.
> plot(predicted_values)
Class Labels
The predicted values lie between 0 and 1. We can apply the cutoff (0.6) to assign a label (0 or 1) to each predicted value.
> pred_value_labels = rep("0", nrow(credit_test))
> pred_value_labels[predicted_values > .6] = "1"
> pred_value_labels <- as.factor(pred_value_labels)
> pred_value_labels
  [1] 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 1 0 0 0 1
 [50] 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1
 [99] 1 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0
[148] 0 1 1 1 1 0 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0
[197] 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0
[246] 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 0 1 0 0
[295] 1 1 1 0 0 0
Levels: 0 1
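For readers who do not have credit_train and credit_test at hand, the whole sequence (fit, predict, label) can be tried on simulated data. This is only a sketch under stated assumptions: the data frame sim, its columns x1, x2, and y, and the coefficients used to simulate y are all made up, and the 700/300 split simply mirrors the row counts implied by the output above.

```r
set.seed(1)

# Simulated stand-in for the credit data (hypothetical columns x1, x2, y).
n <- 1000
sim <- data.frame(x1 = rnorm(n),
                  x2 = factor(sample(1:3, n, replace = TRUE)))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1 + 0.4 * (sim$x2 == 3)))

# 700 training rows and 300 test rows.
train_idx <- sample(n, 700)
sim_train <- sim[train_idx, ]
sim_test  <- sim[-train_idx, ]

# Fit on the training rows, then predict probabilities for the test rows.
fit <- glm(y ~ x1 + x2, family = binomial, data = sim_train)
probs <- predict(fit, newdata = sim_test, type = "response")

# Label with the 0.6 cutoff; fixing the levels keeps both labels
# in the factor even if one of them never occurs.
labels <- factor(as.integer(probs > 0.6), levels = c(0, 1))
table(labels)
```

Building the factor with explicit levels avoids a subtle problem with as.factor(): if every prediction fell on one side of the cutoff, the resulting factor would have only one level.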
In the next lesson we will learn about how we can statistically measure the performance of the fitted model.