In this case study, we will not spend much time on preprocessing and will use the data largely as it is, with only a few minor transformations.
Let's take a quick look at the data using the str() function.
> str(creditdata)
'data.frame': 1000 obs. of 21 variables:
 $ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
 $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
 $ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
 $ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
 $ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
 $ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
 $ Value.Savings.Stocks : int 1 1 2 1 1 1 1 1 1 3 ...
 $ Length.of.current.employment : int 2 3 4 3 3 2 4 2 1 1 ...
 $ Instalment.per.cent : int 4 2 2 3 4 1 1 2 4 1 ...
 $ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
 $ Guarantors : int 1 1 1 1 1 1 1 1 1 1 ...
 $ Duration.in.Current.address : int 4 2 4 2 4 3 4 4 4 4 ...
 $ Most.valuable.available.asset : int 2 1 1 1 2 1 1 1 3 4 ...
 $ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
 $ Concurrent.Credits : int 3 3 3 3 1 3 3 3 3 3 ...
 $ Type.of.apartment : int 1 1 1 1 2 1 2 2 2 1 ...
 $ No.of.Credits.at.this.Bank : int 1 2 1 2 2 2 2 1 2 1 ...
 $ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
 $ No.of.dependents : int 1 2 1 2 1 2 1 2 1 1 ...
 $ Telephone : int 1 1 1 1 1 1 1 1 1 1 ...
 $ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...
Remove Truly Numeric Data
Look at the three columns: “Duration of Credit (months)”, “Credit Amount”, and “Age”.
These are all absolute numbers. What we want to focus on instead are the attributes that have an impact on the creditworthiness of the applicant, for example whether the applicant has a phone. In Germany, at the time this data was published, phone rates were high, so only a few people could afford a phone at home. Similarly, an applicant’s marital status has financial consequences. So we will remove the columns that contain absolute numeric data and focus only on categorical factors.
We remove “Duration of Credit (months)”, “Credit Amount”, and “Age” by creating an index vector, S, that lists the positions of the columns to keep, that is, every column except 3, 6, and 14.
> S <- c(1, 2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21)
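Positional indices are easy to miscount, so it can help to derive them from the column names instead. The following is a minimal sketch of that idea, not part of the original workflow. It runs on a small stand-in data frame that reuses a few column names and leading values from the str() output; creditdata_demo, numeric_cols, and keep are hypothetical names introduced only for illustration.

```r
# Toy stand-in for creditdata: a few of the columns listed by str() above
creditdata_demo <- data.frame(
  Creditability              = c(1L, 1L, 1L),
  Account.Balance            = c(1L, 1L, 2L),
  Duration.of.Credit..month. = c(18L, 9L, 12L),
  Credit.Amount              = c(1049L, 2799L, 841L),
  Age..years.                = c(21L, 36L, 23L),
  Telephone                  = c(1L, 1L, 1L)
)

# Columns holding absolute numeric values that we want to leave out
numeric_cols <- c("Duration.of.Credit..month.", "Credit.Amount", "Age..years.")

# Positions of every column whose name is not in numeric_cols
keep <- which(!names(creditdata_demo) %in% numeric_cols)
print(keep)
```

Assuming the column names of the full data set match the str() output, running the same two steps on creditdata should give the same positions as S.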
Convert Integers to Factors
In our data set, the values for these attributes are still stored as integers. We will loop over the selected columns to convert them from integers to factors, then keep only those columns in a new data frame.
> for(i in S) creditdata[, i] <- as.factor(creditdata[, i])
> creditdata_new <- creditdata[,S]
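The loop above can also be written with lapply(), and it is worth confirming afterwards that the conversion worked. The sketch below shows one way to do both on a small stand-in data frame. The values are illustrative, and creditdata_demo, keep, and demo_new are hypothetical names, with keep playing the role of S.

```r
# Toy stand-in for creditdata with illustrative values
creditdata_demo <- data.frame(
  Creditability   = c(1L, 1L, 0L),
  Account.Balance = c(1L, 2L, 4L),
  Credit.Amount   = c(1049L, 2799L, 841L),
  Telephone       = c(1L, 2L, 1L)
)

# Hypothetical index vector: every column except Credit.Amount
keep <- c(1, 2, 4)

# Convert the selected columns to factors in one step
creditdata_demo[keep] <- lapply(creditdata_demo[keep], as.factor)

# Keep only the converted columns
demo_new <- creditdata_demo[, keep]

# Check the result: every remaining column should now be a factor
str(demo_new)
print(sapply(demo_new, is.factor))
```

The excluded column stays an integer in the original data frame, and only the subset carries the factor columns forward.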
Now that we have the data in useful shape, we can begin to apply different analytical methods.