German Credit Data: Data Preprocessing and Feature Selection in R


The purpose of preprocessing is to make your raw data suitable for data science algorithms. For example, we may want to remove outliers, or to impute (fill in) missing values.

The dataset that we have selected does not have any missing data. In practice, however, a dataset may have many missing values, which need to be replaced with plausible values estimated from the complete data that is available. One common technique for this is k-nearest neighbours imputation, which fills each gap using the most similar complete records.
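As a rough illustration of the idea, here is a minimal base-R sketch of k-nearest neighbours imputation on a small made-up data frame. The helper name `knn_impute`, the toy values, and the choice of the neighbours' mean as the fill value are all assumptions for this example. It is not a packaged implementation, and it skips feature scaling, which a real analysis would probably need.

```r
# Hypothetical helper: fill missing numeric values from the k nearest complete rows.
knn_impute <- function(df, k = 3) {
  m <- as.matrix(df)
  complete <- m[complete.cases(m), , drop = FALSE]
  for (i in which(!complete.cases(m))) {
    miss <- is.na(m[i, ])
    # Euclidean distance to every complete row, using only the observed columns
    # (no scaling is applied in this sketch)
    diffs <- sweep(complete[, !miss, drop = FALSE], 2, m[i, !miss])
    d <- sqrt(rowSums(diffs^2))
    nearest <- order(d)[seq_len(min(k, nrow(complete)))]
    # Replace each missing value with the mean of the neighbours' values
    m[i, miss] <- colMeans(complete[nearest, miss, drop = FALSE])
  }
  as.data.frame(m)
}

# Toy data (made up): the last row has a missing amount
toy <- data.frame(
  age    = c(21, 36, 23, 39, 38, 22),
  amount = c(1049, 2799, 841, 2122, 2171, NA)
)
imputed <- knn_impute(toy, k = 2)
```

With `k = 2`, the missing amount is filled from the two rows whose age is closest to 22. For integer-coded categories, the most frequent neighbour value would usually be a better fill than the mean.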

In this case study, we will not focus much on preprocessing and just use the data as it is. However, we will make a few minor transformations.

Let's take a quick look at the data using the str() function.

> str(creditdata)
'data.frame':	1000 obs. of  21 variables:
 $ Creditability                    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Account.Balance                  : int  1 1 2 1 1 1 1 1 4 2 ...
 $ Duration.of.Credit..month.       : int  18 9 12 12 12 10 8 6 18 24 ...
 $ Payment.Status.of.Previous.Credit: int  4 4 2 4 4 4 4 4 4 2 ...
 $ Purpose                          : int  2 0 9 0 0 0 0 0 3 3 ...
 $ Credit.Amount                    : int  1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
 $ Value.Savings.Stocks             : int  1 1 2 1 1 1 1 1 1 3 ...
 $ Length.of.current.employment     : int  2 3 4 3 3 2 4 2 1 1 ...
 $ Instalment.per.cent              : int  4 2 2 3 4 1 1 2 4 1 ...
 $ Sex...Marital.Status             : int  2 3 2 3 3 3 3 3 2 2 ...
 $ Guarantors                       : int  1 1 1 1 1 1 1 1 1 1 ...
 $      : int  4 2 4 2 4 3 4 4 4 4 ...
 $ Most.valuable.available.asset    : int  2 1 1 1 2 1 1 1 3 4 ...
 $ Age..years.                      : int  21 36 23 39 38 48 39 40 65 23 ...
 $ Concurrent.Credits               : int  3 3 3 3 1 3 3 3 3 3 ...
 $ Type.of.apartment                : int  1 1 1 1 2 1 2 2 2 1 ...
 $       : int  1 2 1 2 2 2 2 1 2 1 ...
 $ Occupation                       : int  3 3 2 2 2 2 2 2 1 1 ...
 $ No.of.dependents                 : int  1 2 1 2 1 2 1 2 1 1 ...
 $ Telephone                        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Foreign.Worker                   : int  1 1 1 2 2 2 2 2 1 1 ...
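Alongside str(), two quick checks can help confirm that no values are missing and show which columns look like category codes rather than continuous measurements. The sketch below uses a small stand-in data frame named `toy`, built from the first few values printed above, because the full creditdata object is not reproduced here. The variable names `any_missing` and `n_distinct` are illustrative.

```r
# Toy stand-in for creditdata, using the first five values of three columns
toy <- data.frame(
  Creditability   = c(1, 1, 1, 1, 1),
  Account.Balance = c(1, 1, 2, 1, 1),
  Credit.Amount   = c(1049, 2799, 841, 2122, 2171)
)

# TRUE if any cell in the data frame is NA
any_missing <- anyNA(toy)

# Number of distinct values per column; coded categories tend to have few
n_distinct <- sapply(toy, function(x) length(unique(x)))
```

On the real data, columns with only a handful of distinct values are the likely candidates for conversion to factors.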

Remove Truly Numeric Data

Look at the three columns: “Duration of Credit (months)”, “Credit Amount”, and “Age”.

These columns hold absolute numbers rather than category codes. In this case study we focus instead on the categorical attributes that may affect the creditworthiness of the applicant. One example is whether the applicant has a phone: in Germany, at the time this data was published, phone rates were high, so only a few people could afford a phone at home. Similarly, an applicant’s marital status has financial consequences. So, we will remove the columns that contain absolute numeric data and keep only the categorical factors.

We remove “Duration of Credit (months)”, “Credit Amount”, and “Age” by creating an index vector, S, that lists every column position except theirs (columns 3, 6, and 14).

> S <- c(1, 2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21)
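Typing the positions by hand is easy to get wrong, so one possible alternative is to derive the same vector by excluding the three unwanted positions. The sketch below assumes the 21-column layout printed by str() above. The names `drop_cols`, `S_auto`, and `S_manual` are illustrative only.

```r
# Positions of Duration of Credit, Credit Amount, and Age in the str() output
drop_cols <- c(3, 6, 14)

# All 21 column positions minus the ones to drop
S_auto <- setdiff(seq_len(21), drop_cols)

# The hand-typed vector from the text, for comparison
S_manual <- c(1, 2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21)

same <- length(S_auto) == length(S_manual) && all(S_auto == S_manual)
```

Here `same` confirms that both approaches select the same 18 columns.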

Convert Integers to Factors

In our data set, the values of these attributes are still stored as integers. We will loop over the selected columns to convert them from integers to factors, and then keep only those columns in a new data frame.

> for(i in S) creditdata[, i] <- as.factor(creditdata[, i])
> creditdata_new <- creditdata[,S]
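The same conversion can also be written without an explicit loop. The following is a minimal sketch on a made-up three-column data frame; the `toy` values and the `keep` index are assumptions chosen for illustration. It uses `lapply()` to convert the selected columns and then checks that every retained column is a factor.

```r
# Toy data (made up): two coded categorical columns and one numeric column
toy <- data.frame(
  Creditability   = c(1L, 1L, 0L, 1L),
  Account.Balance = c(1L, 2L, 4L, 2L),
  Credit.Amount   = c(1049L, 2799L, 841L, 2122L)
)

# Keep only the categorical columns, mirroring the role of S
keep <- c(1, 2)

# Convert the selected columns to factors in one step
toy[keep] <- lapply(toy[keep], as.factor)
toy_new <- toy[, keep]

# Verify the result: every retained column should now be a factor
all_factors <- all(sapply(toy_new, is.factor))
balance_levels <- levels(toy_new$Account.Balance)
```

Checking `levels()` afterwards is a quick way to confirm that each integer code became its own category.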

Now that we have the data in useful shape, we can begin to apply different analytical methods.