Data Import and Basic Manipulation in R - German Credit Dataset
To learn data visualization with ggplot2
in R, we will be making use of various datasets. However, one interesting dataset that we will be using quite a lot in this section is the German Credit dataset.
The German Credit Data contains data on 20 variables and the classification whether an applicant is considered a Good or a Bad credit risk for 1000 loan applicants.
When a bank receives a loan application, based on the applicant’s profile the bank has to make a decision regarding whether to go ahead with the loan approval or not. Two types of risks are associated with the bank’s decision:
- If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank
- If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank
In this section, we will explore the dataset using ggplot2 and create both exploratory as well as explanatory data visualizations. However, later in another course we will also use this dataset to build a predictive credit risk model. To minimize loss from the bank’s perspective, the bank needs a decision rule regarding who to give approval of the loan and who not to. An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application. A predictive model developed on this data is expected to provide a bank manager guidance for making a decision whether to approve a loan to a prospective applicant based on his/her profiles.
You can download the data from the following link:
Attributes of German Credit Data
Number of Attributes: 20 (7 numerical, 13 categorical).
Attribute | Description | Type |
Status of existing checking account | A11 : < 0 DMA12 : 0 <= ... < 200 DMA13 : >= 200 DMA14 : No checking account | Qualitative |
Duration of Credit (in months) | Numerical | |
Credit history | A30 : no credits taken/ all credits paid back dulyA31 : all credits at this bank paid back dulyA32 : existing credits paid back duly till nowA33 : delay in paying off in the pastA34 : critical account/other credits existing (not at this bank) | Qualitative |
Purpose of Loan | A40 : car (new)A41 : car (used)A42 : furniture/equipmentA43 : radio/televisionA44 : domestic appliancesA45 : repairsA46 : educationA47 : vacationA48 : retrainingA49 : businessA410 : others | Qualitative |
Credit amount | Numerical | |
Savings account/bonds | A61 : < 100 DMA62 : 100 <= ... < 500 DMA63 : 500 <= ... < 1000 DMA64 : ... >= 1000 DMA65 : unknown/ no savings account | Qualitative |
Present employment since | A71 : unemployedA72 : ... < 1 yearA73 : 1 <= ... < 4 yearsA74 : 4 <= ... < 7 yearsA75 : .. >= 7 years | Qualitative |
Installment rate (%) | Numerical | |
Personal status and sex | A91 : male : divorced/separatedA92 : female : divorced/separated/marriedA93 : male : singleA94 : male : married/widowedA95 : female : single | Qualitative |
Guarantors | A101 : noneA102 : co-applicantA103 : guarantor | Qualitative |
Present residence since | Numerical | |
Most valuable available asset | A121 : real estateA122 : if notA121 : building society savings agreement/ life insuranceA123 : if not A121/A122 : car or other, not in attribute 6A124 : unknown / no property | Qualitative |
Age in years | Numerical | |
Concurrent Credits | A141 : bankA142 : storesA143 : none | Qualitative |
Type of housing | A151 : rentA152 : ownA153 : for free | Qualitative |
Number of existing credits at this bank | Numerical | |
Job | A171 : unemployed/ unskilled - non-residentA172 : unskilled - residentA173 : skilled employee / officialA174 : management/ self-employed/ highly qualified employee/ officer | Qualitative |
No of dependents | Numerical | |
Telephone | A191 : noneA192 : yes, registered under the customers name | Qualitative |
Foreign Worker | A201 : yesA202 : no | Qualitative |
Loan Quality | 1 : Bad loan2 : Good loan | Qualitative |
While we use this data to learn the techniques of data visualizations, we will also be learning other important principles of data science, specially the process of data cleaning. We will work on this data to make it suitable for our analysis and to make visualizations meaningful.
Data Import and Basic Manipulation
Now that we have the data in CSV format, we will first import it into R as a data frame. I first placed the CSV file in a folder of my choice, then updated my working directory to the folder where the data file is stored. Then I used the read.csv()
command to import the CSV data into a data frame called df
.
1> setwd("C:/Users/FT/Dropbox/FinanceTrain/Courses/Data")
2> getwd()
3[1] "C:/Users/FT/Dropbox/FinanceTrain/Courses/Data"
4> df <- read.csv("german-credit.csv")
5
The data is now loaded into our R session in df
dataframe. We can inspect the structure of the dataframe using the str()
function.
1> str(df)
2'data.frame': 1000 obs. of 21 variables:
3 $ Status.of.existing.checking.account : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
4 $ Duration.of.Credit..in.months. : int 6 48 12 42 24 36 24 36 12 30 ...
5 $ Credit.history : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
6 $ Purpose.of.Loan : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
7 $ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
8 $ Savings.account.bonds : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
9 $ Present.employment.since : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
10 $ Installment.rate.... : int 4 2 2 2 3 2 3 2 2 4 ...
11 $ Personal.status.and.sex : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
12 $ Guarantors : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
13 $ Present.residence.since : int 4 2 3 4 4 4 4 2 4 2 ...
14 $ Most.valuable.available.asset : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
15 $ Age.in.years : int 67 22 49 45 53 35 53 35 61 28 ...
16 $ Concurrent.Credits : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
17 $ Type.of.housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
18 $ Number.of.existing.credits.at.this.bank: int 2 1 1 1 2 1 1 1 1 2 ...
19 $ Job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
20 $ No.of.dependents : int 1 1 2 2 2 2 1 1 1 1 ...
21 $ Telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
22 $ Foreign.Worker : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
23 $ Loan.Quality : int 1 2 1 1 2 1 1 1 1 2 ...
24
A few observations about the data:
- There are 1000 observations of 21 variables.
- R keeps the numeric data as it is (int)
- R converts the qualitative data in factors (categories). These are also called levels. Internally, R stores the integer values 1, 2, and 3, and maps the character strings (in alphabetical order, unless I reorder) to these values. For example, the variable
Foreign.Worker
has two levels, namely, A201 and A202 which correspond to 'Yes' and No' in our data. Internally data is stored as 1 and 2.
Relabeling the Factor Levels
Our factors are labeled using some internal codes which are a bit difficult to remember. They are also not very useful data visualizations. For example, in case of Foreign.Worker, the levels A201 and A202 not very intuitive even though we know that they mean 'Yes and 'No' to signify whether it is a foreign worker or not.
In such situation we can easily rename the levels by supplying a new vector of labels. The following example shows changing the level names from A201 and A202 to Yes and No.
1> levels(df$Foreign.Worker) <- c('Yes','No')
2> str(df)
3'data.frame': 1000 obs. of 21 variables:
4 $ Status.of.existing.checking.account : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
5 $ Duration.of.Credit..in.months. : int 6 48 12 42 24 36 24 36 12 30 ...
6 $ Credit.history : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
7 $ Purpose.of.Loan : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
8 $ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
9 $ Savings.account.bonds : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
10 $ Present.employment.since : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
11 $ Installment.rate.... : int 4 2 2 2 3 2 3 2 2 4 ...
12 $ Personal.status.and.sex : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
13 $ Guarantors : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
14 $ Present.residence.since : int 4 2 3 4 4 4 4 2 4 2 ...
15 $ Most.valuable.available.asset : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
16 $ Age.in.years : int 67 22 49 45 53 35 53 35 61 28 ...
17 $ Concurrent.Credits : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
18 $ Type.of.housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
19 $ Number.of.existing.credits.at.this.bank: int 2 1 1 1 2 1 1 1 1 2 ...
20 $ Job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
21 $ No.of.dependents : int 1 1 2 2 2 2 1 1 1 1 ...
22 $ Telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
23 $ Foreign.Worker : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
24 $ Loan.Quality : int 1 2 1 1 2 1 1 1 1 2 ...
25
We can also check the levels of any variable using the levels()
function as shown below:
1> levels(df$Foreign.Worker)
2[1] "Yes" "No"
3
Similarly, we will update the level names for all the qualitative variables in the dataframe to suite our requirements. The following script does that.
1levels(df$Status.of.existing.checking.account) <- c('< 0 DM','0 - 200 DM', '>= 200 DM', 'No checking account')
2levels(df$Credit.history) <- c('No Credits Taken','All Credit Paid', 'Existing Credit Paid','Delay in Payment','Critical Account')
3levels(df$Purpose.of.Loan) <- c('car (new)', 'car (used)', 'furniture/equipment', 'radio/television', 'domestic appliances', 'repairs', 'education', 'vacation', 'retraining', 'business', 'others')
4levels(df$Savings.account.bonds) <- c('< 100 DM', '100 - 500 DM', '500 - 1000 DM', '>= 1000 DM', 'No Savings Account')
5levels(df$Present.employment.since) <- c('unemployed', '< 1 year', '1 - 4 years', '4 - 7 years', '>= 7 years')
6levels(df$Personal.status.and.sex) <- c('male : divorced/separated', 'female : divorced/separated/married', 'male : single', 'male : married/widowed', 'female : single')
7levels(df$Guarantors) <- c('none', 'co-applicant', 'guarantor')
8levels(df$Most.valuable.available.asset) <- c('real estate', 'savings agreement/life insurance', 'car or other', 'unknown / no property')
9levels(df$Concurrent.Credits) <- c('bank', 'stores', 'none')
10levels(df$Type.of.housing) <- c('rent', 'own', 'for free')
11levels(df$Job) <- c('unemployed/ unskilled - non-resident', 'unskilled - resident', 'skilled employee / official', 'management/ self-employed')
12levels(df$Telephone) <- c('No','Yes')
13levels(df$Foreign.Worker) <- c('Yes','No')
14
The above script will update the level names for all the variables as per the labels provided for us. Let's inspect the structure again.
1> str(df)
2'data.frame': 1000 obs. of 21 variables:
3 $ Status.of.existing.checking.account : Factor w/ 4 levels "< 0 DM","0 - 200 DM",..: 1 2 4 1 1 4 4 2 4 2 ...
4 $ Duration.of.Credit..in.months. : int 6 48 12 42 24 36 24 36 12 30 ...
5 $ Credit.history : Factor w/ 5 levels "No Credits Taken",..: 5 3 5 3 4 3 3 3 3 5 ...
6 $ Purpose.of.Loan : Factor w/ 11 levels "car (new)","car (used)",..: 5 5 8 4 1 8 4 2 5 1 ...
7 $ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
8 $ Savings.account.bonds : Factor w/ 5 levels "< 100 DM","100 - 500 DM",..: 5 1 1 1 1 5 3 1 4 1 ...
9 $ Present.employment.since : Factor w/ 5 levels "unemployed","< 1 year",..: 5 3 4 4 3 3 5 3 4 1 ...
10 $ Installment.rate.... : int 4 2 2 2 3 2 3 2 2 4 ...
11 $ Personal.status.and.sex : Factor w/ 5 levels "male : divorced/separated",..: 3 2 3 3 3 3 3 3 1 4 ...
12 $ Guarantors : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
13 $ Present.residence.since : int 4 2 3 4 4 4 4 2 4 2 ...
14 $ Most.valuable.available.asset : Factor w/ 4 levels "real estate",..: 1 1 1 2 4 4 2 3 1 3 ...
15 $ Age.in.years : int 67 22 49 45 53 35 53 35 61 28 ...
16 $ Concurrent.Credits : Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
17 $ Type.of.housing : Factor w/ 3 levels "rent","own","for free": 2 2 2 3 3 3 2 1 2 2 ...
18 $ Number.of.existing.credits.at.this.bank: int 2 1 1 1 2 1 1 1 1 2 ...
19 $ Job : Factor w/ 4 levels "unemployed/ unskilled - non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
20 $ No.of.dependents : int 1 1 2 2 2 2 1 1 1 1 ...
21 $ Telephone : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 2 1 2 1 1 ...
22 $ Foreign.Worker : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
23 $ Loan.Quality : int 1 2 1 1 2 1 1 1 1 2 ...
24>
25
Our data is now ready for data visualization.
Unlock Premium Content
Upgrade your account to access the full article, downloads, and exercises.
You'll get access to:
- Access complete tutorials and examples
- Download source code and resources
- Follow along with practical exercises
- Get in-depth explanations