To learn data visualization with ggplot2 in R, we will be making use of various datasets. However, one interesting dataset that we will be using quite a lot in this section is the German Credit dataset.
The German Credit Data covers 1000 loan applicants. For each applicant it records 20 variables plus a classification of whether the applicant is considered a Good or a Bad credit risk.
When a bank receives a loan application, based on the applicant’s profile the bank has to make a decision regarding whether to go ahead with the loan approval or not. Two types of risks are associated with the bank’s decision:
- If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank
- If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank
In this section, we will explore the dataset using ggplot2 and create both exploratory and explanatory data visualizations. Later, in another course, we will also use this dataset to build a predictive credit risk model. To minimize loss from the bank’s perspective, the bank needs a decision rule for whom to approve for a loan and whom to decline. Loan managers consider an applicant’s demographic and socio-economic profile before deciding on his/her loan application. A predictive model developed on this data is expected to guide a bank manager in deciding whether to approve a loan for a prospective applicant based on his/her profile.
You can download the data from the following link:
Attributes of German Credit Data
Number of Attributes: 20 (7 numerical, 13 categorical).
Attribute | Description | Type |
Status of existing checking account | A11 : < 0 DM; A12 : 0 <= … < 200 DM; A13 : >= 200 DM; A14 : No checking account | Qualitative |
Duration of Credit (in months) | | Numerical |
Credit history | A30 : no credits taken/ all credits paid back duly; A31 : all credits at this bank paid back duly; A32 : existing credits paid back duly till now; A33 : delay in paying off in the past; A34 : critical account/other credits existing (not at this bank) | Qualitative |
Purpose of Loan | A40 : car (new); A41 : car (used); A42 : furniture/equipment; A43 : radio/television; A44 : domestic appliances; A45 : repairs; A46 : education; A47 : vacation; A48 : retraining; A49 : business; A410 : others | Qualitative |
Credit amount | | Numerical |
Savings account/bonds | A61 : < 100 DM; A62 : 100 <= … < 500 DM; A63 : 500 <= … < 1000 DM; A64 : … >= 1000 DM; A65 : unknown/ no savings account | Qualitative |
Present employment since | A71 : unemployed; A72 : … < 1 year; A73 : 1 <= … < 4 years; A74 : 4 <= … < 7 years; A75 : … >= 7 years | Qualitative |
Installment rate (%) | | Numerical |
Personal status and sex | A91 : male : divorced/separated; A92 : female : divorced/separated/married; A93 : male : single; A94 : male : married/widowed; A95 : female : single | Qualitative |
Guarantors | A101 : none; A102 : co-applicant; A103 : guarantor | Qualitative |
Present residence since | | Numerical |
Most valuable available asset | A121 : real estate; A122 : if not A121 : building society savings agreement/ life insurance; A123 : if not A121/A122 : car or other, not in attribute 6; A124 : unknown / no property | Qualitative |
Age in years | | Numerical |
Concurrent Credits | A141 : bank; A142 : stores; A143 : none | Qualitative |
Type of housing | A151 : rent; A152 : own; A153 : for free | Qualitative |
Number of existing credits at this bank | | Numerical |
Job | A171 : unemployed/ unskilled – non-resident; A172 : unskilled – resident; A173 : skilled employee / official; A174 : management/ self-employed/ highly qualified employee/ officer | Qualitative |
No of dependents | | Numerical |
Telephone | A191 : none; A192 : yes, registered under the customer's name | Qualitative |
Foreign Worker | A201 : yes; A202 : no | Qualitative |
Loan Quality | 1 : Good loan; 2 : Bad loan | Qualitative |
While we use this data to learn the techniques of data visualization, we will also be learning other important principles of data science, especially the process of data cleaning. We will work on this data to make it suitable for our analysis and to make the visualizations meaningful.
Data Import and Basic Manipulation
Now that we have the data in CSV format, we will first import it into R as a data frame. I first placed the CSV file in a folder of my choice, then updated my working directory to the folder where the data file is stored. Then I used the read.csv() command to import the CSV data into a data frame called df.
> setwd("C:/Users/FT/Dropbox/FinanceTrain/Courses/Data")
> getwd()
[1] "C:/Users/FT/Dropbox/FinanceTrain/Courses/Data"
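> # In R 4.0.0 and later, stringsAsFactors = TRUE is needed for text columns to be read as factors
> df <- read.csv("german-credit.csv", stringsAsFactors = TRUE)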
The data is now loaded into our R session in the df data frame. We can inspect the structure of the data frame using the str() function.
> str(df)
'data.frame': 1000 obs. of 21 variables:
$ Status.of.existing.checking.account : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
$ Duration.of.Credit..in.months. : int 6 48 12 42 24 36 24 36 12 30 ...
$ Credit.history : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
$ Purpose.of.Loan : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
$ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
$ Savings.account.bonds : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
$ Present.employment.since : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
$ Installment.rate.... : int 4 2 2 2 3 2 3 2 2 4 ...
$ Personal.status.and.sex : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
$ Guarantors : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
$ Present.residence.since : int 4 2 3 4 4 4 4 2 4 2 ...
$ Most.valuable.available.asset : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
$ Age.in.years : int 67 22 49 45 53 35 53 35 61 28 ...
$ Concurrent.Credits : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Type.of.housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
$ Number.of.existing.credits.at.this.bank: int 2 1 1 1 2 1 1 1 1 2 ...
$ Job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
$ No.of.dependents : int 1 1 2 2 2 2 1 1 1 1 ...
$ Telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
$ Foreign.Worker : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
$ Loan.Quality : int 1 2 1 1 2 1 1 1 1 2 ...
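The str() output above shows the text columns as Factor. Whether read.csv() produces factors depends on its stringsAsFactors argument, whose default became FALSE in R 4.0.0. The following is a minimal sketch of the difference. It uses a small temporary CSV with made-up rows rather than the real data file, and the names csv_path, as_chr and as_fct are only illustrative.

```r
# Write a three-row toy CSV to a temporary file (made-up values that
# only imitate the style of the German Credit columns).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Foreign.Worker,Credit.amount",
             "A201,1169",
             "A201,5951",
             "A202,2096"), csv_path)

# Text columns kept as plain character vectors
as_chr <- read.csv(csv_path, stringsAsFactors = FALSE)

# Text columns converted to factors, as in the str() output above
as_fct <- read.csv(csv_path, stringsAsFactors = TRUE)

sapply(as_chr, class)           # class of every column, character version
sapply(as_fct, class)           # class of every column, factor version
levels(as_fct$Foreign.Worker)   # the distinct codes found in the column

unlink(csv_path)                # remove the temporary file
```

sapply(df, class) is also a compact alternative to str() when only the column types are of interest.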
A few observations about the data:
- There are 1000 observations of 21 variables.
- R keeps the numeric data as it is (int).
- R converts the qualitative data into factors. The distinct categories of a factor are called its levels. Internally, R stores each value as an integer code (1, 2, 3, and so on) and maps these codes to the level strings, which are sorted alphabetically unless I reorder them. For example, the variable Foreign.Worker has two levels, namely A201 and A202, which correspond to ‘Yes’ and ‘No’ in our data. Internally the data is stored as 1 and 2.
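The integer coding described above can be seen directly on a small factor. This is a sketch on a toy vector (the values are made up, only the codes match Foreign.Worker), not on the real data frame.

```r
# Toy vector using the same codes as Foreign.Worker (values are made up)
worker <- factor(c("A201", "A201", "A202", "A201"))

levels(worker)       # "A201" "A202"  (sorted alphabetically)
as.integer(worker)   # 1 1 2 1        (the stored integer codes)

# Supplying the levels explicitly changes the order, and with it the codes
worker_rev <- factor(worker, levels = c("A202", "A201"))
as.integer(worker_rev)   # 2 2 1 2
```

The level order matters later because plots typically draw categories in level order.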
Relabeling the Factor Levels
Our factors are labeled using internal codes which are a bit difficult to remember. They are also not very useful in data visualizations. For example, in the case of Foreign.Worker, the levels A201 and A202 are not very intuitive, even though we know that they mean ‘Yes’ and ‘No’ to signify whether the applicant is a foreign worker or not.
In such a situation we can easily rename the levels by supplying a new vector of labels. The following example changes the level names from A201 and A202 to Yes and No.
> levels(df$Foreign.Worker) <- c('Yes','No')
> str(df)
'data.frame': 1000 obs. of 21 variables:
$ Status.of.existing.checking.account : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
$ Duration.of.Credit..in.months. : int 6 48 12 42 24 36 24 36 12 30 ...
$ Credit.history : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
$ Purpose.of.Loan : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
$ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
$ Savings.account.bonds : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
$ Present.employment.since : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
$ Installment.rate.... : int 4 2 2 2 3 2 3 2 2 4 ...
$ Personal.status.and.sex : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
$ Guarantors : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
$ Present.residence.since : int 4 2 3 4 4 4 4 2 4 2 ...
$ Most.valuable.available.asset : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
$ Age.in.years : int 67 22 49 45 53 35 53 35 61 28 ...
$ Concurrent.Credits : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Type.of.housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
$ Number.of.existing.credits.at.this.bank: int 2 1 1 1 2 1 1 1 1 2 ...
$ Job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
$ No.of.dependents : int 1 1 2 2 2 2 1 1 1 1 ...
$ Telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
$ Foreign.Worker : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
$ Loan.Quality : int 1 2 1 1 2 1 1 1 1 2 ...
We can also check the levels of any variable using the levels() function as shown below:
> levels(df$Foreign.Worker)
[1] "Yes" "No"
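Assigning a plain vector to levels() relies on the labels being given in the same order as the existing levels. As a hedged alternative, the mapping from code to label can be written out explicitly. The sketch below shows two base R ways of doing this on a toy vector. The names codes, worker and worker2 are illustrative.

```r
codes <- c("A201", "A201", "A202")   # toy values

# Option 1: state the code-to-label mapping when creating the factor
worker <- factor(codes, levels = c("A201", "A202"), labels = c("Yes", "No"))
as.character(worker)   # "Yes" "Yes" "No"

# Option 2: rename the levels of an existing factor with a named list,
# written as new_label = "old_level"
worker2 <- factor(codes)
levels(worker2) <- list(Yes = "A201", No = "A202")
levels(worker2)        # "Yes" "No"
```

Both forms make it visible which code becomes which label, which helps when a variable has many levels.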
Similarly, we will update the level names for all the qualitative variables in the data frame to suit our requirements. The following script does that.
levels(df$Status.of.existing.checking.account) <- c('< 0 DM','0 - 200 DM', '>= 200 DM', 'No checking account')
levels(df$Credit.history) <- c('No Credits Taken','All Credit Paid', 'Existing Credit Paid','Delay in Payment','Critical Account')
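# Levels sort as text (A40, A41, A410, A42, ...), so 'others' (A410) must come third.
# A47 (vacation) does not occur in the data; it is appended as an unused 11th level.
levels(df$Purpose.of.Loan) <- c('car (new)', 'car (used)', 'others', 'furniture/equipment', 'radio/television', 'domestic appliances', 'repairs', 'education', 'retraining', 'business', 'vacation')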
levels(df$Savings.account.bonds) <- c('< 100 DM', '100 - 500 DM', '500 - 1000 DM', '>= 1000 DM', 'No Savings Account')
levels(df$Present.employment.since) <- c('unemployed', '< 1 year', '1 - 4 years', '4 - 7 years', '>= 7 years')
levels(df$Personal.status.and.sex) <- c('male : divorced/separated', 'female : divorced/separated/married', 'male : single', 'male : married/widowed', 'female : single')
levels(df$Guarantors) <- c('none', 'co-applicant', 'guarantor')
levels(df$Most.valuable.available.asset) <- c('real estate', 'savings agreement/life insurance', 'car or other', 'unknown / no property')
levels(df$Concurrent.Credits) <- c('bank', 'stores', 'none')
levels(df$Type.of.housing) <- c('rent', 'own', 'for free')
levels(df$Job) <- c('unemployed/ unskilled - non-resident', 'unskilled - resident', 'skilled employee / official', 'management/ self-employed')
levels(df$Telephone) <- c('No','Yes')
levels(df$Foreign.Worker) <- c('Yes','No')
The above script will update the level names for all the variables as per the labels provided for us. Let’s inspect the structure again.
> str(df)
'data.frame': 1000 obs. of 21 variables:
$ Status.of.existing.checking.account : Factor w/ 4 levels "< 0 DM","0 - 200 DM",..: 1 2 4 1 1 4 4 2 4 2 ...
$ Duration.of.Credit..in.months. : int 6 48 12 42 24 36 24 36 12 30 ...
$ Credit.history : Factor w/ 5 levels "No Credits Taken",..: 5 3 5 3 4 3 3 3 3 5 ...
$ Purpose.of.Loan : Factor w/ 11 levels "car (new)","car (used)",..: 5 5 8 4 1 8 4 2 5 1 ...
$ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
$ Savings.account.bonds : Factor w/ 5 levels "< 100 DM","100 - 500 DM",..: 5 1 1 1 1 5 3 1 4 1 ...
$ Present.employment.since : Factor w/ 5 levels "unemployed","< 1 year",..: 5 3 4 4 3 3 5 3 4 1 ...
$ Installment.rate.... : int 4 2 2 2 3 2 3 2 2 4 ...
$ Personal.status.and.sex : Factor w/ 5 levels "male : divorced/separated",..: 3 2 3 3 3 3 3 3 1 4 ...
$ Guarantors : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
$ Present.residence.since : int 4 2 3 4 4 4 4 2 4 2 ...
$ Most.valuable.available.asset : Factor w/ 4 levels "real estate",..: 1 1 1 2 4 4 2 3 1 3 ...
$ Age.in.years : int 67 22 49 45 53 35 53 35 61 28 ...
$ Concurrent.Credits : Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Type.of.housing : Factor w/ 3 levels "rent","own","for free": 2 2 2 3 3 3 2 1 2 2 ...
$ Number.of.existing.credits.at.this.bank: int 2 1 1 1 2 1 1 1 1 2 ...
$ Job : Factor w/ 4 levels "unemployed/ unskilled - non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
$ No.of.dependents : int 1 1 2 2 2 2 1 1 1 1 ...
$ Telephone : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 2 1 2 1 1 ...
$ Foreign.Worker : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
$ Loan.Quality : int 1 2 1 1 2 1 1 1 1 2 ...
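One detail worth checking when relabeling by position: factor levels are sorted as text, so the code A410 sorts before A42 (the first str() output lists "A40","A41","A410" for Purpose.of.Loan). Labels assigned by position have to follow that order. A minimal sketch of a name-based lookup that does not depend on the order is shown below. The lookup vector purpose_labels and the toy rows are illustrative, with the labels taken from the attribute table.

```r
# Lookup from code to label, following the attribute table
purpose_labels <- c(
  A40 = "car (new)", A41 = "car (used)", A42 = "furniture/equipment",
  A43 = "radio/television", A44 = "domestic appliances", A45 = "repairs",
  A46 = "education", A47 = "vacation", A48 = "retraining",
  A49 = "business", A410 = "others"
)

# Toy factor with a few made-up rows
purpose <- factor(c("A43", "A46", "A410", "A42", "A40"))
levels(purpose)   # A410 appears before A42 in the sorted levels

# Look up each existing level by its code, whatever position it is in
levels(purpose) <- unname(purpose_labels[levels(purpose)])
as.character(purpose)
# "radio/television" "education" "others" "furniture/equipment" "car (new)"
```

Applied to the real data frame, the same lookup line would be used with df$Purpose.of.Loan in place of purpose.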
Our data is now ready for data visualization.
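One thing to keep in mind before plotting: when more labels are assigned than a factor has levels, R keeps the extra labels as empty levels. This is why Personal.status.and.sex went from 4 levels in the first str() output to 5 after relabeling. The sketch below reproduces this on a toy factor and shows how droplevels() removes unused levels if they are not wanted. The names status and status_clean are illustrative.

```r
# Toy factor with the four codes that occur in the data (made-up rows)
status <- factor(c("A93", "A92", "A93", "A91", "A94"))

# Five labels for four levels: the fifth label has no matching code
levels(status) <- c("male : divorced/separated",
                    "female : divorced/separated/married",
                    "male : single",
                    "male : married/widowed",
                    "female : single")

nlevels(status)   # 5
table(status)     # 'female : single' is listed with a count of 0

# Drop levels that never occur in the data
status_clean <- droplevels(status)
nlevels(status_clean)   # 4
```

Whether to keep or drop empty levels is a judgment call: keeping them documents the full set of possible categories, while dropping them keeps tables and legends shorter.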