Finance Train

High Quality tutorials for finance, risk, data science

Data Import and Basic Manipulation in R – German Credit Dataset

This lesson is part 20 of 29 in the course Data Visualization with R

To learn data visualization with ggplot2 in R, we will be making use of various datasets. However, one interesting dataset that we will be using quite a lot in this section is the German Credit dataset.

The German Credit dataset covers 1000 loan applicants. For each applicant it records 20 variables along with a classification of whether the applicant is considered a Good or a Bad credit risk.

When a bank receives a loan application, based on the applicant’s profile the bank has to make a decision regarding whether to go ahead with the loan approval or not. Two types of risks are associated with the bank’s decision:

  • If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank
  • If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank

In this section, we will explore the dataset using ggplot2 and create both exploratory and explanatory data visualizations. Later, in another course, we will also use this dataset to build a predictive credit risk model. To minimize losses, the bank needs a decision rule for which loans to approve and which to reject. Loan managers consider an applicant’s demographic and socio-economic profile before deciding on a loan application. A predictive model built on this data is expected to guide a bank manager in deciding whether to approve a loan for a prospective applicant based on that profile.

You can download the data from the following link:

German Credit Dataset (Download)

Attributes of German Credit Data

Number of Attributes: 20 (7 numerical, 13 categorical).

Attribute | Description | Type
Status of existing checking account | A11 : < 0 DM; A12 : 0 <= … < 200 DM; A13 : >= 200 DM; A14 : No checking account | Qualitative
Duration of Credit (in months) | | Numerical
Credit history | A30 : no credits taken/ all credits paid back duly; A31 : all credits at this bank paid back duly; A32 : existing credits paid back duly till now; A33 : delay in paying off in the past; A34 : critical account/other credits existing (not at this bank) | Qualitative
Purpose of Loan | A40 : car (new); A41 : car (used); A42 : furniture/equipment; A43 : radio/television; A44 : domestic appliances; A45 : repairs; A46 : education; A47 : vacation; A48 : retraining; A49 : business; A410 : others | Qualitative
Credit amount | | Numerical
Savings account/bonds | A61 : < 100 DM; A62 : 100 <= … < 500 DM; A63 : 500 <= … < 1000 DM; A64 : … >= 1000 DM; A65 : unknown/ no savings account | Qualitative
Present employment since | A71 : unemployed; A72 : … < 1 year; A73 : 1 <= … < 4 years; A74 : 4 <= … < 7 years; A75 : … >= 7 years | Qualitative
Installment rate (%) | | Numerical
Personal status and sex | A91 : male : divorced/separated; A92 : female : divorced/separated/married; A93 : male : single; A94 : male : married/widowed; A95 : female : single | Qualitative
Guarantors | A101 : none; A102 : co-applicant; A103 : guarantor | Qualitative
Present residence since | | Numerical
Most valuable available asset | A121 : real estate; A122 : if not A121 : building society savings agreement/ life insurance; A123 : if not A121/A122 : car or other, not in attribute 6; A124 : unknown / no property | Qualitative
Age in years | | Numerical
Concurrent Credits | A141 : bank; A142 : stores; A143 : none | Qualitative
Type of housing | A151 : rent; A152 : own; A153 : for free | Qualitative
Number of existing credits at this bank | | Numerical
Job | A171 : unemployed/ unskilled – non-resident; A172 : unskilled – resident; A173 : skilled employee / official; A174 : management/ self-employed/ highly qualified employee/ officer | Qualitative
No of dependents | | Numerical
Telephone | A191 : none; A192 : yes, registered under the customer's name | Qualitative
Foreign Worker | A201 : yes; A202 : no | Qualitative
Loan Quality | 1 : Good loan; 2 : Bad loan | Qualitative

Attributes of German Credit Dataset

While we use this data to learn the techniques of data visualization, we will also be learning other important principles of data science, especially the process of data cleaning. We will work on this data to make it suitable for our analysis and to make the visualizations meaningful.

Data Import and Basic Manipulation

Now that we have the data in CSV format, we will first import it into R as a data frame. I placed the CSV file in a folder of my choice and set my working directory to that folder. I then used the read.csv() command to import the data into a data frame called df. Note that from R 4.0.0 onward, read.csv() no longer converts text columns to factors by default, so stringsAsFactors = TRUE is passed explicitly here to obtain the factor columns shown in the output that follows.

> setwd("C:/Users/FT/Dropbox/FinanceTrain/Courses/Data")
> getwd()
[1] "C:/Users/FT/Dropbox/FinanceTrain/Courses/Data"
> df <- read.csv("german-credit.csv", stringsAsFactors = TRUE)

The data is now loaded into our R session in the data frame df. We can inspect the structure of the data frame using the str() function.

> str(df)
'data.frame':    1000 obs. of  21 variables:
 $ Status.of.existing.checking.account    : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
 $ Duration.of.Credit..in.months.         : int  6 48 12 42 24 36 24 36 12 30 ...
 $ Credit.history                         : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
 $ Purpose.of.Loan                        : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
 $ Credit.amount                          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ Savings.account.bonds                  : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
 $ Present.employment.since               : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
 $ Installment.rate....                   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ Personal.status.and.sex                : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
 $ Guarantors                             : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
 $ Present.residence.since                : int  4 2 3 4 4 4 4 2 4 2 ...
 $ Most.valuable.available.asset          : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
 $ Age.in.years                           : int  67 22 49 45 53 35 53 35 61 28 ...
 $ Concurrent.Credits                     : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Type.of.housing                        : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
 $ Number.of.existing.credits.at.this.bank: int  2 1 1 1 2 1 1 1 1 2 ...
 $ Job                                    : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
 $ No.of.dependents                       : int  1 1 2 2 2 2 1 1 1 1 ...
 $ Telephone                              : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
 $ Foreign.Worker                         : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
 $ Loan.Quality                           : int  1 2 1 1 2 1 1 1 1 2 ...
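
Whether the qualitative columns arrive as factors depends on the stringsAsFactors argument of read.csv(). The following is a minimal sketch of that difference, using a made-up three-row CSV passed through the text argument. The rows are illustrative only and are not taken from the real file.

```r
# Toy CSV standing in for german-credit.csv (values are illustrative only).
csv_text <- "Foreign.Worker,Credit.amount,Telephone
A201,1169,A192
A201,5951,A191
A202,2096,A191"

# Text columns kept as plain character vectors.
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Text columns converted to factors, as in the str() output above.
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Count the columns of each kind in the factor version.
n_factor  <- sum(sapply(as_fct, is.factor))
n_numeric <- sum(sapply(as_fct, is.numeric))

str(as_chr)
str(as_fct)
```

The same sapply() counts can be run on df to compare against the 7 numerical and 13 categorical attributes listed earlier; Loan.Quality is read as an integer, so it will be counted with the numeric columns.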

A few observations about the data:

  • There are 1000 observations of 21 variables.
  • R keeps the numeric data as integers (int).
  • R converts the qualitative data into factors (categorical variables). The distinct values of a factor are called its levels. Internally, R stores each factor as integer codes (1, 2, 3, and so on) and maps them to the level strings, which are sorted alphabetically unless reordered. For example, the variable Foreign.Worker has two levels, A201 and A202, which correspond to ‘Yes’ and ‘No’ in our data. Internally they are stored as 1 and 2.
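
The mapping between level strings and integer codes can be seen on a small factor. This sketch uses a handful of purpose codes chosen for illustration. Because the levels are sorted as text, A410 lands before A42, which is consistent with the "A40","A41","A410" ordering in the str() output.

```r
# A few purpose codes as they might appear in the raw data.
purpose <- factor(c("A43", "A40", "A410", "A42", "A49"))

# Levels are the sorted distinct values (sorted as strings, not numbers).
levels(purpose)

# The integer code stored for each observation is its level's position.
as.integer(purpose)
```

This ordering matters whenever labels are assigned to levels by position.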

Relabeling the Factor Levels

Our factors are labeled with internal codes that are difficult to remember and not very useful in data visualizations. For example, in the case of Foreign.Worker, the levels A201 and A202 are not very intuitive, even though we know that they mean ‘Yes’ and ‘No’ and signify whether the applicant is a foreign worker.

In such situations we can rename the levels by supplying a new vector of labels, listed in the same order as the existing levels. The following example changes the level names from A201 and A202 to Yes and No.

> levels(df$Foreign.Worker) <- c('Yes','No')
> str(df)
'data.frame':    1000 obs. of  21 variables:
 $ Status.of.existing.checking.account    : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
 $ Duration.of.Credit..in.months.         : int  6 48 12 42 24 36 24 36 12 30 ...
 $ Credit.history                         : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
 $ Purpose.of.Loan                        : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
 $ Credit.amount                          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ Savings.account.bonds                  : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
 $ Present.employment.since               : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
 $ Installment.rate....                   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ Personal.status.and.sex                : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
 $ Guarantors                             : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
 $ Present.residence.since                : int  4 2 3 4 4 4 4 2 4 2 ...
 $ Most.valuable.available.asset          : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
 $ Age.in.years                           : int  67 22 49 45 53 35 53 35 61 28 ...
 $ Concurrent.Credits                     : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Type.of.housing                        : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
 $ Number.of.existing.credits.at.this.bank: int  2 1 1 1 2 1 1 1 1 2 ...
 $ Job                                    : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
 $ No.of.dependents                       : int  1 1 2 2 2 2 1 1 1 1 ...
 $ Telephone                              : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
 $ Foreign.Worker                         : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
 $ Loan.Quality                           : int  1 2 1 1 2 1 1 1 1 2 ...

We can also check the levels of any variable using the levels() function as shown below:

> levels(df$Foreign.Worker)
[1] "Yes" "No"
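
Assigning labels by position relies on knowing the exact order of the existing levels. One possible alternative, sketched here, is to key the labels by the original code. The helper relabel_by_code() is hypothetical and written for this illustration; it is not a base R function.

```r
# Hypothetical helper: relabel a factor using a named vector keyed by old code.
relabel_by_code <- function(x, lookup) {
  old <- levels(x)
  # Refuse to proceed if any existing level has no label in the lookup.
  missing_codes <- setdiff(old, names(lookup))
  if (length(missing_codes) > 0) {
    stop("No label supplied for: ", paste(missing_codes, collapse = ", "))
  }
  # Pick the labels in the order of the existing levels.
  levels(x) <- unname(lookup[old])
  x
}

# Toy factor; the lookup is deliberately given in a different order.
worker <- factor(c("A201", "A201", "A202"))
worker <- relabel_by_code(worker, c(A202 = "No", A201 = "Yes"))
levels(worker)
```

Because the lookup is matched by name, the order in which the labels are written no longer affects the result.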

Similarly, we will update the level names for all the qualitative variables in the data frame to suit our requirements. Labels are assigned by position, so they must follow the order of the existing levels. The levels are sorted as text, which places A410 (others) before A42 in Purpose.of.Loan, and A47 (vacation) does not occur in the data, so that variable has only 10 levels. Personal.status.and.sex has only four levels in the data (A95 does not occur); supplying a fifth label simply adds an empty level. The following script does the relabeling.

levels(df$Status.of.existing.checking.account) <- c('< 0 DM','0 - 200 DM', '>= 200 DM', 'No checking account')
levels(df$Credit.history) <- c('No Credits Taken','All Credit Paid', 'Existing Credit Paid','Delay in Payment','Critical Account')
levels(df$Purpose.of.Loan) <- c('car (new)', 'car (used)', 'others', 'furniture/equipment', 'radio/television', 'domestic appliances', 'repairs', 'education', 'retraining', 'business')
levels(df$Savings.account.bonds) <- c('< 100 DM', '100 - 500 DM', '500 - 1000 DM', '>= 1000 DM', 'No Savings Account')
levels(df$Present.employment.since) <- c('unemployed', '< 1 year', '1 - 4 years', '4 - 7 years', '>= 7 years')
levels(df$Personal.status.and.sex) <- c('male : divorced/separated', 'female : divorced/separated/married', 'male : single', 'male : married/widowed', 'female : single')
levels(df$Guarantors) <- c('none', 'co-applicant', 'guarantor')
levels(df$Most.valuable.available.asset) <- c('real estate', 'savings agreement/life insurance', 'car or other', 'unknown / no property')
levels(df$Concurrent.Credits) <- c('bank', 'stores', 'none')
levels(df$Type.of.housing) <- c('rent', 'own', 'for free')
levels(df$Job) <- c('unemployed/ unskilled - non-resident', 'unskilled - resident', 'skilled employee / official', 'management/ self-employed')
levels(df$Telephone) <- c('No','Yes')
levels(df$Foreign.Worker) <- c('Yes','No')

The above script updates the level names for all the qualitative variables with the labels we supplied. Let’s inspect the structure again.

> str(df)
'data.frame':    1000 obs. of  21 variables:
 $ Status.of.existing.checking.account    : Factor w/ 4 levels "< 0 DM","0 - 200 DM",..: 1 2 4 1 1 4 4 2 4 2 ...
 $ Duration.of.Credit..in.months.         : int  6 48 12 42 24 36 24 36 12 30 ...
 $ Credit.history                         : Factor w/ 5 levels "No Credits Taken",..: 5 3 5 3 4 3 3 3 3 5 ...
 $ Purpose.of.Loan                        : Factor w/ 10 levels "car (new)","car (used)",..: 5 5 8 4 1 8 4 2 5 1 ...
 $ Credit.amount                          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ Savings.account.bonds                  : Factor w/ 5 levels "< 100 DM","100 - 500 DM",..: 5 1 1 1 1 5 3 1 4 1 ...
 $ Present.employment.since               : Factor w/ 5 levels "unemployed","< 1 year",..: 5 3 4 4 3 3 5 3 4 1 ...
 $ Installment.rate....                   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ Personal.status.and.sex                : Factor w/ 5 levels "male : divorced/separated",..: 3 2 3 3 3 3 3 3 1 4 ...
 $ Guarantors                             : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
 $ Present.residence.since                : int  4 2 3 4 4 4 4 2 4 2 ...
 $ Most.valuable.available.asset          : Factor w/ 4 levels "real estate",..: 1 1 1 2 4 4 2 3 1 3 ...
 $ Age.in.years                           : int  67 22 49 45 53 35 53 35 61 28 ...
 $ Concurrent.Credits                     : Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Type.of.housing                        : Factor w/ 3 levels "rent","own","for free": 2 2 2 3 3 3 2 1 2 2 ...
 $ Number.of.existing.credits.at.this.bank: int  2 1 1 1 2 1 1 1 1 2 ...
 $ Job                                    : Factor w/ 4 levels "unemployed/ unskilled - non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
 $ No.of.dependents                       : int  1 1 2 2 2 2 1 1 1 1 ...
 $ Telephone                              : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 2 1 2 1 1 ...
 $ Foreign.Worker                         : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
 $ Loan.Quality                           : int  1 2 1 1 2 1 1 1 1 2 ...

Our data is now ready for data visualization.
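
Before plotting, it can be worth confirming that each original code ended up with the intended label. One way to do this, sketched here on a toy Telephone factor with invented values, is to keep a copy of the raw codes and cross-tabulate them against the relabeled factor.

```r
# Toy example: keep a copy of the raw codes, relabel, then cross-tabulate.
telephone_raw <- factor(c("A192", "A191", "A191", "A192", "A191"))
telephone <- telephone_raw
levels(telephone) <- c("No", "Yes")   # A191 -> No, A192 -> Yes

# Rows are the original codes, columns are the new labels.
check <- table(code = telephone_raw, label = telephone)
print(check)

# Each code should map to exactly one label (one non-zero cell per row).
one_to_one <- all(rowSums(check > 0) == 1)
```

Applied to the real data, this would mean saving a copy of a column before running the relabeling script and comparing the two afterwards.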

Copyright © 2021 Finance Train. All rights reserved.
