Data Import and Basic Manipulation in R - German Credit Dataset

To learn data visualization with ggplot2 in R, we will be making use of various datasets. However, one interesting dataset that we will be using quite a lot in this section is the German Credit dataset.

The German Credit Data contains data on 20 variables and the classification whether an applicant is considered a Good or a Bad credit risk for 1000 loan applicants.

When a bank receives a loan application, based on the applicant’s profile the bank has to make a decision regarding whether to go ahead with the loan approval or not. Two types of risks are associated with the bank’s decision:

  • If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank
  • If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank

In this section, we will explore the dataset using ggplot2 and create both exploratory as well as explanatory data visualizations. However, later in another course we will also use this dataset to build a predictive credit risk model. To minimize loss from the bank’s perspective, the bank needs a decision rule regarding who to give approval of the loan and who not to. An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application. A predictive model developed on this data is expected to provide a bank manager guidance for making a decision whether to approve a loan to a prospective applicant based on his/her profiles.

You can download the data from the following link:

Downloads

Attributes of German Credit Data

Number of Attributes: 20 (7 numerical, 13 categorical).

AttributeDescriptionType
Status of existing checking accountA11 : < 0 DMA12 : 0 <= ... < 200 DMA13 : >= 200 DMA14 : No checking accountQualitative
Duration of Credit (in months)Numerical
Credit historyA30 : no credits taken/ all credits paid back dulyA31 : all credits at this bank paid back dulyA32 : existing credits paid back duly till nowA33 : delay in paying off in the pastA34 : critical account/other credits existing (not at this bank)Qualitative
Purpose of LoanA40 : car (new)A41 : car (used)A42 : furniture/equipmentA43 : radio/televisionA44 : domestic appliancesA45 : repairsA46 : educationA47 : vacationA48 : retrainingA49 : businessA410 : othersQualitative
Credit amountNumerical
Savings account/bondsA61 : < 100 DMA62 : 100 <= ... < 500 DMA63 : 500 <= ... < 1000 DMA64 : ... >= 1000 DMA65 : unknown/ no savings accountQualitative
Present employment sinceA71 : unemployedA72 : ... < 1 yearA73 : 1 <= ... < 4 yearsA74 : 4 <= ... < 7 yearsA75 : .. >= 7 yearsQualitative
Installment rate (%)Numerical
Personal status and sexA91 : male : divorced/separatedA92 : female : divorced/separated/marriedA93 : male : singleA94 : male : married/widowedA95 : female : singleQualitative
GuarantorsA101 : noneA102 : co-applicantA103 : guarantorQualitative
Present residence sinceNumerical
Most valuable available assetA121 : real estateA122 : if notA121 : building society savings agreement/ life insuranceA123 : if not A121/A122 : car or other, not in attribute 6A124 : unknown / no propertyQualitative
Age in yearsNumerical
Concurrent CreditsA141 : bankA142 : storesA143 : noneQualitative
Type of housingA151 : rentA152 : ownA153 : for freeQualitative
Number of existing credits at this bankNumerical
JobA171 : unemployed/ unskilled - non-residentA172 : unskilled - residentA173 : skilled employee / officialA174 : management/ self-employed/ highly qualified employee/ officerQualitative
No of dependentsNumerical
TelephoneA191 : noneA192 : yes, registered under the customers nameQualitative
Foreign WorkerA201 : yesA202 : noQualitative
Loan Quality1 : Bad loan2 : Good loanQualitative

While we use this data to learn the techniques of data visualizations, we will also be learning other important principles of data science, specially the process of data cleaning. We will work on this data to make it suitable for our analysis and to make visualizations meaningful.

Data Import and Basic Manipulation

Now that we have the data in CSV format, we will first import it into R as a data frame. I first placed the CSV file in a folder of my choice, then updated my working directory to the folder where the data file is stored. Then I used the read.csv() command to import the CSV data into a data frame called df.

> setwd("C:/Users/FT/Dropbox/FinanceTrain/Courses/Data")
> getwd()
[1] "C:/Users/FT/Dropbox/FinanceTrain/Courses/Data"
> df <- read.csv("german-credit.csv")

The data is now loaded into our R session in df dataframe. We can inspect the structure of the dataframe using the str() function.

> str(df)
'data.frame':    1000 obs. of  21 variables:
 $ Status.of.existing.checking.account    : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
 $ Duration.of.Credit..in.months.         : int  6 48 12 42 24 36 24 36 12 30 ...
 $ Credit.history                         : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
 $ Purpose.of.Loan                        : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
 $ Credit.amount                          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ Savings.account.bonds                  : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
 $ Present.employment.since               : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
 $ Installment.rate....                   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ Personal.status.and.sex                : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
 $ Guarantors                             : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
 $ Present.residence.since                : int  4 2 3 4 4 4 4 2 4 2 ...
 $ Most.valuable.available.asset          : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
 $ Age.in.years                           : int  67 22 49 45 53 35 53 35 61 28 ...
 $ Concurrent.Credits                     : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Type.of.housing                        : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
 $ Number.of.existing.credits.at.this.bank: int  2 1 1 1 2 1 1 1 1 2 ...
 $ Job                                    : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
 $ No.of.dependents                       : int  1 1 2 2 2 2 1 1 1 1 ...
 $ Telephone                              : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
 $ Foreign.Worker                         : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
 $ Loan.Quality                           : int  1 2 1 1 2 1 1 1 1 2 ...

A few observations about the data:

  • There are 1000 observations of 21 variables.
  • R keeps the numeric data as it is (int)
  • R converts the qualitative data in factors (categories). These are also called levels. Internally, R stores the integer values 1, 2, and 3, and maps the character strings (in alphabetical order, unless I reorder) to these values. For example, the variable Foreign.Worker has two levels, namely, A201 and A202 which correspond to 'Yes' and No' in our data. Internally data is stored as 1 and 2.

Relabeling the Factor Levels

Our factors are labeled using some internal codes which are a bit difficult to remember. They are also not very useful data visualizations. For example, in case of Foreign.Worker, the levels A201 and A202 not very intuitive even though we know that they mean 'Yes and 'No' to signify whether it is a foreign worker or not.

In such situation we can easily rename the levels by supplying a new vector of labels. The following example shows changing the level names from A201 and A202 to Yes and No.

> levels(df$Foreign.Worker) <- c('Yes','No')
> str(df)
'data.frame':    1000 obs. of  21 variables:
 $ Status.of.existing.checking.account    : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
 $ Duration.of.Credit..in.months.         : int  6 48 12 42 24 36 24 36 12 30 ...
 $ Credit.history                         : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
 $ Purpose.of.Loan                        : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
 $ Credit.amount                          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ Savings.account.bonds                  : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
 $ Present.employment.since               : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
 $ Installment.rate....                   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ Personal.status.and.sex                : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
 $ Guarantors                             : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
 $ Present.residence.since                : int  4 2 3 4 4 4 4 2 4 2 ...
 $ Most.valuable.available.asset          : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
 $ Age.in.years                           : int  67 22 49 45 53 35 53 35 61 28 ...
 $ Concurrent.Credits                     : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Type.of.housing                        : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
 $ Number.of.existing.credits.at.this.bank: int  2 1 1 1 2 1 1 1 1 2 ...
 $ Job                                    : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
 $ No.of.dependents                       : int  1 1 2 2 2 2 1 1 1 1 ...
 $ Telephone                              : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
 $ Foreign.Worker                         : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
 $ Loan.Quality                           : int  1 2 1 1 2 1 1 1 1 2 ...

We can also check the levels of any variable using the levels() function as shown below:

> levels(df$Foreign.Worker)
[1] "Yes" "No"

Similarly, we will update the level names for all the qualitative variables in the dataframe to suite our requirements. The following script does that.

levels(df$Status.of.existing.checking.account) <- c('< 0 DM','0 - 200 DM', '>= 200 DM', 'No checking account')
levels(df$Credit.history) <- c('No Credits Taken','All Credit Paid', 'Existing Credit Paid','Delay in Payment','Critical Account')
levels(df$Purpose.of.Loan) <- c('car (new)', 'car (used)', 'furniture/equipment', 'radio/television', 'domestic appliances', 'repairs', 'education', 'vacation', 'retraining', 'business', 'others')
levels(df$Savings.account.bonds) <- c('<  100 DM', '100 - 500 DM', '500 - 1000 DM', '>= 1000 DM', 'No Savings Account')
levels(df$Present.employment.since) <- c('unemployed', '< 1 year', '1 - 4 years', '4 - 7 years', '>= 7 years')
levels(df$Personal.status.and.sex) <- c('male : divorced/separated', 'female : divorced/separated/married', 'male : single', 'male : married/widowed', 'female : single')
levels(df$Guarantors) <- c('none', 'co-applicant', 'guarantor')
levels(df$Most.valuable.available.asset) <- c('real estate', 'savings agreement/life insurance', 'car or other', 'unknown / no property')
levels(df$Concurrent.Credits) <- c('bank', 'stores', 'none')
levels(df$Type.of.housing) <- c('rent', 'own', 'for free')
levels(df$Job) <- c('unemployed/ unskilled - non-resident', 'unskilled - resident', 'skilled employee / official', 'management/ self-employed')
levels(df$Telephone) <- c('No','Yes')
levels(df$Foreign.Worker) <- c('Yes','No')

The above script will update the level names for all the variables as per the labels provided for us. Let's inspect the structure again.

> str(df)
'data.frame':    1000 obs. of  21 variables:
 $ Status.of.existing.checking.account    : Factor w/ 4 levels "< 0 DM","0 - 200 DM",..: 1 2 4 1 1 4 4 2 4 2 ...
 $ Duration.of.Credit..in.months.         : int  6 48 12 42 24 36 24 36 12 30 ...
 $ Credit.history                         : Factor w/ 5 levels "No Credits Taken",..: 5 3 5 3 4 3 3 3 3 5 ...
 $ Purpose.of.Loan                        : Factor w/ 11 levels "car (new)","car (used)",..: 5 5 8 4 1 8 4 2 5 1 ...
 $ Credit.amount                          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ Savings.account.bonds                  : Factor w/ 5 levels "<  100 DM","100 - 500 DM",..: 5 1 1 1 1 5 3 1 4 1 ...
 $ Present.employment.since               : Factor w/ 5 levels "unemployed","< 1 year",..: 5 3 4 4 3 3 5 3 4 1 ...
 $ Installment.rate....                   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ Personal.status.and.sex                : Factor w/ 5 levels "male : divorced/separated",..: 3 2 3 3 3 3 3 3 1 4 ...
 $ Guarantors                             : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
 $ Present.residence.since                : int  4 2 3 4 4 4 4 2 4 2 ...
 $ Most.valuable.available.asset          : Factor w/ 4 levels "real estate",..: 1 1 1 2 4 4 2 3 1 3 ...
 $ Age.in.years                           : int  67 22 49 45 53 35 53 35 61 28 ...
 $ Concurrent.Credits                     : Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Type.of.housing                        : Factor w/ 3 levels "rent","own","for free": 2 2 2 3 3 3 2 1 2 2 ...
 $ Number.of.existing.credits.at.this.bank: int  2 1 1 1 2 1 1 1 1 2 ...
 $ Job                                    : Factor w/ 4 levels "unemployed/ unskilled - non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
 $ No.of.dependents                       : int  1 1 2 2 2 2 1 1 1 1 ...
 $ Telephone                              : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 2 1 2 1 1 ...
 $ Foreign.Worker                         : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
 $ Loan.Quality                           : int  1 2 1 1 2 1 1 1 1 2 ...
>

Our data is now ready for data visualization.

Lesson Resources

All Users

Related Downloads

Data Science in Finance: 9-Book Bundle

Data Science in Finance Book Bundle

Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.

What's Included:

  • Getting Started with R
  • R Programming for Data Science
  • Data Visualization with R
  • Financial Time Series Analysis with R
  • Quantitative Trading Strategies with R
  • Derivatives with R
  • Credit Risk Modelling With R
  • Python for Data Science
  • Machine Learning in Finance using Python

Each book includes PDFs, explanations, instructions, data files, and R code for all examples.

Get the Bundle for $39 (Regular $57)
JOIN 30,000 DATA PROFESSIONALS

Free Guides - Getting Started with R and Python

Enter your name and email address below and we will email you the guides for R programming and Python.

Data Science in Finance: 9-Book Bundle

Data Science in Finance Book Bundle

Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.

What's Included:

  • Getting Started with R
  • R Programming for Data Science
  • Data Visualization with R
  • Financial Time Series Analysis with R
  • Quantitative Trading Strategies with R
  • Derivatives with R
  • Credit Risk Modelling With R
  • Python for Data Science
  • Machine Learning in Finance using Python

Each book comes with PDFs, detailed explanations, step-by-step instructions, data files, and complete downloadable R code for all examples.