Data Import and Basic Manipulation in R

To learn data visualization with ggplot2 in R, we will be making use of various datasets. However, one interesting dataset that we will be using quite a lot in this section is the German Credit dataset.

The German Credit Data contains data on 20 variables and the classification whether an applicant is considered a Good or a Bad credit risk for 1000 loan applicants.

When a bank receives a loan application, based on the applicant’s profile the bank has to make a decision regarding whether to go ahead with the loan approval or not. Two types of risks are associated with the bank’s decision:

If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank
If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank

In this section, we will explore the dataset using ggplot2 and create both exploratory as well as explanatory data visualizations. However, later in another course we will also use this dataset to build a predictive credit risk model. To minimize loss from the bank’s perspective, the bank needs a decision rule regarding who to give approval of the loan and who not to. An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application. A predictive model developed on this data is expected to provide a bank manager guidance for making a decision whether to approve a loan to a prospective applicant based on his/her profiles.

You can download the data from the following link:

Attributes of German Credit Data

Number of Attributes: 20 (7 numerical, 13 categorical).


Attribute	Description	Type
Status of existing checking account	A11 : `< 0` DMA12 : `0 <= ... < 200` DMA13 : `>= 200` DMA14 : No checking account	Qualitative
Duration of Credit (in months)		Numerical
Credit history	A30 : no credits taken/ all credits paid back dulyA31 : all credits at this bank paid back dulyA32 : existing credits paid back duly till nowA33 : delay in paying off in the pastA34 : critical account/other credits existing (not at this bank)	Qualitative
Purpose of Loan	A40 : car (new)A41 : car (used)A42 : furniture/equipmentA43 : radio/televisionA44 : domestic appliancesA45 : repairsA46 : educationA47 : vacationA48 : retrainingA49 : businessA410 : others	Qualitative
Credit amount		Numerical
Savings account/bonds	A61 : `< 100` DMA62 : `100 <= ... < 500` DMA63 : `500 <= ... < 1000` DMA64 : `... >= 1000` DMA65 : unknown/ no savings account	Qualitative
Present employment since	A71 : unemployedA72 : `... < 1` yearA73 : `1 <= ... < 4` yearsA74 : `4 <= ... < 7` yearsA75 : .. `>= 7` years	Qualitative
Installment rate (%)		Numerical
Personal status and sex	A91 : male : divorced/separatedA92 : female : divorced/separated/marriedA93 : male : singleA94 : male : married/widowedA95 : female : single	Qualitative
Guarantors	A101 : noneA102 : co-applicantA103 : guarantor	Qualitative
Present residence since		Numerical
Most valuable available asset	A121 : real estateA122 : if notA121 : building society savings agreement/ life insuranceA123 : if not A121/A122 : car or other, not in attribute 6A124 : unknown / no property	Qualitative
Age in years		Numerical
Concurrent Credits	A141 : bankA142 : storesA143 : none	Qualitative
Type of housing	A151 : rentA152 : ownA153 : for free	Qualitative
Number of existing credits at this bank		Numerical
Job	A171 : unemployed/ unskilled - non-residentA172 : unskilled - residentA173 : skilled employee / officialA174 : management/ self-employed/ highly qualified employee/ officer	Qualitative
No of dependents		Numerical
Telephone	A191 : noneA192 : yes, registered under the customers name	Qualitative
Foreign Worker	A201 : yesA202 : no	Qualitative
Loan Quality	1 : Bad loan2 : Good loan	Qualitative

While we use this data to learn the techniques of data visualizations, we will also be learning other important principles of data science, specially the process of data cleaning. We will work on this data to make it suitable for our analysis and to make visualizations meaningful.

Data Import and Basic Manipulation

Now that we have the data in CSV format, we will first import it into R as a data frame. I first placed the CSV file in a folder of my choice, then updated my working directory to the folder where the data file is stored. Then I used the read.csv() command to import the CSV data into a data frame called df.

1> setwd("C:/Users/FT/Dropbox/FinanceTrain/Courses/Data")
2> getwd()
3[1] "C:/Users/FT/Dropbox/FinanceTrain/Courses/Data"
4> df <- read.csv("german-credit.csv")
5

The data is now loaded into our R session in df dataframe. We can inspect the structure of the dataframe using the str() function.

1> str(df)
2'data.frame':    1000 obs. of  21 variables:
3 $ Status.of.existing.checking.account    : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
4 $ Duration.of.Credit..in.months.         : int  6 48 12 42 24 36 24 36 12 30 ...
5 $ Credit.history                         : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
6 $ Purpose.of.Loan                        : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
7 $ Credit.amount                          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
8 $ Savings.account.bonds                  : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
9 $ Present.employment.since               : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
10 $ Installment.rate....                   : int  4 2 2 2 3 2 3 2 2 4 ...
11 $ Personal.status.and.sex                : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
12 $ Guarantors                             : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
13 $ Present.residence.since                : int  4 2 3 4 4 4 4 2 4 2 ...
14 $ Most.valuable.available.asset          : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
15 $ Age.in.years                           : int  67 22 49 45 53 35 53 35 61 28 ...
16 $ Concurrent.Credits                     : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
17 $ Type.of.housing                        : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
18 $ Number.of.existing.credits.at.this.bank: int  2 1 1 1 2 1 1 1 1 2 ...
19 $ Job                                    : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
20 $ No.of.dependents                       : int  1 1 2 2 2 2 1 1 1 1 ...
21 $ Telephone                              : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
22 $ Foreign.Worker                         : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
23 $ Loan.Quality                           : int  1 2 1 1 2 1 1 1 1 2 ...
24

A few observations about the data:

There are 1000 observations of 21 variables.
R keeps the numeric data as it is (int)
R converts the qualitative data in factors (categories). These are also called levels. Internally, R stores the integer values 1, 2, and 3, and maps the character strings (in alphabetical order, unless I reorder) to these values. For example, the variable Foreign.Worker has two levels, namely, A201 and A202 which correspond to 'Yes' and No' in our data. Internally data is stored as 1 and 2.

Relabeling the Factor Levels

Our factors are labeled using some internal codes which are a bit difficult to remember. They are also not very useful data visualizations. For example, in case of Foreign.Worker, the levels A201 and A202 not very intuitive even though we know that they mean 'Yes and 'No' to signify whether it is a foreign worker or not.

In such situation we can easily rename the levels by supplying a new vector of labels. The following example shows changing the level names from A201 and A202 to Yes and No.

1> levels(df$Foreign.Worker) <- c('Yes','No')
2> str(df)
3'data.frame':    1000 obs. of  21 variables:
4 $ Status.of.existing.checking.account    : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
5 $ Duration.of.Credit..in.months.         : int  6 48 12 42 24 36 24 36 12 30 ...
6 $ Credit.history                         : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
7 $ Purpose.of.Loan                        : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
8 $ Credit.amount                          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
9 $ Savings.account.bonds                  : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
10 $ Present.employment.since               : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
11 $ Installment.rate....                   : int  4 2 2 2 3 2 3 2 2 4 ...
12 $ Personal.status.and.sex                : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
13 $ Guarantors                             : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
14 $ Present.residence.since                : int  4 2 3 4 4 4 4 2 4 2 ...
15 $ Most.valuable.available.asset          : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
16 $ Age.in.years                           : int  67 22 49 45 53 35 53 35 61 28 ...
17 $ Concurrent.Credits                     : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
18 $ Type.of.housing                        : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
19 $ Number.of.existing.credits.at.this.bank: int  2 1 1 1 2 1 1 1 1 2 ...
20 $ Job                                    : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
21 $ No.of.dependents                       : int  1 1 2 2 2 2 1 1 1 1 ...
22 $ Telephone                              : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
23 $ Foreign.Worker                         : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
24 $ Loan.Quality                           : int  1 2 1 1 2 1 1 1 1 2 ...
25

We can also check the levels of any variable using the levels() function as shown below:

1> levels(df$Foreign.Worker)
2[1] "Yes" "No"
3

Similarly, we will update the level names for all the qualitative variables in the dataframe to suite our requirements. The following script does that.

1levels(df$Status.of.existing.checking.account) <- c('< 0 DM','0 - 200 DM', '>= 200 DM', 'No checking account')
2levels(df$Credit.history) <- c('No Credits Taken','All Credit Paid', 'Existing Credit Paid','Delay in Payment','Critical Account')
3levels(df$Purpose.of.Loan) <- c('car (new)', 'car (used)', 'furniture/equipment', 'radio/television', 'domestic appliances', 'repairs', 'education', 'vacation', 'retraining', 'business', 'others')
4levels(df$Savings.account.bonds) <- c('<  100 DM', '100 - 500 DM', '500 - 1000 DM', '>= 1000 DM', 'No Savings Account')
5levels(df$Present.employment.since) <- c('unemployed', '< 1 year', '1 - 4 years', '4 - 7 years', '>= 7 years')
6levels(df$Personal.status.and.sex) <- c('male : divorced/separated', 'female : divorced/separated/married', 'male : single', 'male : married/widowed', 'female : single')
7levels(df$Guarantors) <- c('none', 'co-applicant', 'guarantor')
8levels(df$Most.valuable.available.asset) <- c('real estate', 'savings agreement/life insurance', 'car or other', 'unknown / no property')
9levels(df$Concurrent.Credits) <- c('bank', 'stores', 'none')
10levels(df$Type.of.housing) <- c('rent', 'own', 'for free')
11levels(df$Job) <- c('unemployed/ unskilled - non-resident', 'unskilled - resident', 'skilled employee / official', 'management/ self-employed')
12levels(df$Telephone) <- c('No','Yes')
13levels(df$Foreign.Worker) <- c('Yes','No')
14

The above script will update the level names for all the variables as per the labels provided for us. Let's inspect the structure again.

1> str(df)
2'data.frame':    1000 obs. of  21 variables:
3 $ Status.of.existing.checking.account    : Factor w/ 4 levels "< 0 DM","0 - 200 DM",..: 1 2 4 1 1 4 4 2 4 2 ...
4 $ Duration.of.Credit..in.months.         : int  6 48 12 42 24 36 24 36 12 30 ...
5 $ Credit.history                         : Factor w/ 5 levels "No Credits Taken",..: 5 3 5 3 4 3 3 3 3 5 ...
6 $ Purpose.of.Loan                        : Factor w/ 11 levels "car (new)","car (used)",..: 5 5 8 4 1 8 4 2 5 1 ...
7 $ Credit.amount                          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
8 $ Savings.account.bonds                  : Factor w/ 5 levels "<  100 DM","100 - 500 DM",..: 5 1 1 1 1 5 3 1 4 1 ...
9 $ Present.employment.since               : Factor w/ 5 levels "unemployed","< 1 year",..: 5 3 4 4 3 3 5 3 4 1 ...
10 $ Installment.rate....                   : int  4 2 2 2 3 2 3 2 2 4 ...
11 $ Personal.status.and.sex                : Factor w/ 5 levels "male : divorced/separated",..: 3 2 3 3 3 3 3 3 1 4 ...
12 $ Guarantors                             : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
13 $ Present.residence.since                : int  4 2 3 4 4 4 4 2 4 2 ...
14 $ Most.valuable.available.asset          : Factor w/ 4 levels "real estate",..: 1 1 1 2 4 4 2 3 1 3 ...
15 $ Age.in.years                           : int  67 22 49 45 53 35 53 35 61 28 ...
16 $ Concurrent.Credits                     : Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
17 $ Type.of.housing                        : Factor w/ 3 levels "rent","own","for free": 2 2 2 3 3 3 2 1 2 2 ...
18 $ Number.of.existing.credits.at.this.bank: int  2 1 1 1 2 1 1 1 1 2 ...
19 $ Job                                    : Factor w/ 4 levels "unemployed/ unskilled - non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
20 $ No.of.dependents                       : int  1 1 2 2 2 2 1 1 1 1 ...
21 $ Telephone                              : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 2 1 2 1 1 ...
22 $ Foreign.Worker                         : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
23 $ Loan.Quality                           : int  1 2 1 1 2 1 1 1 1 2 ...
24>
25

Our data is now ready for data visualization.

Learn

Resources

Data Import and Basic Manipulation in R - German Credit Dataset

Attributes of German Credit Data

Data Import and Basic Manipulation

Relabeling the Factor Levels

Downloads

Resources

German Credit Dataset

Grammar of Graphics in ggplot

Create ggplot Graph with German Credit Data in R

Data Visualization with R

Data Science for Finance Bundle

Topics