Explore Financial Data in R
Now that we have the data file in our working directory, we can load it in our R session and start exploring it.
Use the following command to load the data into R.
> loandata = read.csv("loan_data_2017.csv",stringsAsFactors=FALSE)
The “stringsAsFactors = FALSE” parameter turns off the automatic conversion of character strings to factors.
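To see what this parameter changes, here is a minimal sketch that reads a tiny inline CSV both ways. The two rows and the names csv_text, as_chr and as_fct are made up for illustration and are not part of the loan file. Note that since R 4.0.0 the default is already stringsAsFactors = FALSE, so passing it explicitly mainly matters on older versions.

```r
# Hypothetical two-row CSV, used only for illustration
csv_text <- "loan_amnt,loan_status\n10000,Fully Paid\n35000,Current"

# Same data, read once as plain strings and once as factors
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$loan_status)  # character vector
class(as_fct$loan_status)  # factor
```

Keeping the column as character makes the string comparisons and replacements later in this lesson simpler.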
Let's set the seed to 0 so that we can replicate the results.
> set.seed(0)
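As a quick sketch of what the seed does, resetting it to the same value before drawing random numbers reproduces the same draws. The variable names below are hypothetical.

```r
# Draw three random numbers after setting the seed
set.seed(0)
first_draw <- runif(3)

# Reset the seed and draw again: the numbers repeat
set.seed(0)
second_draw <- runif(3)

identical(first_draw, second_draw)
```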
Before we proceed with our analysis, we need to understand what the dataset contains. We may also want to remove unwanted data and transform some columns to make them usable for analysis.
Data Summary
Let’s first analyse the structure of the data using the str() or summary() function.
> str(loandata)
'data.frame': 133889 obs. of 145 variables:
$ id : chr "" "" "" "" ...
$ member_id : logi NA NA NA NA NA NA ...
$ loan_amnt : int 10000 35000 20000 17475 8000 14400 18000 5800 12500 3000 ...
... (List of all variables)
As you can see, we have 133889 observations and 145 variables.
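The str() output also shows that member_id holds only NA values. As a rough sketch of how such columns could be spotted and dropped, the snippet below works on a small made-up data frame; toy, all_na and toy_clean are hypothetical names standing in for the real loandata.

```r
# Hypothetical stand-in for loandata with one all-NA column
toy <- data.frame(
  id        = c("", "", ""),
  member_id = c(NA, NA, NA),
  loan_amnt = c(10000L, 35000L, 20000L),
  stringsAsFactors = FALSE
)

dim(toy)  # number of rows, then number of columns

# TRUE for every column in which all values are NA
all_na <- sapply(toy, function(col) all(is.na(col)))
names(toy)[all_na]

# Keep only the columns that carry at least one value
toy_clean <- toy[, !all_na, drop = FALSE]
```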
Loan Statuses
One of the variables is loan_status. Let’s look at what unique values we have for loan_status.
> unique(loandata$loan_status)
[1] "Fully Paid" "Current" "Late (31-120 days)" "Charged Off"
[5] "Late (16-30 days)" "In Grace Period" "Default" ""
>
As we can see, apart from the empty string we have 7 loan statuses: Charged Off, Current, Default, Fully Paid, In Grace Period, Late (16-30 days), and Late (31-120 days). We treat Late (31-120 days), Default, and Charged Off as default loans and Fully Paid as a desirable loan, and ignore everything else.
So, let’s remove the records for loan statuses ‘Current’, ‘In Grace Period’, and ‘Late (16-30 days)’. This can be done in many ways; we will use the subset() function.
> loandata <- subset(loandata, loan_status!='In Grace Period')
> loandata <- subset(loandata, loan_status!='Late (16-30 days)')
> loandata <- subset(loandata, loan_status!='Current')
>
We will also remove records where loan status is empty.
> loandata <- subset(loandata, loan_status!='')
We are now left with only 4 loan statuses, namely, Fully Paid, Late (31-120 days), Charged Off, and Default.
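The four subset() calls can also be written as a single filter with %in%. Here is a minimal sketch on a made-up data frame; toy_loans and keep_statuses are hypothetical names.

```r
# Hypothetical data frame holding one row per status seen in the data
toy_loans <- data.frame(
  loan_status = c("Fully Paid", "Current", "Late (31-120 days)", "Charged Off",
                  "Late (16-30 days)", "In Grace Period", "Default", ""),
  stringsAsFactors = FALSE
)

# Statuses we want to keep for the default / no-default analysis
keep_statuses <- c("Fully Paid", "Late (31-120 days)", "Charged Off", "Default")

# One filter replaces the four separate subset() calls
toy_loans <- subset(toy_loans, loan_status %in% keep_statuses)
unique(toy_loans$loan_status)
```

Listing the statuses to keep, rather than the ones to drop, also removes the empty status without a separate step.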
Let us look at the number of loans in each status. For this purpose, we will use a package called dplyr. We can use its group_by function to see how many loans there are for each status.
> install.packages("dplyr")
> library(dplyr)
> loandata %>% group_by(loan_status) %>% summarise(count = n())
# A tibble: 4 x 2
loan_status count
<chr> <int>
1 Charged Off 15698
2 Default 9
3 Fully Paid 41833
4 Late (31-120 days) 2329
>
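If you would rather not load dplyr for a simple count, base R's table() gives the same kind of summary. This is an alternative sketch on a made-up vector (toy_status is a hypothetical name), not the approach used in the rest of this lesson.

```r
# Hypothetical vector of loan statuses
toy_status <- c("Fully Paid", "Charged Off", "Fully Paid", "Default", "Fully Paid")

# Count how often each status occurs
status_counts <- table(toy_status)
status_counts

# Convert to a data frame if a tabular result is needed
as.data.frame(status_counts, stringsAsFactors = FALSE)
```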
Loan Statuses: Default and Fully Paid
Since we want to look at only two scenarios, default and no default, we will combine Charged Off, Default, and Late (31-120 days) into a single category: Default.
Here we will use another R package called stringr for string manipulations.
> install.packages("stringr")
> library(stringr)
> loandata$loan_status = ifelse(str_detect(loandata$loan_status,"Paid"),loandata$loan_status,"Default")
In this command, we use the str_detect function of the stringr package to detect the presence of the string 'Paid'. We then use ifelse to keep those records as is and change the status of all other records to Default.
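To see the relabelling logic in isolation, here is a sketch on a made-up vector. It uses base R's grepl() with fixed = TRUE as a stand-in for str_detect(), an assumption made only so that the snippet runs without stringr installed. The names toy_status, is_paid and relabelled are hypothetical.

```r
# Hypothetical vector with the four statuses left after filtering
toy_status <- c("Fully Paid", "Charged Off", "Default", "Late (31-120 days)")

# TRUE where the status contains the literal text "Paid"
is_paid <- grepl("Paid", toy_status, fixed = TRUE)

# Keep matching statuses unchanged, relabel everything else as "Default"
relabelled <- ifelse(is_paid, toy_status, "Default")
relabelled
```

Only "Fully Paid" contains the text "Paid", so it is the only value that survives unchanged.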
We can again look at the number of each loan status to make sure that our data is correct.
> loandata %>% group_by(loan_status) %>% summarise(count = n())
# A tibble: 2 x 2
loan_status count
<chr> <int>
1 Default 18036
2 Fully Paid 41833
>
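The counts above also tell us how unbalanced the two classes are, which is worth knowing before building a model. A small sketch that turns the two reported counts into percentages (status_counts and shares are hypothetical names):

```r
# Counts taken from the summary output above
status_counts <- c(Default = 18036, `Fully Paid` = 41833)

# Share of each class as a percentage, rounded to one decimal place
shares <- round(100 * prop.table(status_counts), 1)
shares
```

Roughly 30% of the remaining loans fall into the Default category.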
Plotting Loan Status
We can use the ggplot2 package to plot a bar chart of loan status.
> install.packages("ggplot2")
> library(ggplot2)
> g <- ggplot(loandata, aes(x = loan_status, fill=loan_status))
> g + geom_bar()
In the next lesson, we will start working on our problem of predicting loan default.