Finance Train

Data Cleaning in R – Part 2

Data Science, Risk Management

This lesson is part 16 of 28 in the course Credit Risk Modelling in R

Attributes with Zero Variance

Datasets sometimes contain attributes (predictors) that have near-zero variance or take just a single value. Such variables carry little or no predictive power. Apart from being uninformative, they can also break the model you are trying to fit. This can happen, for example, through division by zero when the data are standardized, because a constant column has a standard deviation of zero.

One quick solution is to remove all predictors that satisfy some threshold criterion related to their variance.
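A threshold rule of this kind could be sketched in base R as follows. The function name `dropLowVariance`, its `threshold` argument, and the toy data frame are illustrative assumptions, not part of the course code.

```r
# Hypothetical helper: drop numeric columns whose variance is at or below a threshold
dropLowVariance <- function(df, threshold = 0) {
    is_num <- vapply(df, is.numeric, logical(1))
    # Variance of each numeric column, ignoring missing values
    v <- sapply(df[is_num], var, na.rm = TRUE)
    # which() skips NA variances (e.g. columns with fewer than two observed values)
    low <- names(v)[which(v <= threshold)]
    df[, !(names(df) %in% low), drop = FALSE]
}

# Toy data, made up for illustration
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(5, 5, 5, 5),
                  c = c(0.1, 0.1, 0.1, 0.11))

strict <- names(dropLowVariance(toy))                    # threshold 0 removes only b
loose  <- names(dropLowVariance(toy, threshold = 1e-3))  # a looser threshold also removes c
```

With `threshold = 0` only exactly constant columns are removed. A small positive threshold also catches near-constant columns, but the right value depends on the scale of each variable.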

In our dataset, we will look for predictors that have zero variance and will remove them.

We will first define some generic functions that we will use later.

# Returns the Numeric columns from a dataset
getNumericColumns<-function(t){
    tn = sapply(t,function(x){is.numeric(x)})
    return(names(tn)[which(tn)])
}
# Returns the character columns from a dataset
getCharColumns<-function(t){
    tn = sapply(t,function(x){is.character(x)})
    return(names(tn)[which(tn)])
}
# Returns the factor columns in a dataset 
getFactorColumns<-function(t){
    tn = sapply(t,function(x){is.factor(x)})
    return(names(tn)[which(tn)])
}
# Returns the positions (indexes) of the given column names in a dataset
getIndexsOfColumns <- function(t,column_names){
    return(match(column_names,colnames(t)))
}
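
As a quick check of what these helpers return, here is a small usage sketch. The toy data frame and its values are made up, and the helpers are restated in compact form only so that the snippet runs on its own.

```r
# Compact restatements of the helpers above, so the snippet is self-contained
getNumericColumns  <- function(t) names(t)[vapply(t, is.numeric, logical(1))]
getCharColumns     <- function(t) names(t)[vapply(t, is.character, logical(1))]
getFactorColumns   <- function(t) names(t)[vapply(t, is.factor, logical(1))]
getIndexsOfColumns <- function(t, column_names) match(column_names, colnames(t))

# Toy data frame (hypothetical values)
toy <- data.frame(
    loan_amnt = c(5000, 12000, 8000),
    purpose   = c("car", "house", "car"),
    grade     = factor(c("A", "B", "A")),
    stringsAsFactors = FALSE
)

num_cols <- getNumericColumns(toy)                       # "loan_amnt"
chr_cols <- getCharColumns(toy)                          # "purpose"
fct_cols <- getFactorColumns(toy)                        # "grade"
idx      <- getIndexsOfColumns(toy, c("grade", "purpose"))  # 3 2
```

Note that `getIndexsOfColumns` returns positions in the order of the names passed in, and `NA` for any name that is not a column.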

Now we can find the character columns that contain a single value and the numeric columns that have zero variance.

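# Character columns with only one distinct value
tmp = sapply(data_train[getCharColumns(data_train)],function(x){length(unique(x))})
tmp = tmp[tmp==1]

# Numeric columns with a standard deviation of zero
# (which() skips columns where sd() is NA because of missing values)
tmp2 = sapply(data_train[getNumericColumns(data_train)],function(x){sd(x)})
tmp2 = tmp2[which(tmp2==0)]

# Names of all columns to discard
discard_column = c(names(tmp),names(tmp2))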

> discard_column
[1] "policy_code"
>
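
One caveat when reusing this check on other data: `sd()` returns `NA` for a column that contains missing values, so the comparison with zero is then `NA` rather than `TRUE` or `FALSE`. The sketch below shows the effect on made-up data, assuming it is acceptable to ignore missing values when judging whether a column is constant.

```r
# Made-up data: x is constant apart from one missing value
toy <- data.frame(x = c(1, 1, NA, 1), y = c(1, 2, 3, 4))

sd_raw   <- sapply(toy, sd)                # x is NA because of the missing value
sd_clean <- sapply(toy, sd, na.rm = TRUE)  # x is 0 once NAs are ignored

# which() keeps only the TRUE comparisons and skips NA ones
zero_var <- names(sd_clean)[which(sd_clean == 0)]
zero_var
```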

Only one predictor meets this criterion. We then drop this zero-variance feature.

data_train = (data_train[,!(names(data_train) %in% discard_column)])

Title, Desc, and Purpose

Let’s look at the attributes ‘title’ and ‘purpose’.

> table(data_train$purpose)
               car        credit_card debt_consolidation   home_improvement              house 
               424               9163              24604               2785                197 
    major_purchase            medical             moving              other   renewable_energy 
               939                480                281               2340                 31 
    small_business           vacation 
               404                261 
> table(data_train$title)
                                       Business           Car financing Credit card refinancing 
                   3323                     372                     403                    8292 
     Debt consolidation              Green loan             Home buying        Home improvement 
                  22614                      27                     187                    2614 
         Major purchase        Medical expenses   Moving and relocation                   Other 
                    879                     453                     264                    2239 
               Vacation 
                    242 
>

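The variables title and purpose carry largely the same information: title is a differently worded version of the purpose categories, and 3323 of its values are blank. So, we can drop one of them. We will drop title.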

> data_train$title = NULL
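
Before dropping a column as redundant, a cross-tabulation can confirm how closely two columns line up. The sketch below uses a small made-up sample rather than the actual loan data, and the "share of rows in the dominant title" measure is just one possible check, not the course's method.

```r
# Made-up sample mimicking the two columns
toy <- data.frame(
    purpose = c("car", "credit_card", "debt_consolidation", "car", "credit_card"),
    title   = c("Car financing", "Credit card refinancing", "Debt consolidation",
                "Car financing", ""),
    stringsAsFactors = FALSE
)

# Cross-tabulate: if each purpose maps to (almost) one title, the columns are redundant
xt <- table(toy$purpose, toy$title)
xt

# Share of rows that fall in the most common title for their purpose
share <- sum(apply(xt, 1, max)) / sum(xt)
share
```

A share close to 1 suggests that one column adds little beyond the other.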

Let’s look at what we have in the desc column.

> str(data_train$desc)
 chr [1:41909] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" …

As you can see, the column looks mostly empty, so we will drop it as well.

> data_train$desc = NULL
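
Since `str()` only shows the first few values, it can help to measure how empty such a column really is before dropping it. A minimal sketch, using a made-up vector in place of `data_train$desc` and assuming that `NA`, empty, and whitespace-only strings all count as empty:

```r
# Made-up stand-in for data_train$desc
desc <- c("", "", "some text", "", NA, "  ")

# Treat NA, empty strings and whitespace-only strings as empty
is_empty <- is.na(desc) | trimws(desc) == ""
empty_share <- mean(is_empty)   # fraction of empty entries
empty_share
```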

Copyright © 2021 Finance Train. All rights reserved.
