Data Cleaning in R - Part 2


Attributes with Zero Variance

Datasets sometimes contain attributes (predictors) that have near-zero variance, or that take just one value. Such variables have little predictive power. Besides being uninformative, they can also break the model you are trying to fit to your data. This can happen, for example, through division by zero when the data is standardized: a constant column has a standard deviation of zero.
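
The division-by-zero problem is easy to reproduce. The following is a minimal sketch on a made-up data frame (the name `toy` and its values are assumptions for illustration, not part of the real dataset), using base R's `scale()` to standardize the columns:

```r
# Toy data: 'policy_code' is constant, so its standard deviation is 0.
toy <- data.frame(income = c(40, 55, 70), policy_code = c(1, 1, 1))

# scale() centers each column and divides it by its standard deviation.
scaled <- scale(toy)

# The constant column is computed as 0 / 0, which is NaN in every row.
constant_is_nan <- all(is.nan(scaled[, "policy_code"]))
print(scaled)
print(constant_is_nan)
```

Many modelling functions either fail or silently drop rows when they meet `NaN` values like these, which is one reason to remove constant columns before standardizing.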

One quick solution is to remove all predictors that satisfy some threshold criterion related to their variance.
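
As an illustration of what such a threshold criterion might look like, here is one possible sketch. It flags a column when its most frequent value covers more than a chosen share of the rows. The helper name `getDominatedColumns`, the `max_share` parameter, and the toy data are all hypothetical and not part of the original code; raw variance is avoided as the criterion here because it depends on the scale of each column.

```r
# Hypothetical helper: names of columns whose most frequent value covers
# more than `max_share` of the non-missing rows.
getDominatedColumns <- function(df, max_share = 0.95) {
    share <- sapply(df, function(x) {
        counts <- table(x)                  # table() ignores NA by default
        if (length(counts) == 0) return(1)  # all-missing column: treat as constant
        max(counts) / sum(counts)
    })
    names(share)[share > max_share]
}

# Invented data for illustration only.
toy <- data.frame(
    a = c(rep(0, 99), 1),   # 99% zeros: near-zero variance
    b = rep("x", 100),      # a single value: zero variance
    c = 1:100,              # all values distinct
    stringsAsFactors = FALSE
)

flagged <- getDominatedColumns(toy, max_share = 0.95)
print(flagged)
```

Setting `max_share` close to 1 restricts the filter to columns that are almost constant. The rest of this section uses the stricter zero-variance rule instead.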

In our dataset, we will look for predictors that have zero variance and will remove them.

We will first define some generic functions that we will use later.

# Returns the names of the numeric columns in a dataset
getNumericColumns <- function(t) {
    tn <- vapply(t, is.numeric, logical(1))
    return(names(tn)[which(tn)])
}

# Returns the names of the character columns in a dataset
getCharColumns <- function(t) {
    tn <- vapply(t, is.character, logical(1))
    return(names(tn)[which(tn)])
}

# Returns the names of the factor columns in a dataset
getFactorColumns <- function(t) {
    tn <- vapply(t, is.factor, logical(1))
    return(names(tn)[which(tn)])
}

# Returns the positions (indexes) of the given column names in a dataset
getIndexsOfColumns <- function(t, column_names) {
    return(match(column_names, colnames(t)))
}

Now we can find the character columns that hold a single value and the numeric columns with zero variance.

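# Character columns: count the distinct values and keep those with exactly one
tmp = sapply(data_train[getCharColumns(data_train)], function(x) { length(unique(x)) })
tmp = tmp[tmp == 1]

# Numeric columns: keep those whose standard deviation is exactly zero
# (which() skips columns where sd() returns NA because of missing values)
tmp2 = sapply(data_train[getNumericColumns(data_train)], function(x) { sd(x) })
tmp2 = tmp2[which(tmp2 == 0)]

discard_column = c(names(tmp), names(tmp2))

> discard_column
[1] "policy_code"
>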

Only one predictor meets this criterion. We then drop this zero-variance feature.

# drop = FALSE keeps the result a data frame even if only one column remains
data_train = data_train[, !(names(data_train) %in% discard_column), drop = FALSE]
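
To see the whole detect-and-drop step in isolation, here is a small self-contained sketch on invented data. The data frame `toy_train` and its columns only stand in for `data_train`; none of the values come from the real dataset.

```r
# Toy stand-in for data_train; all column names and values are invented.
toy_train <- data.frame(
    loan_amnt   = c(5000, 12000, 8000),
    policy_code = c(1, 1, 1),              # numeric, zero variance
    currency    = c("USD", "USD", "USD"),  # character, single value
    purpose     = c("car", "house", "other"),
    stringsAsFactors = FALSE
)

is_char <- sapply(toy_train, is.character)
is_num  <- sapply(toy_train, is.numeric)

# Character columns with exactly one distinct value.
n_unique <- sapply(toy_train[is_char], function(x) length(unique(x)))
# Numeric columns whose standard deviation is exactly zero.
std_dev <- sapply(toy_train[is_num], sd)

discard <- c(names(n_unique)[n_unique == 1], names(std_dev)[std_dev == 0])
toy_clean <- toy_train[, !(names(toy_train) %in% discard), drop = FALSE]

print(discard)
print(names(toy_clean))
```

Here both `currency` and `policy_code` are discarded, leaving `loan_amnt` and `purpose`.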

Title, Desc, and Purpose

Let’s look at the attributes 'title' and 'purpose'.

> table(data_train$purpose)
               car        credit_card debt_consolidation   home_improvement              house 
               424               9163              24604               2785                197 
    major_purchase            medical             moving              other   renewable_energy 
               939                480                281               2340                 31 
    small_business           vacation 
               404                261 
> table(data_train$title)
                                       Business           Car financing Credit card refinancing 
                   3323                     372                     403                    8292 
     Debt consolidation              Green loan             Home buying        Home improvement 
                  22614                      27                     187                    2614 
         Major purchase        Medical expenses   Moving and relocation                   Other 
                    879                     453                     264                    2239 
               Vacation 
                    242 
>

The variables title and purpose carry largely the same information, so we can drop one of them. We will drop title.

> data_train$title = NULL
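
One way to sanity-check a claim like this before dropping a column is to cross-tabulate the two variables and see how often each purpose maps to a single title. The sketch below uses an invented data frame `toy` in place of `data_train`, and the `agreement` measure is an illustrative choice rather than something from the original analysis.

```r
# Toy stand-in for the two columns; the values are invented for illustration.
toy <- data.frame(
    purpose = c("car", "car", "credit_card", "credit_card", "house"),
    title   = c("Car financing", "Car financing", "Credit card refinancing",
                "", "Home buying"),
    stringsAsFactors = FALSE
)

# Cross-tabulate the two columns: rows are purposes, columns are titles.
xtab <- table(toy$purpose, toy$title)
print(xtab)

# Share of rows that fall in the most common title for their purpose.
agreement <- sum(apply(xtab, 1, max)) / sum(xtab)
print(agreement)
```

An agreement close to 1 suggests that one column is almost fully determined by the other, so that keeping both adds little.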

Let’s look at what we have in the desc column.

1> str(data_train$desc)
2 chr [1:41909] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""3

As you can see, it looks mostly empty, so we will drop this column as well.

> data_train$desc = NULL
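
Rather than judging emptiness by eye from the first few values, the share of empty entries can be measured directly. This is a minimal sketch on an invented vector `desc`. Counting NA and whitespace-only strings as empty is an assumption made here, not something the original states.

```r
# Toy stand-in for the desc column; the values are invented.
desc <- c("", "", "Need funds for a new roof", "", NA, "   ")

# Treat NA, empty strings, and whitespace-only strings as empty.
is_empty <- is.na(desc) | trimws(desc) == ""
empty_share <- mean(is_empty)
print(empty_share)
```

A share close to 1 supports dropping the column. A lower value might instead call for a closer look at the non-empty entries.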