One quick solution is to remove all predictors that satisfy some threshold criterion related to their variance.
In our dataset, we will look for predictors that have zero variance and will remove them.
We will first define some generic functions that we will use later.
# Returns the names of the numeric columns in a dataset
getNumericColumns <- function(t) {
  tn = sapply(t, function(x) { is.numeric(x) })
  return(names(tn)[which(tn)])
}

# Returns the names of the character columns in a dataset
getCharColumns <- function(t) {
  tn = sapply(t, function(x) { is.character(x) })
  return(names(tn)[which(tn)])
}

# Returns the names of the factor columns in a dataset
getFactorColumns <- function(t) {
  tn = sapply(t, function(x) { is.factor(x) })
  return(names(tn)[which(tn)])
}

# Returns the positions (indexes) of the given column names in a dataset
getIndexsOfColumns <- function(t, column_names) {
  return(match(column_names, colnames(t)))
}

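As a quick check of these helpers, the sketch below runs them on a small made-up data frame. The `toy` data frame and its columns are hypothetical and are not part of the loan dataset; the helper definitions are repeated only so that the snippet runs on its own.

```r
# Helper definitions repeated from the text so the snippet is self-contained
getNumericColumns <- function(t) {
  tn <- sapply(t, is.numeric)
  names(tn)[which(tn)]
}
getCharColumns <- function(t) {
  tn <- sapply(t, is.character)
  names(tn)[which(tn)]
}
getFactorColumns <- function(t) {
  tn <- sapply(t, is.factor)
  names(tn)[which(tn)]
}
getIndexsOfColumns <- function(t, column_names) {
  match(column_names, colnames(t))
}

# Hypothetical toy data, for illustration only
toy <- data.frame(
  amount = c(1000, 2500, 4000),
  grade  = factor(c("A", "B", "A")),
  note   = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

num_cols <- getNumericColumns(toy)                   # name of the numeric column
chr_cols <- getCharColumns(toy)                      # name of the character column
fct_cols <- getFactorColumns(toy)                    # name of the factor column
idx <- getIndexsOfColumns(toy, c("note", "amount"))  # positions, in the order requested
```

Note that `getIndexsOfColumns` returns positions in the order the names are passed in, and `NA` for any name that is not a column of the dataset.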
Now we can find the character columns that hold a single distinct value and the numeric columns that have zero variance.
# Character columns that contain a single distinct value
tmp = apply(data_train[getCharColumns(data_train)], 2, function(x) { length(unique(x)) })
tmp = tmp[tmp == 1]

# Numeric columns whose standard deviation is zero
tmp2 = apply(data_train[getNumericColumns(data_train)], 2, function(x) { sd(x) })
tmp2 = tmp2[tmp2 == 0]

discard_column = c(names(tmp), names(tmp2))

> discard_column
[1] "policy_code"

Only one predictor meets this criterion. We then drop this zero-variance feature.
data_train = (data_train[, !(names(data_train) %in% discard_column)])

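The check above treats character and numeric columns separately, and `sd(x)` returns `NA` when a column contains missing values. A minimal alternative sketch, assuming that missing values should simply be ignored, tests every column type in one pass. The helper `findConstantColumns` and the toy data are hypothetical and are not part of the original code.

```r
# Sketch: flag columns that have at most one distinct non-missing value.
# `findConstantColumns` is a hypothetical helper introduced for illustration.
# Under this definition a column that is entirely NA is also flagged.
findConstantColumns <- function(df) {
  is_constant <- vapply(
    df,
    function(x) length(unique(x[!is.na(x)])) <= 1,
    logical(1)
  )
  names(df)[is_constant]
}

# Made-up data; apart from policy_code, the columns are assumptions
toy <- data.frame(
  policy_code = c(1, 1, 1, 1),
  loan_amnt   = c(5000, NA, 7500, 5000),
  term        = c("36 months", "36 months", "60 months", "36 months"),
  currency    = c("USD", "USD", NA, "USD"),
  stringsAsFactors = FALSE
)

discard <- findConstantColumns(toy)

# drop = FALSE keeps the result a data frame even if only one column survives
toy_reduced <- toy[, !(names(toy) %in% discard), drop = FALSE]
```

Counting distinct values rather than comparing a standard deviation with zero works for numeric, character, and factor columns alike.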
Title, Desc, and Purpose
Let’s look at the attributes ‘title’ and ‘purpose’.
> table(data_train$purpose)
               car        credit_card debt_consolidation   home_improvement              house
               424               9163              24604               2785                197
    major_purchase            medical             moving              other   renewable_energy
               939                480                281               2340                 31
    small_business           vacation
               404                261
> table(data_train$title)
                                       Business           Car financing Credit card refinancing
                   3323                     372                     403                    8292
     Debt consolidation              Green loan             Home buying        Home improvement
                  22614                      27                     187                    2614
         Major purchase        Medical expenses   Moving and relocation                   Other
                    879                     453                     264                    2239
               Vacation
                    242

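The variables title and purpose carry essentially the same information, so we can drop one of them. We will drop title, which in addition appears to be blank for 3,323 rows (the first, unlabeled count in the table).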
> data_train$title = NULL
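One way to check that two categorical columns are redundant before dropping one of them is to cross-tabulate them. The sketch below does this on made-up values; the `toy` data frame and the `agreement` measure are assumptions for illustration and were not computed on the loan dataset.

```r
# Hypothetical toy data that mimics the purpose/title pairing
toy <- data.frame(
  purpose = c("car", "car", "credit_card", "credit_card", "house", "car"),
  title   = c("Car financing", "Car financing", "Credit card refinancing",
              "Credit card refinancing", "Home buying", ""),
  stringsAsFactors = FALSE
)

# Rows are purposes, columns are titles
xt <- table(toy$purpose, toy$title)
print(xt)

# Share of rows whose title is the most common title for their purpose;
# a value close to 1 suggests one column adds little beyond the other
agreement <- sum(apply(xt, 1, max)) / sum(xt)
```

On the real data the equivalent call would be `table(data_train$purpose, data_train$title)`, run before the title column is dropped.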
Let’s look at what we have in the desc column.
> str(data_train$desc)
 chr [1:41909] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ...

As you can see, it looks mostly empty, so we will drop it as well.
> data_train$desc = NULL
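The output of `str` only shows the first few values, so "mostly empty" is a visual impression. A minimal sketch for quantifying it, assuming that missing and whitespace-only entries also count as empty, is given below. The helper `emptyShare` and the sample vector are hypothetical.

```r
# Sketch: fraction of entries that are NA, empty, or whitespace-only.
# `emptyShare` is a hypothetical helper, not part of the original code.
emptyShare <- function(x) {
  mean(is.na(x) | trimws(x) == "")
}

# Made-up values for illustration
toy_desc <- c("", "", "Borrower added a note", NA, "  ", "")

share <- emptyShare(toy_desc)  # 5 of the 6 toy entries count as empty
```

Calling `emptyShare(data_train$desc)` before the column is dropped would give the corresponding share for the training data.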