Data Cleaning in R - Part 2

Attributes with Zero Variance

Datasets sometimes contain attributes (predictors) that have near-zero variance or take only a single value. Such variables carry little predictive power. Besides being uninformative, they can also break the model that you are trying to fit to your data. For example, standardizing a constant column divides by a standard deviation of zero.
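To make the division-by-zero problem concrete, here is a minimal sketch on a made-up data frame (the data and the variable names `df`, `standardized` and `constant_is_nan` are assumptions for illustration, not part of our dataset). It standardizes each column by hand and shows what happens to a constant column.

```r
# Toy data frame (hypothetical): `b` is constant, so its variance is zero.
df <- data.frame(a = c(1, 2, 3, 4), b = c(5, 5, 5, 5))

# Standardize each column by hand: (x - mean) / sd.
standardized <- as.data.frame(
    lapply(df, function(x) (x - mean(x)) / sd(x))
)

# For the constant column sd is 0, so every value becomes 0 / 0, i.e. NaN.
constant_is_nan <- all(is.nan(standardized$b))
print(standardized)
print(constant_is_nan)
```

Any model that receives the `b` column after this step would be fed only NaN values, which is one way such predictors end up breaking a fit.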

One quick solution is to remove all predictors whose variance falls at or below a chosen threshold.

In our dataset, we will look for predictors that have zero variance and will remove them.
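As a rough illustration of the threshold idea, the sketch below drops numeric columns whose variance is at or below a cutoff. The function name `removeLowVarianceColumns`, its `threshold` argument and the toy data are hypothetical and are only one possible way to do this. With the default threshold of 0, only zero-variance columns are removed.

```r
# Hypothetical toy data; `constant` has zero variance.
df <- data.frame(
    x = c(1.2, 3.4, 5.6, 7.8),
    constant = c(10, 10, 10, 10),
    label = c("a", "b", "a", "b"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: drop numeric columns whose variance is <= threshold.
# Non-numeric columns are left untouched.
removeLowVarianceColumns <- function(t, threshold = 0) {
    is_num <- vapply(t, is.numeric, logical(1))
    variances <- vapply(t[is_num], var, numeric(1), na.rm = TRUE)
    # Columns whose variance cannot be computed (NA) are kept.
    to_drop <- names(variances)[!is.na(variances) & variances <= threshold]
    t[, !(colnames(t) %in% to_drop), drop = FALSE]
}

cleaned <- removeLowVarianceColumns(df)
print(colnames(cleaned))
```

Raising `threshold` above 0 would also remove near-zero-variance predictors, but a sensible cutoff depends on the scale of each column, so treat the value as something to tune rather than a fixed rule.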

We will first define some generic functions that we will use later.

# Returns the names of the numeric columns in a dataset
getNumericColumns <- function(t) {
    tn <- vapply(t, is.numeric, logical(1))
    return(names(tn)[which(tn)])
}
# Returns the names of the character columns in a dataset
getCharColumns <- function(t) {
    tn <- vapply(t, is.character, logical(1))
    return(names(tn)[which(tn)])
}
# Returns the names of the factor columns in a dataset
getFactorColumns <- function(t) {
    tn <- vapply(t, is.factor, logical(1))
    return(names(tn)[which(tn)])
}
# Returns the positions (indexes) of the given column names in a dataset;
# a name that is not found yields NA
getIndexsOfColumns <- function(t, column_names) {
    return(match(column_names, colnames(t)))
}
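
To see what these helpers return, here is a small usage sketch on a made-up data frame. The data and the single-distinct-value check are assumptions for illustration and not necessarily how the rest of this tutorial proceeds. Compact versions of two helpers are repeated so that the snippet runs on its own.

```r
# Compact versions of two helpers, repeated so this snippet is self-contained.
getNumericColumns <- function(t) {
    tn <- vapply(t, is.numeric, logical(1))
    names(tn)[which(tn)]
}
getIndexsOfColumns <- function(t, column_names) {
    match(column_names, colnames(t))
}

# Hypothetical toy data: `flag` is numeric and holds a single value.
df <- data.frame(
    age = c(23, 35, 41),
    flag = c(1, 1, 1),
    city = c("Pune", "Oslo", "Lima"),
    stringsAsFactors = FALSE
)

numeric_cols <- getNumericColumns(df)                # names of numeric columns
numeric_idx <- getIndexsOfColumns(df, numeric_cols)  # their column positions

# Find the numeric columns that have exactly one distinct value.
single_valued <- numeric_cols[
    vapply(df[numeric_cols], function(x) length(unique(x)) == 1, logical(1))
]

# Drop those columns and keep everything else.
df_clean <- df[, !(colnames(df) %in% single_valued), drop = FALSE]
print(colnames(df_clean))
```

Counting distinct values catches columns that hold a single value, which is the zero-variance case described earlier. It does not flag near-zero-variance columns.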
