# Data Cleaning in R - Part 2

### Attributes with Zero Variance

Datasets sometimes contain attributes (predictors) that have near-zero variance or take only a single value. Such variables carry little or no predictive power. Apart from being uninformative, they can also break the model you are trying to fit. This can happen, for example, through a division by zero when the data is standardized, because standardization divides each value by the column's standard deviation.
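The division-by-zero problem is easy to reproduce. The sketch below uses a small made-up data frame (`toy` and its column names are illustrative, not part of the loan data) and base R's `scale()`, which centers each column and divides it by its standard deviation.

```r
# Toy data: 'constant' has a single value, so its standard deviation is 0
toy <- data.frame(amount = c(100, 250, 400),
                  constant = c(1, 1, 1))

# scale() subtracts the column mean and divides by the column sd
scaled <- scale(toy)

# 'amount' standardizes normally; 'constant' becomes NaN (0 / 0)
print(scaled)
any(is.nan(scaled[, "constant"]))
```

Many model-fitting functions either fail or silently drop rows when they meet `NaN` values like these, which is why such columns are usually removed first.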

One quick solution is to remove all predictors that satisfy some threshold criterion related to their variance.
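As a rough sketch of such a threshold rule, the hypothetical helper below (`lowVarianceColumns` is not part of this lesson's code) returns the numeric columns whose variance is at or below a cutoff. The toy data frame and the cutoff of `1e-3` are assumptions chosen only for illustration.

```r
# Hypothetical helper: names of numeric columns whose variance is <= threshold
lowVarianceColumns <- function(df, threshold = 0) {
  is_num <- sapply(df, is.numeric)
  # var() with na.rm = TRUE ignores missing values
  variances <- sapply(df[is_num], var, na.rm = TRUE)
  # which() drops any NA comparisons instead of propagating them
  names(variances)[which(variances <= threshold)]
}

toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(7, 7, 7, 7),
                  c = c(0.50, 0.50, 0.51, 0.50),
                  d = c("x", "y", "x", "y"))

lowVarianceColumns(toy)                    # exact zero variance only
lowVarianceColumns(toy, threshold = 1e-3)  # also catches near-constant columns
```

A raw variance cutoff depends on the scale of each column, so a suitable threshold has to be chosen with the data in mind. The `caret` package's `nearZeroVar()` function addresses the same problem using frequency ratios rather than a variance cutoff.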

In our dataset, we will look for predictors that have zero variance and will remove them.

We will first define some generic functions that we will use later.

```
# Returns the Numeric columns from a dataset
getNumericColumns<-function(t){
tn = sapply(t,function(x){is.numeric(x)})
return(names(tn)[which(tn)])
}
```

```
# Returns the character columns from a dataset
getCharColumns<-function(t){
tn = sapply(t,function(x){is.character(x)})
return(names(tn)[which(tn)])
}
```

```
# Returns the factor columns in a dataset
getFactorColumns<-function(t){
tn = sapply(t,function(x){is.factor(x)})
return(names(tn)[which(tn)])
}
```

```
# Returns the column indexes for the given column names
getIndexsOfColumns <- function(t,column_names){
return(match(column_names,colnames(t)))
}
```
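The three type helpers above share one pattern: test every column with a predicate and keep the names that pass. The sketch below shows that pattern as a single hypothetical function (`getColumnsOfType` is an illustration, not part of the lesson's code) and tries it on a made-up data frame.

```r
# Hypothetical generalization of the three type helpers above:
# 'predicate' is any function that returns TRUE/FALSE for a column
getColumnsOfType <- function(t, predicate) {
  keep <- vapply(t, predicate, logical(1))
  names(t)[keep]
}

# Toy data with one numeric, one factor and one character column
toy <- data.frame(loan_amnt = c(5000, 12000),
                  grade = factor(c("A", "C")),
                  purpose = c("car", "house"),
                  stringsAsFactors = FALSE)

getColumnsOfType(toy, is.numeric)    # "loan_amnt"
getColumnsOfType(toy, is.factor)     # "grade"
getColumnsOfType(toy, is.character)  # "purpose"

# match() gives the positions of the requested names, as getIndexsOfColumns does
match(c("purpose", "loan_amnt"), colnames(toy))  # 3 1
```

Note that `match()` returns `NA` for any name that is not a column of the data frame, so it is worth checking the result before indexing with it.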

Now we can find the character columns that contain a single value and the numeric columns that have zero variance.

```
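# Character columns with only one distinct value
tmp = sapply(data_train[getCharColumns(data_train)],function(x){length(unique(x))})
tmp = tmp[tmp==1]
# Numeric columns with a standard deviation of zero
tmp2 = sapply(data_train[getNumericColumns(data_train)],function(x){sd(x)})
# which() skips columns whose sd is NA (for example, columns with missing values)
tmp2 = tmp2[which(tmp2==0)]
discard_column = c(names(tmp),names(tmp2))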
> discard_column
[1] "policy_code"
>
```

Only one predictor meets this criterion. We then drop this zero-variance feature.

```
data_train = (data_train[,!(names(data_train) %in% discard_column)])
```
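An alternative sketch, not the approach used in this lesson, counts the distinct non-missing values in every column, which covers numeric, character and factor columns in one pass. The toy data frame below stands in for `data_train`; apart from `policy_code`, its column names are made up.

```r
# Toy stand-in for data_train; 'policy_code' is constant as in the loan data
toy <- data.frame(loan_amnt = c(5000, 12000, 8000),
                  policy_code = c(1, 1, 1),
                  status = c("open", "open", "open"),
                  stringsAsFactors = FALSE)

# Count distinct non-missing values per column, whatever its type
n_distinct <- sapply(toy, function(x) length(unique(x[!is.na(x)])))
discard_column <- names(n_distinct)[n_distinct <= 1]

# drop = FALSE keeps a data frame even if only one column survives
toy_clean <- toy[, !(names(toy) %in% discard_column), drop = FALSE]

discard_column    # "policy_code" "status"
names(toy_clean)  # "loan_amnt"
```

Without `drop = FALSE`, subsetting down to a single column with `[, ...]` returns a plain vector rather than a data frame, which can break later steps that expect column names.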

### Title, Desc, and Purpose

Let’s look at the attributes ‘title’ and ‘purpose’.

```
> table(data_train$purpose)
               car        credit_card debt_consolidation   home_improvement              house
               424               9163              24604               2785                197
    major_purchase            medical             moving              other   renewable_energy
               939                480                281               2340                 31
    small_business           vacation
               404                261
> table(data_train$title)
                                       Business           Car financing Credit card refinancing
                   3323                     372                     403                    8292
     Debt consolidation              Green loan             Home buying        Home improvement
                  22614                      27                     187                    2614
         Major purchase        Medical expenses   Moving and relocation                   Other
                    879                     453                     264                    2239
               Vacation
                    242
>
```

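The variables ‘title’ and ‘purpose’ carry essentially the same information: each title category lines up with a purpose category, and the unlabeled first count in the title table (3323) is the rows where the title is blank. So, we can drop one of them. We will drop ‘title’, since ‘purpose’ has no blank values.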

```
> data_train$title = NULL
```

Let’s look at what we have in the desc column.

```
> str(data_train$desc)
chr [1:41909] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" …
```

As you can see, it looks mostly empty, so we will drop it as well.

```
> data_train$desc = NULL
```
