How to Remove Outliers in R

An outlier is an observation that lies an abnormal distance from the other observations in a dataset. Because this definition is relative, it is the analyst who decides what counts as an outlier in the data they are working with. The analyst inspects the data and defines what is considered normal for an observation. For example, if your dataset contains the heights of men in a city, you may observe that most men are between 5 and 6 feet tall. You might allow some margin and treat anyone taller than 6.5 feet as an outlier.

When analyzing data, it is sometimes important to remove these outliers because they can skew the statistics you’re calculating. In this article, we will learn how to identify and remove outliers in R.

How to Identify Outliers in R

There are two common ways to identify outliers: one uses the interquartile range (IQR) and the other uses z-scores.

Let’s look at these two methods:

Interquartile Range

The interquartile range (IQR) is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile). It measures the spread of the middle 50% of the values in the dataset. As a rule of thumb, a value is identified as an outlier if it lies more than 1.5 times the IQR above the 3rd quartile or more than 1.5 times the IQR below the 1st quartile.

Outlier = Value above Q3 + 1.5 * IQR or value below Q1 - 1.5 * IQR
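As a minimal sketch of this rule, assume a small made-up vector x (the values are purely illustrative and not part of this article's dataset). The code computes the two cut-offs, often called fences, and lists the values that fall outside them.

```r
# Made-up example values; 30 is deliberately far from the rest
x <- c(2, 4, 5, 6, 7, 8, 9, 30)

# Quartiles (names = FALSE drops the "25%" / "75%" labels)
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

outliers <- x[x < lower | x > upper]
print(c(lower = lower, upper = upper))
print(outliers)
```

With R's default quantile method, only the value 30 lies outside the fences in this example.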

Z-score

We can also identify outliers using z-scores. A z-score tells us how many standard deviations an observation lies above or below the mean. For example, a z-score of 2 indicates that the observation is 2 standard deviations above the mean, while a z-score of -2 indicates that it is 2 standard deviations below the mean.

The rule of thumb is that an observation is an outlier if it has a z-score less than -3 or greater than 3.
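The rule can be sketched on a small made-up vector (the values are illustrative assumptions, not the article's dataset). Note that the sample needs a reasonable number of observations: in a very small sample, no z-score can reach 3 regardless of how extreme a value is.

```r
# Made-up example: thirty values near 10 plus one extreme value (40)
x <- c(rep(c(9, 10, 11), times = 10), 40)

# z-score: distance from the mean, measured in standard deviations
z <- (x - mean(x)) / sd(x)

# Positions and values of observations with |z| > 3
outlier_idx <- which(abs(z) > 3)
print(outlier_idx)
print(x[outlier_idx])
```

Only the last observation, 40, is flagged here; the remaining values all have z-scores well inside the range -3 to 3.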

How to Remove Outliers in R

Now that we know how to identify outliers, let’s learn about how we can remove these outliers from a dataset in R.

For our example, let’s create a data frame that contains some random data.

    # set seed to 0 so that we can reproduce the results of this example
    set.seed(0)

    # create a data frame with one column containing 1000 observations
    data <- data.frame(X=rnorm(1000, mean=20, sd=3))

The rnorm function generates a vector of normally distributed random numbers. In this example it will generate 1000 values with a mean of 20 and a standard deviation of 3.
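As a quick sanity check, the sample statistics of the generated values should be close to the parameters passed to rnorm, although they will not match them exactly. A small sketch, using a standalone vector x introduced here only for illustration:

```r
# Same seed and parameters as the example data frame
set.seed(0)
x <- rnorm(1000, mean = 20, sd = 3)

# The sample mean and standard deviation approximate 20 and 3
length(x)
mean(x)
sd(x)

# Five-number summary plus the mean
summary(x)
```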

You can check the first few rows of the data frame using the head function.

    head(data)
    #>          X
    #> 1 23.78886
    #> 2 19.02130
    #> 3 23.98940
    #> 4 23.81729
    #> 5 21.24392
    #> 6 15.38015

This will give you an idea of the kind of values we have in the dataset.

Now let’s use the two methods to remove the outliers from this dataset.

Remove Outliers using Interquartile Range Method

To identify the outliers, we will first calculate Q1, Q3 and the IQR, and then check whether each value is above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR.

R has built-in functions for calculating quartiles and the IQR.

    # Calculate Q1, Q3 and IQR
    Q1 <- quantile(data$X, .25)
    Q3 <- quantile(data$X, .75)
    IQR <- IQR(data$X)

Now that we have these statistics, we can apply our rule and create a new data frame that contains only the values inside our desired range.

    # Keep only the rows whose value lies within 1.5*IQR of Q1 and Q3
    new_data <- subset(data, X > (Q1 - 1.5*IQR) & X < (Q3 + 1.5*IQR))

You can use the dim function on this new dataset to see how many rows it has left.

    dim(new_data)
    #> [1] 994   1

As you can see, it has 994 observations left, which means that the other 6 observations were identified as outliers and removed.
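If you need to apply the same filter to several columns or datasets, the steps can be wrapped in a function. The sketch below is one possible way to do this; remove_outliers_iqr and its parameter k are hypothetical names introduced here, not built-in R functions, and the code assumes the column contains no missing values.

```r
# Hypothetical helper: keep the rows of `df` whose value in `column`
# lies strictly between Q1 - k*IQR and Q3 + k*IQR.
# Assumes the column is numeric and has no NA values.
remove_outliers_iqr <- function(df, column, k = 1.5) {
  values <- df[[column]]
  q1 <- quantile(values, 0.25, names = FALSE)
  q3 <- quantile(values, 0.75, names = FALSE)
  iqr <- q3 - q1
  keep <- values > (q1 - k * iqr) & values < (q3 + k * iqr)
  df[keep, , drop = FALSE]  # drop = FALSE keeps the result a data frame
}

# Usage with the same kind of data as in this article
set.seed(0)
data <- data.frame(X = rnorm(1000, mean = 20, sd = 3))
new_data <- remove_outliers_iqr(data, "X")
dim(new_data)
```

Exposing the multiplier as k makes it easy to use a stricter or looser rule, for example k = 3 to remove only very extreme values.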

Remove Outliers using Z-score Method

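We calculate the z-score of a value x using the following formula, where mean and sd are the mean and standard deviation of all the values in the dataset:

z = (x - mean) / sd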

To identify and remove outliers using the z-score, we need to calculate the z-score of each value in our dataset, and then remove the values whose z-score is above 3 or below -3.

In R, we can calculate the absolute values of z-scores for each observation and then remove the rows that have absolute z-score > 3.

    # set seed to 0 so that the results are reproducible
    set.seed(0)

    # create a data frame with random values
    data <- data.frame(X=rnorm(1000, mean=20, sd=3))

    # preview the data
    head(data)

    # add a new column to the data frame containing the absolute z-score
    data$zscore <- abs(data$X - mean(data$X)) / sd(data$X)

    # check the data again; it should now have two columns: X and zscore
    head(data)
    #>          X    zscore
    #> 1 23.78886 1.2813403
    #> 2 19.02130 0.3110243
    #> 3 23.98940 1.3483190
    #> 4 23.81729 1.2908343
    #> 5 21.24392 0.4313316
    #> 6 15.38015 1.5271674

    # create a new data frame that contains only those rows
    # that have an absolute z-score below 3
    new_data <- subset(data, zscore < 3)

    # check the dimensions of the new dataset
    dim(new_data)
    #> [1] 997   2

As you can see, 997 observations remain, which means that 3 outliers have been removed from the data.
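The z-score filter can also be written as a function that does not add a helper column to the data frame. This is only a sketch: remove_outliers_z and its threshold parameter are hypothetical names introduced here, and the code assumes a numeric column without missing values.

```r
# Hypothetical helper: keep the rows of `df` whose value in `column`
# has an absolute z-score below `threshold`.
# Assumes the column is numeric and has no NA values.
remove_outliers_z <- function(df, column, threshold = 3) {
  values <- df[[column]]
  z <- (values - mean(values)) / sd(values)
  df[abs(z) < threshold, , drop = FALSE]  # drop = FALSE keeps a data frame
}

# Usage with the same kind of data as in this article
set.seed(0)
data <- data.frame(X = rnorm(1000, mean = 20, sd = 3))
new_data <- remove_outliers_z(data, "X")
dim(new_data)
```

Because the z-scores are computed inside the function, the returned data frame keeps only the original columns.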

Outliers can enter your data for various reasons, such as human error or genuine abnormalities, for example extreme spikes in stock market data. Examine these outliers carefully and remove them if you think they could distort your analysis.
