Handling Missing Values in Time Series

In the examples we saw earlier, we had good quality data with all values available for all time indexes. However, in real life, the data may contain missing values which will influence our analysis. Depending on the nature of data, we may choose to ignore missing values. However, in some cases it might be more suitable to estimate and fill the missing values. Data scientists use various techniques to estimate missing values. One common technique is to take the mean of the time series and replace NA with the mean value. Depending on the data this may or may not be suitable. For example, if the data is about loan borrowers and there are missing values in the loan interest rate, then the data scientist may decide to use the average interest rates for missing values, or if he clearly sees a pattern such as the interest rates being higher for self-employed individuals compared to salaries individuals, then the data scientist may decide to fill the missing values with means based on the categories based on their employment status.

Handling Missing Values in R

We will use our GDP data example to understand how we can estimate and fill missing values in R. Since we don't have access to a real dataset with missing values, we will make one. We will create a copy of our GDP_data dataset and then deliberately turn some values to NA. This is done by the following code:

> GDP_mod <-GDP_data
> GDP_mod[7] <- NA
> GDP_mod
Qtr1    Qtr2    Qtr3    Qtr4
2014 17025.2 17285.6 17569.4 17692.2
2015 17783.6 17998.3      NA 18222.8
2016 18281.6 18450.1 18675.3 18869.4
>


As you can see, we have made the GDP value for 2015 Q3 blank.

Plotting the Time Series

We can plot the new time series using the plot.ts() function.

> plot.ts(GDP_mod)
>


As you can see the value of 2015 Q3 is missing.

Calculate GDP Mean

We will fill the missing value with the average of GDP over the three years. In R, we can calculate the mean using the mean() function. However, in this case, the mean() function will fail because of the presence of missing values. We can remove the missing values using the parameter na.rm=TRUE to calculate the mean by removing all missing values.

> #Calculating mean() will fail and return NA
> mean(GDP_mod)
[1] NA
> #Calculate mean by removing all missing values
> mean(GDP_mod,na.rm=TRUE)
[1] 17986.68
>


Replace NA's with Mean

Now that we have the mean of the series, we can replace the missing values with the mean value as shown below:

> #Replace missing values with mean
> GDP_mod[7] <- mean(GDP_mod, na.rm = TRUE)
> print the series and notice that the missing value is now filled.
> print(GDP_mod)
Qtr1     Qtr2     Qtr3     Qtr4
2014 17025.20 17285.60 17569.40 17692.20
2015 17783.60 17998.30 17986.68 18222.80
2016 18281.60 18450.10 18675.30 18869.40
>


Plot Both Original and the Modified Series

We will now plot both the original GDP_data and the modified GDP_mod time series to see how well the mean estimates the original value.

> plot(GDP_data)
> points(GDP_mod, type = "l", col = 2, lty = 3)
>


In the below chart, the black line represents the original data and the red dotted line represents the modified data. As we can see, the mean is not really a good indicator of the actual GDP in that quarter.