Check If Data Is Normally Distributed Using R - QQ Plots

The first step in checking whether your data is normally distributed is to plot a histogram and observe its shape. If it looks bell-shaped and symmetric around the mean, the data may be approximately normally distributed. However, using histograms to assess normality can be problematic, especially with a small dataset, because the apparent shape depends heavily on the number and placement of the bins.

A better way to check if your data is normally distributed is to create quantile-quantile (QQ) plots which can easily be created in R or Python.

QQ Plots

The idea of a quantile-quantile plot is to compare the distribution of two datasets. It is done by matching a common set of quantiles in the two datasets. 

In R, a QQ plot can be constructed using the qqplot() function, which takes two datasets as its arguments. When you create a QQ plot, this is what happens. First, the data in both datasets is sorted. These sorted values are then plotted against each other in a scatter chart (if the datasets differ in size, the larger one is interpolated down to the size of the smaller one). This is the QQ plot. A 45-degree reference line is often added, for example with abline(0, 1), to make the interpretation easier.
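The sort-and-plot procedure can be sketched in a few lines of base R. The two samples `a` and `b` below are simulated, hypothetical data of equal size, chosen only to illustrate the mechanics.

```r
set.seed(42)                 # make the simulated samples reproducible
a <- rnorm(200)              # hypothetical sample 1
b <- rnorm(200, mean = 0.5)  # hypothetical sample 2, same size as a

# Manual version: sort both samples and plot them against each other
plot(sort(a), sort(b), xlab = "a (sorted)", ylab = "b (sorted)",
     main = "Manual QQ plot")
abline(0, 1, col = "red")    # 45-degree reference line

# Built-in version: qqplot() returns the plotted coordinates invisibly
qq <- qqplot(a, b, main = "qqplot()")
abline(0, 1, col = "red")
```

For samples of equal size, the coordinates returned by qqplot() are simply the two sorted vectors, so both plots show the same points.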

In finance, qq plots are used to determine if the distribution of returns is normal. They are also used to detect fat tails of the distribution. 

To check for normality, instead of comparing two sample datasets, you compare your returns dataset with a theoretical sample that is normally distributed. To do so, you can first create a normally distributed sample dataset and use the qqplot() function to create the QQ plot of the two datasets. Alternatively, you can use a special function called qqnorm().
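As a rough sketch of the first approach, the example below compares a sample against a simulated normal sample with qqplot(). The vector `returns_sim` is a hypothetical stand-in for real returns, drawn from a fat-tailed t distribution, and matching the mean and standard deviation of the normal sample is an assumption made here so that both axes share the same scale.

```r
set.seed(1)
# Hypothetical stand-in for a returns series (fat-tailed, t distribution)
returns_sim <- rt(500, df = 3) * 0.01

# Simulated normal sample with the same size, mean and standard deviation
normal_sample <- rnorm(length(returns_sim),
                       mean = mean(returns_sim),
                       sd   = sd(returns_sim))

# Compare the two samples; points far from the line indicate non-normality
qq <- qqplot(normal_sample, returns_sim,
             xlab = "Simulated normal sample",
             ylab = "Returns (simulated)",
             main = "Returns vs. normal sample")
abline(0, 1, col = "red")   # 45-degree reference line
```

Because the normal sample is itself random, the plot changes slightly with each seed, which is one reason to prefer theoretical quantiles when they are available.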

The qqnorm() function in R compares sample data (in this case, returns) against the quantiles of a theoretical normal distribution. The sample you want to plot goes in as the first argument of qqnorm(). With this function you can see how closely a sample follows a theoretical normal distribution. It serves as a visual counterpart to a formal normality test.

QQ stands for quantile-quantile. It means that the quantiles of your data are compared with the quantiles of a normal distribution (in the qqnorm() function) using a scatter plot. A quantile is the value below which a given fraction of the data falls. For example, the 0.4 (or 40%) quantile is the point below which 40% of the data falls and above which 60% falls.
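The definition can be checked on a small example with R's quantile() function. The vector `x` is hypothetical data chosen for illustration, and the result relies on quantile()'s default interpolation method.

```r
# Hypothetical data, already sorted for readability
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12, 15)

# The 0.4 quantile: the value below which 40% of the data falls
q40 <- quantile(x, probs = 0.4)
q40                            # 6.2 with the default method

# Fraction of observations below that value
frac_below <- mean(x < q40)
frac_below                     # 0.4

# The same quantile of the standard normal distribution, for comparison
qnorm(0.4)                     # about -0.25
```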

The qqline() function is used in conjunction with qqnorm() to add a theoretical reference line to the plot. By default, this line passes through the first and third quartiles of the sample and of the normal distribution. If most of the points of the sample data fall along this line, it is plausible that your sample data is normally distributed. When the points diverge significantly from the line, particularly at the ends, the sample data does not follow a normal distribution.
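To get a feel for how the plot should be read, the sketch below draws qqnorm() and qqline() for a simulated normal sample and for a simulated fat-tailed sample. Both samples are hypothetical. The last lines recompute by hand the line that qqline() draws with its default settings, which passes through the first and third quartiles.

```r
set.seed(123)
normal_data <- rnorm(1000)        # hypothetical normal sample
fat_tailed  <- rt(1000, df = 3)   # hypothetical fat-tailed sample

par(mfrow = c(1, 2))              # two plots side by side
qqnorm(normal_data, main = "Normal sample")
qqline(normal_data, col = "red")
qqnorm(fat_tailed, main = "Fat-tailed sample")
qqline(fat_tailed, col = "red")
par(mfrow = c(1, 1))              # reset the plotting layout

# Reference line through the first and third quartiles (qqline() default)
y_q <- quantile(fat_tailed, c(0.25, 0.75))   # sample quartiles
x_q <- qnorm(c(0.25, 0.75))                  # normal quartiles
slope     <- unname(diff(y_q) / diff(x_q))
intercept <- unname(y_q[1] - slope * x_q[1])
```

In the normal sample the points hug the line across the whole range, while in the fat-tailed sample the points bend away from it at both ends.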

As an exploratory task, we will use historical futures price data for WTI Crude Oil and create the QQ plot and the histogram of the returns computed from the Last column of the data frame.

Prepare Data

The first thing we need is the data. We will use the Quandl() function to download data for WTI Crude Oil. The data contains Open, Close, Low, High, Last, Volume, and other columns. We will use the Last price column and calculate the returns from these prices. The code for preparing the data is shown below:

```
# Get historical futures prices: Crude Oil futures from Quandl.
# Contract code is CL.

library(Quandl) # If not loaded
library(dplyr)  # If not loaded

CME_CL_Data <- Quandl('CHRIS/CME_CL1')

# Order the data by date so that the oldest observation comes first and the
# latest comes last. Sorting on the date (rather than just reversing the rows)
# also guarantees that the dates are in sequence.
# as.Date() leaves the column unchanged if it is already of class Date.

CME_CL_Data_ <- CME_CL_Data %>%
  mutate(date = as.Date(Date)) %>%
  arrange(date)

# Calculate the log returns. This adds a returns column to the dataset.
# The first row has no previous price, so its return is NA.

CME_CL_Data_$returns <- c(NA, diff(log(CME_CL_Data_$Last)))
```

Our returns data is now ready, and we can proceed with creating the QQ plot and the histogram.

```
# Plot two graphs in a single figure. par(mfrow = c(2, 1)) sets up the
# plotting area as a 2x1 array: 2 rows and 1 column.

par(mfrow = c(2, 1))

# Define the `returns` vector with the values of the returns column from CME_CL_Data_

returns <- CME_CL_Data_$returns

# Compare the quantiles of the returns with the quantiles of a normal distribution.
# qqnorm() plots the points and qqline() adds the theoretical reference line.

qqnorm(returns, main = "CL Returns")

qqline(returns, col = "red")

# Generate a histogram of the returns

ret_hist <- hist(returns, breaks = 50, col = 'red')
```

The created graphs look as follows:

Both the QQ plot and the histogram show that the returns of the CL contract are far from normally distributed: the histogram has fat tails on both the left and right sides, and the points in the QQ plot deviate from the theoretical quantile line. The histogram shows a leptokurtic shape, with a sharp peak and fat tails.
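A leptokurtic shape can also be expressed as a number: the excess kurtosis, which is close to zero for normal data and positive for fat-tailed data. The sketch below is only illustrative. The helper `excess_kurtosis()` is a hypothetical function written for this example (it is not part of base R), and the two samples are simulated rather than taken from the CL data.

```r
# Hypothetical helper: sample excess kurtosis (fourth standardized moment minus 3)
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]              # drop missing values such as the first return
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

set.seed(7)
k_normal <- excess_kurtosis(rnorm(10000))        # simulated normal sample
k_fat    <- excess_kurtosis(rt(10000, df = 5))   # simulated fat-tailed sample

k_normal   # close to 0
k_fat      # clearly positive
```

Applying the same helper to the `returns` vector would give a rough numerical counterpart to what the histogram and the QQ plot show visually.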