# Check If Data Is Normally Distributed Using R - QQ Plots

The first step in checking whether your data is normally distributed is to plot a histogram and observe its shape. If it looks bell-shaped and symmetric around the mean, the data may well be normally distributed. However, histograms can be a problematic way to assess normality, especially if you have a small dataset, because the apparent shape depends heavily on the choice of bins.

A better way to check whether your data is normally distributed is to create a quantile-quantile (QQ) plot, which is easy to do in R or Python.

### QQ Plots

The idea of a quantile-quantile plot is to compare the distribution of two datasets. It is done by matching a common set of quantiles in the two datasets.

In R, a QQ plot can be constructed using the qqplot() function, which takes two datasets as its arguments. When you create a QQ plot, this is what happens: first, the data in each dataset is sorted. The sorted values are then plotted against each other in a scatter plot (if the datasets differ in size, the larger one is interpolated down to the size of the smaller). This is the QQ plot. A reference line is usually added to make interpretation easier; qqplot() does not draw it for you, so add it with abline() or, for a normal QQ plot, qqline().
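The mechanics described above can be sketched with two simulated samples. The names `sample_a` and `sample_b` and the distribution parameters are illustrative assumptions, not part of the crude oil example. With samples of equal size, the coordinates returned by qqplot() are simply the two sorted vectors, and the reference line is added separately.

```r
set.seed(42)  # make the simulated samples reproducible

# Two hypothetical samples of equal size (assumed for illustration)
sample_a <- rnorm(200, mean = 0, sd = 1)
sample_b <- rnorm(200, mean = 0, sd = 1)

# qqplot() draws the scatter plot and invisibly returns the plotted coordinates
qq <- qqplot(sample_a, sample_b,
             xlab = "Quantiles of sample_a",
             ylab = "Quantiles of sample_b",
             main = "QQ plot of two samples")

# Add the y = x reference line by hand
abline(a = 0, b = 1, col = "red")

# For equal-sized samples the plotted coordinates are just the sorted data
same_x <- isTRUE(all.equal(qq$x, sort(sample_a)))
same_y <- isTRUE(all.equal(qq$y, sort(sample_b)))
```

Because both samples come from the same distribution, the points should lie close to the red line, with some scatter at the extremes.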

In finance, qq plots are used to determine if the distribution of returns is normal. They are also used to detect fat tails of the distribution.

To check for normality, instead of comparing two sample datasets, you compare your returns dataset with a theoretical sample that is normally distributed. To do so, you can first create a normally distributed sample dataset and use the qqplot() function to create the QQ plot of the two datasets. Alternatively, you can use a dedicated function called qqnorm().

The qqnorm() function in R compares sample data (in this case, returns) against the quantiles of a theoretical normal distribution. The sample you want to plot goes in as the first argument of qqnorm(). The plot lets you see how closely the sample follows a normal distribution; think of it as a visual counterpart to a formal normality test.
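A minimal sketch of the two approaches, run on simulated data rather than real returns. The vector `returns`, its size, and its standard deviation are assumptions made for illustration only.

```r
set.seed(1)

# Hypothetical daily returns (simulated here; substitute your own series)
returns <- rnorm(500, mean = 0, sd = 0.02)

# Option 1: build a normal sample with the same mean and sd, then use qqplot()
normal_sample <- rnorm(length(returns), mean = mean(returns), sd = sd(returns))
qq_two <- qqplot(normal_sample, returns,
                 xlab = "Simulated normal sample",
                 ylab = "Returns",
                 main = "qqplot(): returns vs. normal sample")

# Option 2: let qqnorm() compare against theoretical normal quantiles directly
qq_norm <- qqnorm(returns, main = "qqnorm(): returns vs. normal quantiles")
```

The second option avoids the extra sampling noise that a simulated reference sample introduces, which is one reason qqnorm() is usually the more convenient choice.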

QQ stands for quantile-quantile: the quantiles of your data are compared with the quantiles of a normal distribution (in the qqnorm() function) using a scatter plot. A quantile is the value below which a given fraction of the data falls. For example, the 0.4 (or 40%) quantile is the point below which 40% of the data fall and above which 60% fall.
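To make the definition concrete, here is a small sketch using base R's quantile() and qnorm() on a simulated sample. The vector `x` is an assumption for illustration.

```r
set.seed(7)

# Hypothetical sample of 1000 standard normal values
x <- rnorm(1000)

# Empirical 0.4 quantile: the value below which about 40% of the sample falls
q40 <- quantile(x, probs = 0.4)

# Fraction of observations below that value (should be close to 0.4)
frac_below <- mean(x < q40)

# Theoretical 0.4 quantile of the standard normal distribution (about -0.25)
q40_theoretical <- qnorm(0.4)
```

A normal QQ plot repeats this comparison for every point in the sample: each sorted observation is paired with the theoretical quantile at the matching probability.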

The qqline() function is used in conjunction with qqnorm() to add a theoretical reference line to the plot. By default, the line passes through the first and third quartiles of the sample and of the normal distribution, so it is a 45 degree line only if the data are on the same scale as the standard normal distribution. If most of the points fall along this line, it is plausible that your sample is normally distributed. If the points depart significantly from the line, particularly in the tails, the sample does not follow a normal distribution.

As an exploratory task, we will use historical futures price data for WTI Crude Oil and plot the quantiles and the histogram of the returns computed from the Last column of the data frame.
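The following sketch contrasts a normal sample with a fat-tailed one. Both samples are simulated, and using a Student's t distribution with 3 degrees of freedom as the fat-tailed example is an assumption chosen for illustration.

```r
set.seed(123)

# Hypothetical samples: one normal, one with heavier tails
normal_data <- rnorm(500)
fat_tailed  <- rt(500, df = 3)  # Student's t with 3 degrees of freedom

# Show the two QQ plots side by side, then restore the plotting settings
old_par <- par(mfrow = c(1, 2))

qqnorm(normal_data, main = "Normal sample")
qqline(normal_data, col = "red")  # line through the 1st and 3rd quartiles

qqnorm(fat_tailed, main = "Fat-tailed sample")
qqline(fat_tailed, col = "red")

par(old_par)
```

In the left panel the points should hug the line. In the right panel the points typically bend away from the line at both ends, below it on the left and above it on the right, which is the usual signature of fat tails.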

### Prepare Data

The first thing we need is the data. We will use the Quandl() function from the Quandl package to download data for WTI Crude Oil. The data contains Open, Close, Low, High, Last, Volume, etc. We will use the Last price column and calculate the returns from these prices. The code for preparing the data is shown below:

# Get historical futures prices: Crude Oil futures from Quandl.
# Contract code is CL.

library(Quandl) # If not loaded

CME_CL_Data <- Quandl('CHRIS/CME_CL1')

# Quandl returns the newest observations first. Sort the data frame by date so that the oldest data comes first and the latest data is at the bottom. We use the arrange command from dplyr to achieve this.

library(dplyr) # If not loaded

# The new data frame CME_CL_Data_ starts from the oldest data of the CL continuous futures series from Quandl.
# Sort on the Date column rather than on reversed row names: row names are character strings, so ordering by them is alphabetical and can leave the dates out of sequence.

CME_CL_Data_ <- CME_CL_Data %>%
  mutate(date = as.Date(Date)) %>%
  arrange(date)

# Calculate the log returns. The following command adds a returns column to the dataset; the first value is NA because the first price has no predecessor.

CME_CL_Data_$returns <- c(NA, diff(log(CME_CL_Data_$Last)))
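The log-return calculation can be checked on a tiny example. The `prices` vector below is made up for illustration and is not taken from the Quandl data.

```r
# Hypothetical price series (not real crude oil prices)
prices <- c(50.0, 51.0, 49.5, 50.5)

# Log returns: difference of consecutive log prices, padded with a leading NA
log_returns <- c(NA, diff(log(prices)))

# Log returns are additive: their sum equals the log of last price / first price
total_log_return <- sum(log_returns, na.rm = TRUE)
matches_total <- isTRUE(all.equal(total_log_return,
                                  log(prices[length(prices)] / prices[1])))
```

Padding with a real `NA` rather than the string `'NA'` keeps the vector numeric from the start, so no coercion is needed.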

Our returns data is now ready and we can proceed with the creation of the QQ plot and histogram.

# We will plot two graphs in a single figure. par(mfrow=c(2, 1)) splits the plotting area into a 2x1 array: 2 rows and 1 column.

par(mfrow=c(2, 1))