Finance Train

High Quality tutorials for finance, risk, data science

Check If Data Is Normally Distributed Using R – QQ Plots

Data Science, Statistics

The first step in checking whether your data is normally distributed is to plot a histogram and observe its shape. If it looks bell-shaped and symmetric around the mean, the data may be approximately normal. However, histograms can be misleading, especially with a small dataset, because their shape depends heavily on the number and placement of the bins.

A better way to check if your data is normally distributed is to create quantile-quantile (QQ) plots which can easily be created in R or Python.

QQ Plots

The idea of a quantile-quantile plot is to compare the distributions of two datasets. It is done by matching a common set of quantiles in the two datasets.

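In R, a QQ plot can be constructed using the qqplot() function, which takes two datasets as its arguments. When you create a QQ plot, this is what happens. First, the data in each dataset is sorted. These sorted values are then plotted against each other in a scatter chart (if the datasets differ in size, the quantiles of the larger one are interpolated to match the smaller one). This is the QQ plot. A 45 degree reference line is usually added, for example with abline(0, 1), to make the interpretation easier; qqplot() does not draw this line on its own.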
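The steps described above can be sketched by hand and compared with the built-in function. This is a minimal illustration on simulated data; the samples x and y are made up for the example and are not part of the crude oil dataset used later.

```r
# Two simulated samples of equal size (illustrative data only)
set.seed(42)
x <- rnorm(200)
y <- rnorm(200)

# Manual QQ plot: sort both samples and plot them against each other
plot(sort(x), sort(y),
     xlab = "Quantiles of x", ylab = "Quantiles of y",
     main = "Manual QQ plot")

# Add the 45 degree reference line y = x
abline(a = 0, b = 1, col = "red")

# The built-in function plots the same points and returns them invisibly
qq <- qqplot(x, y, main = "qqplot()")
abline(a = 0, b = 1, col = "red")
```

Because both samples come from the same distribution, the points should lie close to the reference line.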

In finance, qq plots are used to determine if the distribution of returns is normal. They are also used to detect fat tails of the distribution. 

To check for normality, instead of comparing two sample datasets, you compare your returns dataset with a theoretical sample that is normally distributed. To do so, you can first create a normally distributed sample dataset and use the qqplot() function to create the QQ plot of the two datasets. Alternatively, you can use a dedicated function called qqnorm().
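The two approaches can be placed side by side. The sketch below uses a simulated, heavy-tailed series named sample_returns as a stand-in for real returns; the name and the Student t distribution used to generate it are assumptions made only for this illustration.

```r
# Hypothetical heavy-tailed "returns" (simulated, not market data)
set.seed(1)
sample_returns <- rt(500, df = 3) * 0.01

# Option 1: simulate a normal sample with the same mean and standard
# deviation, then compare the two samples with qqplot()
normal_sample <- rnorm(length(sample_returns),
                       mean = mean(sample_returns),
                       sd = sd(sample_returns))

par(mfrow = c(1, 2))
qqplot(normal_sample, sample_returns,
       xlab = "Simulated normal sample", ylab = "Sample returns",
       main = "qqplot()")
abline(a = 0, b = 1, col = "red")

# Option 2: let qqnorm() supply the theoretical normal quantiles
qqnorm(sample_returns, main = "qqnorm()")
qqline(sample_returns, col = "red")
```

The second option avoids the extra randomness of a simulated reference sample, which is one reason qqnorm() is usually the more convenient choice.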

The qqnorm() function in R compares a sample (in this case, returns) against the quantiles of a normal distribution. The sample you want to plot goes in as the first argument of the qqnorm() function. Using this function, you can observe how closely a sample follows a theoretical normal distribution. It is the visual counterpart of a formal normality test.
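To show how the visual check relates to a formal test, the sketch below pairs qqnorm() with shapiro.test() from base R. The Shapiro-Wilk test is not used in the original tutorial; it is chosen here only as one example of a normality test, and both samples are simulated.

```r
# Simulated samples: one normal, one clearly skewed (illustrative only)
set.seed(10)
normal_sample <- rnorm(200)
skewed_sample <- rexp(200)

# Formal test: a small p-value is evidence against normality
test_normal <- shapiro.test(normal_sample)
test_skewed <- shapiro.test(skewed_sample)
print(test_normal$p.value)
print(test_skewed$p.value)

# Visual check of the same two samples
par(mfrow = c(1, 2))
qqnorm(normal_sample, main = "Normal sample")
qqline(normal_sample, col = "red")
qqnorm(skewed_sample, main = "Skewed sample")
qqline(skewed_sample, col = "red")
```

Note that shapiro.test() only accepts between 3 and 5000 observations, so a long returns series may need to be subsampled before testing.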

QQ stands for quantile-quantile: the quantiles of your data are compared with the quantiles of a normal distribution (in the qqnorm() function) using a scatter plot. A quantile is the value below which a given fraction of the data falls. For example, the 0.4 (or 40%) quantile is the point below which 40% of the data fall and above which 60% fall.
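A small worked example may make the definition concrete. The vector x below is made up for illustration; quantile() is used with its default interpolation method.

```r
# Ten illustrative observations, already sorted
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12, 15)

# The 0.4 quantile of the sample
q40 <- quantile(x, probs = 0.4, names = FALSE)
print(q40)            # 6.2

# Fraction of observations at or below that value
print(mean(x <= q40)) # 0.4

# The corresponding theoretical quantile of the standard normal distribution
print(qnorm(0.4))
```

A normal QQ plot pairs each sample quantile with the theoretical quantile for the same probability, which is the role qnorm() plays here.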

The qqline() function is used in conjunction with qqnorm() to add a theoretical reference line to the plot. By default, the line passes through the first and third quartiles of the sample and of the normal distribution; it is a 45 degree line only when the sample is on the same scale as the standard normal distribution. If most of the points fall along this line, it is likely that your sample data is normally distributed. When the points diverge significantly from the line, especially in the tails, the sample data does not follow a normal distribution.
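The reference line can be reproduced by hand to see what qqline() draws by default. This is a sketch on simulated data with an arbitrarily chosen mean of 2 and standard deviation of 3; the variable names are illustrative.

```r
# Simulated normal sample that is not on the standard normal scale
set.seed(7)
x <- rnorm(300, mean = 2, sd = 3)

qqnorm(x, main = "qqnorm() with qqline()")
qqline(x, col = "red")

# Rebuild the default line: it joins the first and third quartiles of
# the sample with the same quartiles of the standard normal distribution
sample_q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
normal_q <- qnorm(c(0.25, 0.75))
slope <- diff(sample_q) / diff(normal_q)
intercept <- sample_q[1] - slope * normal_q[1]

# Overlay the hand-built line; it should coincide with the red one
abline(a = intercept, b = slope, col = "blue", lty = 2)
```

For a normal sample, the slope is a rough estimate of the standard deviation and the intercept a rough estimate of the mean, which is why the line is not at 45 degrees here.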

As an exploratory task, we will use the historical futures price data of WTI Crude Oil and plot the quantiles and the histogram of the returns calculated from the Last column of the data frame.

Prepare Data

The first thing we need is the data. We will use the Quandl() function from the Quandl package to download data for WTI Crude Oil. The data contains Open, Close, Low, High, Last, Volume, etc. We will use the Last price column and calculate the returns based on these Last prices. The code for preparing the data is shown below:

#  Get Historical Futures Prices: Crude Oil Futures from Quandl. 
#  Contract Code is CL

library(Quandl) # If not loaded

CME_CL_Data <- Quandl('CHRIS/CME_CL1')

# Quandl returns the newest observations first. Sort the data frame by date so that the oldest data is at the top and the latest data is at the bottom. We use the arrange command from dplyr to achieve this.
 
library(dplyr) # If not loaded
 
# The new data frame CME_CL_Data_ starts from the oldest data of the CL future continuous series from Quandl. Sorting on the date itself, rather than reversing the row names, guarantees that the dates are consecutive.
 
CME_CL_Data_ <- CME_CL_Data %>%
  mutate(date = as.Date(Date)) %>%
  arrange(date)


# Calculate the log returns. The following command adds a returns column to the dataset. The first value is NA because the first price has no previous price to compare with.
 
CME_CL_Data_$returns <- c(NA, diff(log(CME_CL_Data_$Last)))
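The return calculation can be checked on a handful of prices without downloading anything. The prices vector below is invented for illustration; it only shows what diff(log(...)) computes.

```r
# Four made-up prices, oldest first
prices <- c(50, 51, 49.5, 52)

# Log returns: log(P_t) - log(P_{t-1}), padded with NA for the first day
log_returns <- c(NA, diff(log(prices)))
print(log_returns)

# Simple returns for comparison: (P_t - P_{t-1}) / P_{t-1}
simple_returns <- c(NA, diff(prices) / head(prices, -1))
print(simple_returns)

# Log returns add up to the log of the total price change
print(sum(log_returns, na.rm = TRUE))
print(log(prices[4] / prices[1]))
```

For small daily moves the two kinds of returns are nearly identical, so the choice has little effect on the shape of the QQ plot.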

Our returns data is now ready, and we can proceed with the creation of the QQ plot and the histogram.

# We will plot two graphs in a single plot. The par(mfrow=c(2,1)) call sets up the plotting area as a 2x1 array: 2 rows and 1 column
 
par(mfrow=c(2, 1))
 
# Define the ‘returns’ vector with the values of the returns column from CME_CL_Data_
 
returns <- CME_CL_Data_$returns
 
# Compare the quantiles of the returns to the quantiles of a normal distribution. qqnorm plots the quantiles of the series and qqline adds the theoretical reference line
 
qqnorm(returns, main="CL Returns")
 
qqline(returns, col="red")
 
# Generate a histogram with the returns 
 
ret_hist <- hist(returns, breaks=50,col='red')

The created graphs look as follows:

Both the QQ plot and the histogram show that the returns of the CL futures contract are far from normally distributed: the histogram has fat tails on both the left and right sides, and the points in the QQ plot deviate from the theoretical quantiles line. The histogram has a leptokurtic shape, with a sharp peak and fat tails.
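The leptokurtic shape can also be expressed as a number. The sketch below computes excess kurtosis by hand, since base R has no built-in function for it; the helper excess_kurtosis() and the simulated samples are assumptions made for this illustration and are not part of the original analysis.

```r
# Hypothetical helper: sample excess kurtosis (0 for a normal distribution)
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]               # drop missing values such as the first return
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# Simulated samples: one normal, one with fat tails (Student t, 5 df)
set.seed(123)
normal_sample <- rnorm(5000)
fat_tail_sample <- rt(5000, df = 5)

print(excess_kurtosis(normal_sample))   # close to 0
print(excess_kurtosis(fat_tail_sample)) # clearly positive
```

Applied to the returns vector from the tutorial, a clearly positive value would agree with the fat tails seen in the plots.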

CFA Institute does not endorse, promote or warrant the accuracy or quality of Finance Train. CFA® and Chartered Financial Analyst® are registered trademarks owned by CFA Institute.

Copyright © 2021 Finance Train. All rights reserved.
