Finance Train

Multiple Linear Regression

This lesson is part 10 of 22 in the course Machine Learning in Finance Using Python

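The multiple linear regression model states that a response y can be estimated from a set of input features x plus an error term ε. The model can be expressed with the following equation:

$$y = \beta^T X + \varepsilon = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$$

Here $\beta^T X$ is the matrix notation of the equation, where $\beta, X \in \mathbb{R}^{p+1}$ and $\varepsilon \sim N(\mu, \sigma^2)$.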

$\beta^T$ (the transpose of $\beta$) and $X$ are both real-valued vectors of dimension p+1, and ε is the error term, which represents the difference between the true observation of the variable y and the prediction of the model.

The vector $\beta^T = (\beta_0, \beta_1, \dots, \beta_p)$ stores all the beta coefficients of the model. Each slope coefficient measures how a one-unit change in the corresponding independent variable affects the dependent (target) variable, holding the other variables fixed.

The vector $X = (1, x_1, x_2, \dots, x_p)$ holds all the values of the independent variables. Both vectors ($\beta$ and $X$) are (p+1)-dimensional because of the need to include an intercept term: the leading 1 in $X$ pairs with the intercept $\beta_0$.
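To make the role of the leading 1 concrete, here is a minimal NumPy sketch of the deterministic part of the model, the product of the coefficient vector and the input vector. The coefficient and feature values are made up for illustration, and the names `beta`, `features`, and `y_hat` are assumptions rather than anything defined in this lesson.

```python
import numpy as np

# Hypothetical coefficients: intercept beta_0 followed by p = 3 slopes.
beta = np.array([0.5, 2.0, -1.0, 0.25])

# One observation with p = 3 feature values (illustrative numbers).
features = np.array([1.0, 3.0, 4.0])

# Prepend the constant 1 so that beta_0 acts as the intercept.
x = np.concatenate(([1.0], features))

# Deterministic part of the model: beta^T x.
y_hat = beta @ x
print(y_hat)  # 0.5 = 0.5 + 2.0*1.0 - 1.0*3.0 + 0.25*4.0
```

Without the prepended 1, the two vectors would have different lengths and the intercept would have nothing to multiply.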

The goal of the linear regression model is to minimize the difference between the predictions and the real observations of the target variable. For this purpose, a method called Ordinary Least Squares (OLS) is used, which derives the optimal set of coefficients for fitting the model.

Ordinary Least Squares

Formally, OLS minimizes the Residual Sum of Squares (RSS) between the observations of the target variable and the predictions of the model. The RSS is the loss function used to assess how well the linear regression model fits, and has the following formulation:

$$RSS(\beta) = \sum_{i=1}^{N} (y_i - \beta^T x_i)^2$$

The Residual Sum of Squares, also known as the Sum of Squared Errors (SSE), adds up the squared differences between the predictions $\beta^T x_i$ and the observations $y_i$ over the N observations. Minimizing this function gives the optimal estimate of the parameter vector $\beta$.
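The sum of squared differences between observations and predictions can be sketched in a few lines of NumPy. This is only an illustration on a tiny made-up dataset; the function name `rss` and the data are assumptions, not part of the course material.

```python
import numpy as np

def rss(beta, X, y):
    """Residual sum of squares for coefficients beta, inputs X and targets y."""
    residuals = y - X @ beta  # one residual per observation
    return float(residuals @ residuals)

# Toy data generated by y = 1 + 2*x; the first column of ones is the intercept.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

print(rss(np.array([1.0, 2.0]), X, y))  # 0.0, these coefficients fit exactly
print(rss(np.array([0.0, 2.0]), X, y))  # 3.0, every residual equals 1
```

The second call shows how a wrong intercept raises the loss: each of the three predictions is off by 1, so the RSS is 1 + 1 + 1 = 3.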

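In matrix notation, with $y$ the vector of the N observed targets and $X$ the matrix of inputs, the RSS equation is the following:

$$RSS(\beta) = (y - X\beta)^T (y - X\beta)$$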

To get the optimal values of $\beta$, it is necessary to differentiate the RSS with respect to $\beta$:

$$\frac{\partial RSS}{\partial \beta} = -2 X^T (y - X\beta)$$

Remember that X is a matrix with all the independent variables, with N observations and p features. Including the column of ones for the intercept, the dimension of X is N (rows) × p+1 (columns).
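Differentiating the matrix form of the RSS with respect to β gives the standard gradient −2·Xᵀ(y − Xβ). As a sanity check, the sketch below compares that expression with a finite-difference approximation on random data. The data, seed, and variable names are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3

# Design matrix of shape N x (p+1): a column of ones plus p random features.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)
beta = rng.normal(size=p + 1)

def rss(b):
    r = y - X @ b
    return r @ r

# Analytic gradient of the RSS with respect to beta.
analytic = -2.0 * X.T @ (y - X @ beta)

# Central finite differences, one coordinate of beta at a time.
eps = 1e-6
numeric = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(p + 1)
])

print(np.allclose(analytic, numeric, rtol=1e-5, atol=1e-5))
```

Because the RSS is quadratic in β, the central difference matches the analytic gradient up to floating-point rounding.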

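One assumption of this model is that the matrix $X^T X$ is positive-definite, which requires X to have full column rank. This is only possible when there are at least as many observations as columns (N ≥ p+1) and no feature is an exact linear combination of the others. For high-dimensional data (e.g. text document classification), where features often outnumber observations, this assumption does not hold.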
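One way to test this assumption in practice is to check whether X has full column rank, which is equivalent to XᵀX being positive-definite. The sketch below uses `numpy.linalg.matrix_rank` on random matrices. The helper name `gram_is_positive_definite` and the example shapes are hypothetical choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def gram_is_positive_definite(X):
    """True when X has full column rank, i.e. X^T X is positive-definite."""
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

# More observations than columns: N = 20, p + 1 = 4.
X_tall = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])

# Fewer observations than columns: N = 3, p + 1 = 6.
X_wide = np.column_stack([np.ones(3), rng.normal(size=(3, 5))])

# Enough observations, but the last feature duplicates another one.
X_collinear = np.column_stack([X_tall, X_tall[:, 1]])

print(gram_is_positive_definite(X_tall))       # True
print(gram_is_positive_definite(X_wide))       # False
print(gram_is_positive_definite(X_collinear))  # False
```

The third case shows that having many observations is not sufficient on its own: a perfectly collinear feature also makes XᵀX singular.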

Under the assumption of a positive-definite $X^T X$, the derivative is set to zero and solved for the $\beta$ parameters:

$$X^T (y - X\beta) = 0 \quad\Rightarrow\quad \hat{\beta} = (X^T X)^{-1} X^T y$$
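Setting the derivative to zero leads to the normal equations (XᵀX)β = Xᵀy. The sketch below solves them with NumPy on synthetic data and compares the result with `numpy.linalg.lstsq`. The "true" coefficients, noise level, and seed are assumptions chosen for the illustration, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(42)
N, p = 200, 2

# Hypothetical "true" coefficients: intercept plus two slopes.
true_beta = np.array([1.0, 2.0, -3.0])

# Design matrix with a leading column of ones, and noisy targets.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ true_beta + rng.normal(scale=0.1, size=N)

# Solve the normal equations (X^T X) beta = X^T y.
# np.linalg.solve avoids forming the inverse of X^T X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

Solving the linear system is generally preferred over computing the inverse directly, since it is cheaper and numerically more stable, although both express the same closed-form solution.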

Later we will show an example using a dataset of the Open, High, Low, Close and Volume of the S&P 500 to fit and evaluate a multiple linear regression model using the scikit-learn library.

CFA Institute does not endorse, promote or warrant the accuracy or quality of Finance Train. CFA® and Chartered Financial Analyst® are registered trademarks owned by CFA Institute.

Copyright © 2021 Finance Train. All rights reserved.
