Classifier Model in Machine Learning Using Python

This lesson is part 17 of 22 in the course Machine Learning in Finance Using Python

In this post, we will learn how to create a classifier model in machine learning using Python. We will build a supervised classifier that is trained on a dataset with a set of features and then uses test data to predict the price direction at day k using only information known at day k-1. The price direction is up when the closing price at t is higher than the closing price at t-1, and down when the closing price at t is lower than at t-1.
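
As a quick illustration of this labeling rule, here is a minimal sketch (the closing prices below are made up for illustration; the lesson itself uses SPY data):

import pandas as pd
import numpy as np

# Hypothetical closing prices for five consecutive trading days
close = pd.Series([100.0, 101.5, 101.0, 102.2, 101.8])

# Direction at day t: +1 if the close is higher than at t-1, -1 if it is lower
direction = np.sign(close.pct_change()).dropna()
print(direction.tolist())   # [1.0, -1.0, 1.0, -1.0]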

For this task we create a set of features: the lagged returns of the previous two days and the daily percentage change in volume. We then fit several models to the training data using the following algorithms: Logistic Regression, Linear Discriminant Analysis, Linear Support Vector Classifier, Support Vector Machine (RBF kernel) and Random Forest.

For each model we will output two metrics that are commonly used in classification problems to assess model performance: the Hit Rate and the Confusion Matrix.

Hit Rate

The Hit Rate measures the percentage of times the classifier makes a correct prediction (up or down). It can be expressed with the following formula:

Hit Rate = (correct up predictions + correct down predictions) / total predictions, or (UT + DT) / (UT + UF + DF + DT) in the notation of the confusion matrix below.
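
In scikit-learn the Hit Rate is simply the classifier's accuracy on the test set. A minimal sketch with made-up direction labels (not the SPY data):

import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical actual directions and model predictions
y_true = np.array([1, -1, 1, 1, -1, 1])
y_pred = np.array([1, 1, 1, -1, -1, 1])

hit_rate = accuracy_score(y_true, y_pred)   # fraction of correct predictions
print(round(hit_rate, 3))                   # 0.667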

Confusion Matrix

The Confusion Matrix measures how many times the classifier predicted up correctly and how many times it predicted down correctly.

In a binary classification problem, the confusion matrix is a 2 x 2 contingency table from which we can read the False Positive Rate (Type I error: incorrectly rejecting a true null hypothesis) and the False Negative Rate (Type II error: failing to reject a false null hypothesis). Treating an up day as the positive class, a false positive corresponds to DF in the table below and a false negative to UF.

                 Predicted Up    Predicted Down
    Actual Up         UT               UF
    Actual Down       DF               DT

UT represents correctly classified up periods, UF represents up periods that were incorrectly classified as down, DF represents down periods that were incorrectly classified as up, and DT represents correctly classified down periods.
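
scikit-learn can produce this contingency table directly with its confusion_matrix function. A minimal sketch with made-up labels; note that by default the rows and columns follow the sorted class labels (-1 before +1), so we pass labels=[1, -1] to match the UT/UF layout above:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual directions and model predictions
y_true = np.array([1, -1, 1, 1, -1, 1])
y_pred = np.array([1, 1, 1, -1, -1, 1])

# Rows are actual values, columns are predictions, ordered [up, down]
print(confusion_matrix(y_true, y_pred, labels=[1, -1]))
# [[3 1]     -> UT = 3, UF = 1
#  [1 1]]    -> DF = 1, DT = 1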

The scikit-learn library provides methods to calculate the Hit Rate and the Confusion Matrix for a classifier. The dataset for the model is SPY data from 2015-01-01 to 2019-09-18, which we will load into the environment.

Download: spy_data.csv

The Python example below contains the following steps:

  • Step 1: Import libraries and load the SPY data into the environment with the read_csv() function. This is the dataset for the model, with dates from 2015-01-01 to 2019-09-18.
  • Step 2: Create the features and the target variable with the model_variables() function.
  • Step 3: Fit the different models: Logistic Regression, Linear Discriminant Analysis, Linear Support Vector Classifier, Support Vector Machine (RBF kernel) and Random Forest.
  • Step 4: Obtain the Confusion Matrix and the Hit Rate for each of the models.
import pandas as pd 
import numpy as np
import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC, SVC
data = pd.read_csv("C:/Users/Nicolas/Documents/Machine Learning Course/spy_data.csv",index_col='Date')
 
def model_variables(prices,lags):
    '''
    Parameters:
        prices: dataframe with historical data of SPY with the variables closes and 
        volume.
        lags: Number of lags of the closing price that will be created by the function
        to make the lagged returns features of the machine learning model
    Output:
        tsret: dataframe with index date, the independent variables(X) and the 
        dependent variable(y) of the machine learning model
    '''
    
    # Change data types of prices dataframe from object to numeric
    prices = prices.apply(pd.to_numeric)
    # Create the new lagged DataFrame
    inputs = pd.DataFrame(index=prices.index)
    
    inputs["Close"] = prices["Close"]
    inputs["Volume"] = prices["Volume"]
    # Create the shifted lag series of prior trading period close values
    for i in range(0, lags):
        inputs["Lag%s" % str(i+1)] = prices["Close"].shift(i+1)

    # Create the returns DataFrame
    tsret = pd.DataFrame(index=inputs.index)
    tsret["VolumeChange"] = inputs["Volume"].pct_change()
    tsret["returns"] = inputs["Close"].pct_change()*100.0

    # If any of the percentage returns are (close to) zero, set them to
    # a small number (stops issues with the QDA model in scikit-learn)
    tsret.loc[tsret["returns"].abs() < 0.0001, "returns"] = 0.0001

    # Create the lagged percentage returns columns
    for i in range(0, lags):
        tsret["Lag%s" % str(i+1)] = \
            inputs["Lag%s" % str(i+1)].pct_change()*100.0
    
    # Create the "Direction" column (+1 or -1) indicating an up/down day
    tsret = tsret.dropna()
    tsret["Direction"] = np.sign(tsret["returns"])
    
    # Convert index to datetime in order to filter the dataframe by dates when 
    # we create the train and test dataset
    tsret.index = pd.to_datetime(tsret.index)
    return tsret

# Pass the dataset (data) and the number of lags (2) as inputs to the model_variables function
variables_data = model_variables(data,2)
 
# Use the prior two days of returns and the volume change as predictors
# values, with direction as the response
dataset = variables_data[["Lag1","Lag2","VolumeChange","Direction"]]
dataset = dataset.dropna()
 
# Create the dataset with independent variables (X) and dependent variable y
X = dataset[["Lag1","Lag2","VolumeChange"]]
y = dataset["Direction"]

# Split the train and test datasets using the date in the date_split variable.
# This creates a training dataset with 4 years of data and a test dataset with
# about 8.5 months of data.
 
date_split = datetime.datetime(2019,1,1)
 
X_train = X[X.index <= date_split]
X_test =  X[X.index > date_split]
y_train = y[y.index <= date_split]
y_test = y[y.index > date_split]
 
# Create the (parametrised) models
print("Hit Rates/Confusion Matrices:\n")
models = [("LR", LogisticRegression()),
          ("LDA", LDA()),
          ("LSVC", LinearSVC()),
          ("RSVM", SVC(
              C=1000000.0, cache_size=200, class_weight=None,
              coef0=0.0, degree=3, gamma=0.0001, kernel='rbf',
              max_iter=-1, probability=False, random_state=None,
              shrinking=True, tol=0.001, verbose=False)),
          ("RF", RandomForestClassifier(
              n_estimators=1000, criterion='gini',
              max_depth=None, min_samples_split=2,
              min_samples_leaf=30, max_features='sqrt',
              bootstrap=True, oob_score=False, n_jobs=1,
              random_state=None, verbose=0))]
 
 
# Iterate through the models and obtain the accuracy metrics: Hit Rate and Confusion Matrix
for m in models:
    # Train each of the models on the training set
    m[1].fit(X_train, y_train)
    # Make an array of predictions on the test set
    pred = m[1].predict(X_test)
    # Output the hit-rate and the confusion matrix for each model
 
    print("%s:\n%0.3f" % (m[0], m[1].score(X_test, y_test)))
    print("%s\n" % confusion_matrix(pred, y_test))

The results of the code are the Hit Rate and the Confusion Matrix for each of the trained models. The diagonal of each matrix represents the correct predictions (up and down), and the off-diagonal elements represent the incorrect predictions (the prediction was down and the price went up, or the prediction was up and the price went down). Note that because the predictions are passed as the first argument to confusion_matrix(), the rows of each matrix correspond to the predicted class and the columns to the actual class, with the classes ordered -1 (down) first and +1 (up) second.

LR:
0.583
[[28 29]
 [46 77]]

LDA:
0.589
[[28 28]
 [46 78]]

LSVC:
0.583
[[28 29]
 [46 77]]

RSVM:
0.572
[[20 23]
 [54 83]]

RF:
0.600
[[34 32]
 [40 74]]
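
The Hit Rate can be recovered from any of these matrices by dividing the sum of the diagonal by the total number of test observations. A quick check with the Logistic Regression matrix above:

import numpy as np

# Logistic Regression confusion matrix from the output above
cm = np.array([[28, 29],
               [46, 77]])

hit_rate = np.trace(cm) / cm.sum()   # (28 + 77) / 180
print(round(hit_rate, 3))            # 0.583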

The model with the best prediction score is the Random Forest, with a Hit Rate of 60%. We changed the min_samples_leaf parameter of the Random Forest Classifier from its default value of 1 to 30.

This means that each leaf node must contain at least 30 observations, so a split is only considered if it leaves at least 30 training samples in each of the left and right branches. All the models predict down periods better than up periods, as the true positive rate for the "down" days, DT/(DT+UF), is significantly higher than the true positive rate for the "up" days, UT/(UT+DF).
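
To see the effect of this parameter, the snippet below (a sketch, not part of the lesson's code) refits the Random Forest with the default leaf size and with the constrained one, assuming X_train, y_train, X_test and y_test from the code above are still in scope; random_state is fixed only to make the comparison reproducible:

from sklearn.ensemble import RandomForestClassifier

# Compare the default leaf size (1) with the constrained leaf size (30)
for leaf_size in (1, 30):
    rf = RandomForestClassifier(n_estimators=1000,
                                min_samples_leaf=leaf_size,
                                random_state=0)
    rf.fit(X_train, y_train)
    print("min_samples_leaf=%d: hit rate %.3f"
          % (leaf_size, rf.score(X_test, y_test)))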
