Classifier Model in Machine Learning Using Python

In this post, we will learn how to create a classifier model in machine learning using Python. We will build a supervised classifier that is trained on a dataset with a set of features and then uses test data to predict the price direction on day t with information known only up to day t-1. The price direction is up when the closing price at t is higher than the price at t-1, and down when the closing price at t is lower than at t-1.

For this task we create a set of features: the lagged returns of the previous 2 days and the daily percentage change in volume. We then fit a set of models to the training data: Logistic Regression, Linear Discriminant Analysis, a linear Support Vector Classifier, a Support Vector Machine with a radial (RBF) kernel, and Random Forest.
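As a rough illustration of the features described above, the sketch below builds the two lagged returns, the volume change and the direction label for a handful of made-up prices. The price and volume values and the `features` name are assumptions chosen for the example; the full `model_variables()` function later in the post does the same job on the SPY data.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices and volumes, for illustration only
prices = pd.DataFrame(
    {"Close": [100.0, 102.0, 101.0, 103.0, 104.0],
     "Volume": [1000, 1200, 900, 1100, 1000]}
)

features = pd.DataFrame(index=prices.index)
# Daily percentage return of the closing price
features["returns"] = prices["Close"].pct_change() * 100.0
# Returns of the previous one and two days
features["Lag1"] = features["returns"].shift(1)
features["Lag2"] = features["returns"].shift(2)
# Daily percentage change in volume (same day, as in the post's code)
features["VolumeChange"] = prices["Volume"].pct_change()

# Drop the first rows, which have no complete lag history
features = features.dropna()
# +1 for an up day, -1 for a down day
features["Direction"] = np.sign(features["returns"])
print(features)
```

Only the last two rows survive `dropna()`, because two lags need at least three earlier prices.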

For each model we will output two metrics that are commonly used in classification problems to assess model performance: the Hit Rate and the Confusion Matrix.

Hit Rate

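The Hit Rate measures the percentage of times the classifier makes a correct prediction (up or down). It can be expressed with the following formula:

Hit Rate = (number of correct predictions) / (total number of predictions)

In terms of the confusion matrix entries defined in the next section, this is (UT + DT) / (UT + UF + DF + DT).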
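A minimal sketch of this calculation in plain Python is shown below. The `hit_rate` helper and the two lists of directions are hypothetical and exist only for this example.

```python
def hit_rate(actual, predicted):
    """Fraction of predictions whose direction matches the actual direction."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical directions: +1 is an up day, -1 is a down day
actual    = [1, -1, 1,  1, -1, 1, -1, 1]
predicted = [1,  1, 1, -1, -1, 1,  1, 1]

rate = hit_rate(actual, predicted)
print(f"Hit rate: {rate:.3f}")
```

Five of the eight predictions match, so the hit rate here is 0.625.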

Confusion Matrix

The Confusion Matrix shows how many times the classifier predicted up correctly and how many times it predicted down correctly, together with the incorrect predictions of each kind.

In a binary classification problem, the confusion matrix is a 2 x 2 contingency table from which we can determine the False Positive Rate (Type I error: incorrectly rejecting a true null hypothesis; UF in the contingency table) and the False Negative Rate (Type II error: failing to reject a false null hypothesis; DF in the contingency table).

UT  UF
DF  DT

UT represents correctly classified up periods, UF represents incorrectly classified up periods (they were classified as down periods), DF represents incorrectly classified down periods (they were classified as up periods), and DT represents correctly classified down periods.

The scikit-learn library provides methods to calculate the Hit Rate and the Confusion Matrix for a classifier. The dataset for the model is the SPY data from 2015-01-01 to 2019-09-18, which we will load into the environment.
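As a small sketch of those scikit-learn methods, the example below uses `accuracy_score` for the hit rate and `confusion_matrix` for the contingency table. The direction lists are made up for illustration. Passing `labels=[1, -1]` is a choice made here so that "up" comes first and the output follows the UT/UF/DF/DT layout above; by default scikit-learn sorts the labels, which would put -1 first.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical directions: +1 is an up day, -1 is a down day
y_true = [1, -1, 1,  1, -1, 1, -1, 1]
y_pred = [1,  1, 1, -1, -1, 1,  1, 1]

# Share of correct predictions (the hit rate)
hit = accuracy_score(y_true, y_pred)

# Rows are actual directions, columns are predicted directions;
# labels=[1, -1] lists "up" before "down"
cm = confusion_matrix(y_true, y_pred, labels=[1, -1])

print(hit)
print(cm)
```

With these inputs the matrix reads UT = 4, UF = 1, DF = 2 and DT = 1.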

The Python example below contains the following steps:

  • Step 1: Import libraries and load the SPY data into the environment with the read_csv() function. This is the dataset of the model, with dates from 2015-01-01 to 2019-09-18.
  • Step 2: Create the features and the target variable with the model_variables() function.
  • Step 3: Fit the different models: Logistic Regression, Linear Discriminant Analysis, linear and RBF Support Vector Machines, and Random Forest.
  • Step 4: Obtain the Confusion Matrix and the Hit Rate for each of the models.
import pandas as pd 
import numpy as np
import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC, SVC
data = pd.read_csv("C:/Users/Nicolas/Documents/Machine Learning Course/spy_data.csv",index_col='Date')
 
def model_variables(prices,lags):
    '''
    Parameters:
        prices: dataframe with historical data of SPY with the variables closes and 
        volume.
        lags: Number of lags of the closing price that will be created by the function
        to make the lagged returns features of the machine learning model
    Output:
        tsret: dataframe with index date, the independent variables(X) and the 
        dependent variable(y) of the machine learning model
    '''
    
    # Change data types of prices dataframe from object to numeric
    prices = prices.apply(pd.to_numeric)
    # Create the new lagged DataFrame
    inputs = pd.DataFrame(index=prices.index)
    
    inputs["Close"] = prices["Close"]
    inputs["Volume"] = prices["Volume"]
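    # Create the returns DataFrame once, before the loop, so that it exists
    # for any number of lags
    tsret = pd.DataFrame(index=inputs.index)
    # Create the shifted lag series of prior trading period close values
    for i in range(0, lags):
        inputs["Lag%s" % str(i+1)] = prices["Close"].shift(i+1)

    # Percentage change in volume and percentage return of the close.
    # Note: VolumeChange is the change on the current day; apply .shift(1)
    # to restrict it to information known the day before
    tsret["VolumeChange"] = inputs["Volume"].pct_change()
    tsret["returns"] = inputs["Close"].pct_change()*100.0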
        
    # If any of the values of percentage returns are (almost) zero, set them to
    # a small number (stops issues with QDA model in Scikit-Learn).
    # .loc avoids chained assignment, which pandas may apply to a copy
    tsret.loc[tsret["returns"].abs() < 0.0001, "returns"] = 0.0001
    
    # Create the lagged percentage returns columns
    for i in range(0, lags):
        tsret["Lag%s" % str(i+1)] = \
          inputs["Lag%s" % str(i+1)].pct_change()*100.0
    
    # Create the "Direction" column (+1 or -1) indicating an up/down day
    tsret = tsret.dropna()
    tsret["Direction"] = np.sign(tsret["returns"])
    
    # Convert index to datetime in order to filter the dataframe by dates when 
    # we create the train and test dataset
    tsret.index = pd.to_datetime(tsret.index)
    return tsret

# Pass the dataset(data) and the number of lags 2 as the inputs of the model_variables  function
variables_data = model_variables(data,2)
 
# Use the prior two days of returns and the volume change as predictors
# values, with direction as the response
dataset = variables_data[["Lag1","Lag2","VolumeChange","Direction"]]
dataset = dataset.dropna()
 
# Create the dataset with independent variables (X) and dependent variable y
X = dataset[["Lag1","Lag2","VolumeChange"]]
y = dataset["Direction"]

# Split the train and test dataset using the date in the date_split variable
# This will create a train dataset of 4 years data and a test dataset for more than 
# 9 months data.
 
date_split = datetime.datetime(2019,1,1)
 
X_train = X[X.index <= date_split]
X_test =  X[X.index > date_split]
y_train = y[y.index <= date_split]
y_test = y[y.index > date_split]
 
# Create the (parametrised) models
print("Hit Rates/Confusion Matrices:\n")
models = [("LR", LogisticRegression()),
              ("LDA", LDA()),
              ("LSVC", LinearSVC()),
              ("RSVM", SVC(
                      C=1000000.0, cache_size=200, class_weight=None,
                      coef0=0.0, degree=3, gamma=0.0001, kernel='rbf',
                      max_iter=-1, probability=False, random_state=None,
                      shrinking=True, tol=0.001, verbose=False)
    ),
    ("RF", RandomForestClassifier(
            n_estimators=1000, criterion='gini',
            max_depth=None, min_samples_split=2,
            # 'sqrt' is what the removed 'auto' option meant for classifiers
            min_samples_leaf=30, max_features='sqrt',
            bootstrap=True, oob_score=False, n_jobs=1,
            random_state=None, verbose=0)
    )]
 
 
# Iterate through the models and obtain the accuracy metrics: Hit Rate and Confusion Matrix
for m in models:
    # Train each of the models on the training set
    m[1].fit(X_train, y_train)
    # Make an array of predictions on the test set
    pred = m[1].predict(X_test)
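    # Output the hit-rate and the confusion matrix for each model.
    # The predictions are passed first, so rows are predicted directions
    # and columns are actual directions
    print("%s:\n%0.3f" % (m[0], m[1].score(X_test, y_test)))
    print("%s\n" % confusion_matrix(pred, y_test))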

The results of the code are the Hit Rate and the Confusion Matrix for each of the trained models. The diagonal of each matrix holds the correct predictions (up and down), and the off-diagonal entries hold the incorrect predictions (the prediction was down and the price went up, or the prediction was up and the price went down).

LR:
0.583
[[28 29]
 [46 77]]

LDA:
0.589
[[28 28]
 [46 78]]

LSVC:
0.583
[[28 29]
 [46 77]]

RSVM:
0.572
[[20 23]
 [54 83]]

RF:
0.600
[[34 32]
 [40 74]]
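As a quick consistency check, each hit rate printed above can be recovered from its confusion matrix by dividing the diagonal (correct predictions) by the total number of predictions. The sketch below copies the matrices from the output; the `hit_rate_from_matrix` helper is a hypothetical name used only here.

```python
# Confusion matrices copied from the output above
matrices = {
    "LR":   [[28, 29], [46, 77]],
    "LDA":  [[28, 28], [46, 78]],
    "LSVC": [[28, 29], [46, 77]],
    "RSVM": [[20, 23], [54, 83]],
    "RF":   [[34, 32], [40, 74]],
}

def hit_rate_from_matrix(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

hit_rates = {name: round(hit_rate_from_matrix(cm), 3)
             for name, cm in matrices.items()}
print(hit_rates)
```

Each matrix sums to 180 test days, and the rounded ratios match the printed scores.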

The model with the best prediction score is the Random Forest, with a Hit Rate of 60%. We have changed the min_samples_leaf parameter of the Random Forest Classifier from its default value of 1 to 30.

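This means that each leaf node holds at least 30 observations: a split is considered only if it leaves at least 30 training samples in each of the left and right branches.

To read the matrices above, note that scikit-learn sorts the labels (-1 before +1) and that the code passes the predictions as the first argument, so rows are predicted directions and columns are actual directions, with down listed first. Read this way, all the models identify up periods better than down periods: the true positive rate for the “up” days, UT / (UT + UF), is significantly higher than the true positive rate for the “down” days, DT / (DT + DF). For Logistic Regression, for example, these are 77/106 (about 0.73) and 28/74 (about 0.38).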
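The effect of min_samples_leaf can be checked directly on a fitted tree. The sketch below is an illustration under stated assumptions: it uses a single `DecisionTreeClassifier` instead of a full forest to keep it short (the parameter has the same meaning for every tree in a `RandomForestClassifier`), and the data is synthetic rather than the SPY dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only (not the SPY dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
noise = rng.normal(scale=0.5, size=500)
y_demo = np.where(X_demo[:, 0] + noise > 0, 1, -1)

tree = DecisionTreeClassifier(min_samples_leaf=30, random_state=0)
tree.fit(X_demo, y_demo)

# Leaves are the nodes without children (children_left == -1)
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

print("Number of leaves:", is_leaf.sum())
print("Smallest leaf size:", leaf_sizes.min())
```

No leaf ends up with fewer than 30 training samples, which limits how closely the tree can fit noise in the training data.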
