In this post, we will learn how to create a classifier model in machine learning using Python. We will create a supervised classifier that is trained on a dataset with a set of features and then use test data to predict the price direction on day t with information known only up to day t-1. The price direction is up when the closing price at t is higher than the closing price at t-1, and down when it is lower.
For this task we create a set of features: the lagged returns for the previous 2 days and the daily percent change in volume. Then we fit several models on the training data: Logistic Regression, Linear Discriminant Analysis, a linear Support Vector Classifier, a Support Vector Machine with an RBF kernel, and a Random Forest.
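As a rough illustration of how the lagged-return features line up with the target, the sketch below builds them by hand for a short, made-up price series. The prices and the helper names `pct_returns` and `lagged_rows` are illustrative assumptions; they are not part of the SPY dataset or of the code later in the post.

```python
def pct_returns(closes):
    """Percentage close-to-close returns; the first day has no return."""
    return [None] + [(c / p - 1.0) * 100.0 for p, c in zip(closes, closes[1:])]


def lagged_rows(closes, lags=2):
    """Build (features, direction) rows for each day t.

    features  : returns of days t-1 ... t-lags (known before day t closes)
    direction : +1 if day t closed up, -1 if it closed down
    """
    rets = pct_returns(closes)
    rows = []
    # Day t needs returns for t-1 ... t-lags, so the first usable t is lags + 1.
    for t in range(lags + 1, len(closes)):
        features = [rets[t - k] for k in range(1, lags + 1)]
        # Assumption for this sketch: a perfectly flat day is counted as up.
        direction = 1 if rets[t] >= 0 else -1
        rows.append((features, direction))
    return rows


# Made-up closing prices, for illustration only.
closes = [100.0, 102.0, 101.0, 103.0, 104.0]
for features, direction in lagged_rows(closes):
    print([round(f, 3) for f in features], direction)
```

Each row pairs the two most recent returns with the direction of the day being predicted, so no feature uses the closing price of the target day itself.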
For each model we will output two metrics that are commonly used in classification problems to assess model performance: the Hit Rate and the Confusion Matrix.
The Hit Rate is the percentage of times the classifier makes a correct prediction (up or down). It can be expressed with the following formula:

Hit Rate = number of correct predictions / total number of predictions
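A minimal sketch of this calculation, assuming the labels are encoded as +1 for up and -1 for down. The function name `hit_rate` and the sample labels are made up for illustration. For a fitted scikit-learn classifier, `score(X_test, y_test)` returns this same accuracy figure.

```python
def hit_rate(actual, predicted):
    """Fraction of days on which the predicted direction matches the actual one."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)


# Hypothetical directions: +1 = up day, -1 = down day.
actual = [1, -1, 1, 1, -1]
predicted = [1, 1, 1, -1, -1]
print(hit_rate(actual, predicted))  # 3 of the 5 predictions match
```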
The Confusion Matrix shows how many times the classifier predicted up correctly and how many times it predicted down correctly, along with the two kinds of misclassification.
In a binary classification problem, the confusion matrix is a 2 x 2 contingency table from which the False Positive Rate (Type I error: incorrectly rejecting a true null hypothesis) and the False Negative Rate (Type II error: failing to reject a false null hypothesis) can be derived. Which off-diagonal cell holds the false positives depends on the class treated as positive: with "up" as the positive class, DF counts the false positives and UF counts the false negatives. In the layout below, rows are actual classes and columns are predicted classes.

UT  UF
DF  DT

UT represents correctly classified up periods, UF represents incorrectly classified up periods (they were classified as down periods), DF represents incorrectly classified down periods (they were classified as up periods), and DT represents correctly classified down periods.
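The four cells can be tallied directly from paired labels. The sketch below assumes +1 for up and -1 for down and uses a made-up helper, `confusion_counts`; it follows the UT/UF/DF/DT naming defined above rather than any particular library's layout.

```python
def confusion_counts(actual, predicted):
    """Tally the four cells of the 2 x 2 table for up (+1) / down (-1) labels."""
    counts = {"UT": 0, "UF": 0, "DF": 0, "DT": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["UT"] += 1      # up day, classified as up
        elif a == 1 and p == -1:
            counts["UF"] += 1      # up day, classified as down
        elif a == -1 and p == 1:
            counts["DF"] += 1      # down day, classified as up
        elif a == -1 and p == -1:
            counts["DT"] += 1      # down day, classified as down
        else:
            raise ValueError("labels must be +1 or -1")
    return counts


# Hypothetical labels, for illustration only.
actual = [1, -1, 1, 1, -1, -1]
predicted = [1, 1, 1, -1, -1, 1]
c = confusion_counts(actual, predicted)
print(c["UT"], c["UF"])
print(c["DF"], c["DT"])
# The hit rate is the diagonal (UT + DT) over the total number of predictions.
print((c["UT"] + c["DT"]) / sum(c.values()))
```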
The scikit-learn library provides tools to calculate the Hit Rate and the Confusion Matrix for a classifier. The dataset for the model is SPY data from 2015-01-01 to 2019-09-18. We will load the dataset into the environment.
The Python example below contains the following steps:
- Step 1: Import libraries, and load the SPY data (2015-01-01 to 2019-09-18) into the environment with the read_csv() function
- Step 2: Create features and target variable with the model_variables() function
- Step 3: Fit different models: Logistic Regression, Linear Discriminant Analysis, Linear Support Vector Classifier, RBF-kernel Support Vector Machine, Random Forest
- Step 4: Obtain metrics such as Confusion Matrix and Hit Rate for each of the models.
```python
import datetime

import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC, SVC

data = pd.read_csv(
    "C:/Users/Nicolas/Documents/Machine Learning Course/spy_data.csv",
    index_col='Date')


def model_variables(prices, lags):
    '''
    Parameters:
        prices: dataframe with historical SPY data containing the Close and
            Volume columns.
        lags: number of lags of the closing price used to build the lagged
            returns features of the machine learning model.
    Output:
        tsret: dataframe indexed by date with the independent variables (X)
            and the dependent variable (y) of the machine learning model.
    '''
    # Change data types of prices dataframe from object to numeric
    prices = prices.apply(pd.to_numeric)

    # Create the new lagged DataFrame
    inputs = pd.DataFrame(index=prices.index)
    inputs["Close"] = prices["Close"]
    inputs["Volume"] = prices["Volume"]

    # Create the shifted lag series of prior trading period close values
    for i in range(0, lags):
        inputs["Lag%s" % str(i+1)] = prices["Close"].shift(i+1)

    # Create the returns DataFrame (VolumeChange is the same-day percent
    # change in volume; it is not lagged)
    tsret = pd.DataFrame(index=inputs.index)
    tsret["VolumeChange"] = inputs["Volume"].pct_change()
    tsret["returns"] = inputs["Close"].pct_change()*100.0

    # If any of the percentage returns are (almost) zero, set them to a small
    # positive number so that Direction is never 0
    tsret.loc[tsret["returns"].abs() < 0.0001, "returns"] = 0.0001

    # Create the lagged percentage returns columns
    for i in range(0, lags):
        tsret["Lag%s" % str(i+1)] = \
            inputs["Lag%s" % str(i+1)].pct_change()*100.0

    # Create the "Direction" column (+1 or -1) indicating an up/down day
    tsret = tsret.dropna().copy()
    tsret["Direction"] = np.sign(tsret["returns"])

    # Convert index to datetime in order to filter the dataframe by dates
    # when we create the train and test dataset
    tsret.index = pd.to_datetime(tsret.index)
    return tsret


# Pass the dataset (data) and the number of lags (2) to model_variables
variables_data = model_variables(data, 2)

# Use the prior two days of returns and the volume change as predictor
# values, with direction as the response
dataset = variables_data[["Lag1", "Lag2", "VolumeChange", "Direction"]]
dataset = dataset.dropna()

# Create the independent variables (X) and the dependent variable (y)
X = dataset[["Lag1", "Lag2", "VolumeChange"]]
y = dataset["Direction"]

# Split the train and test dataset using the date in the date_split variable.
# This creates a train dataset of 4 years of data and a test dataset of a
# little under 9 months of data.
date_split = datetime.datetime(2019, 1, 1)
X_train = X[X.index <= date_split]
X_test = X[X.index > date_split]
y_train = y[y.index <= date_split]
y_test = y[y.index > date_split]

# Create the (parametrised) models
print("Hit Rates/Confusion Matrices:\n")
models = [
    ("LR", LogisticRegression()),
    ("LDA", LDA()),
    ("LSVC", LinearSVC()),
    ("RSVM", SVC(
        C=1000000.0, cache_size=200, class_weight=None, coef0=0.0,
        degree=3, gamma=0.0001, kernel='rbf', max_iter=-1,
        probability=False, random_state=None, shrinking=True,
        tol=0.001, verbose=False)),
    # max_features='sqrt' is what the older 'auto' setting meant for
    # classifiers; recent scikit-learn releases no longer accept 'auto'
    ("RF", RandomForestClassifier(
        n_estimators=1000, criterion='gini', max_depth=None,
        min_samples_split=2, min_samples_leaf=30, max_features='sqrt',
        bootstrap=True, oob_score=False, n_jobs=1, random_state=None,
        verbose=0)),
]

# Iterate through the models and obtain the accuracy metrics:
# Hit Rate and Confusion Matrix
for name, model in models:
    # Train each of the models on the training set
    model.fit(X_train, y_train)

    # Make an array of predictions on the test set
    pred = model.predict(X_test)

    # Output the hit rate and the confusion matrix for each model.
    # Predictions are passed first, so rows are predicted classes and columns
    # are actual classes; labels are sorted, so down (-1) precedes up (+1).
    print("%s:\n%0.3f" % (name, model.score(X_test, y_test)))
    print("%s\n" % confusion_matrix(pred, y_test))
```
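The code prints the Hit Rate and the Confusion Matrix for each of the trained models. The diagonal of each matrix holds the correct predictions (up and down), and the off-diagonal cells hold the incorrect predictions (the prediction was down and the price went up, or the prediction was up and the price went down). Because the code passes the predictions as the first argument to confusion_matrix and scikit-learn sorts the labels, rows are predicted classes and columns are actual classes, with down (-1) listed before up (+1).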
```
LR:
0.583
[[28 29]
 [46 77]]

LDA:
0.589
[[28 28]
 [46 78]]

LSVC:
0.583
[[28 29]
 [46 77]]

RSVM:
0.572
[[20 23]
 [54 83]]

RF:
0.600
[[34 32]
 [40 74]]
```
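As a quick consistency check, each hit rate can be recomputed from its matrix as the sum of the diagonal divided by the sum of all cells. The sketch below copies the matrices from the output above; the names `results` and `hit_rate_from_matrix` are just for illustration.

```python
# Confusion matrices copied from the printed output.
results = {
    "LR":   [[28, 29], [46, 77]],
    "LDA":  [[28, 28], [46, 78]],
    "LSVC": [[28, 29], [46, 77]],
    "RSVM": [[20, 23], [54, 83]],
    "RF":   [[34, 32], [40, 74]],
}


def hit_rate_from_matrix(m):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = m[0][0] + m[1][1]
    total = sum(sum(row) for row in m)
    return correct / total


for name, m in results.items():
    print("%s: %0.3f" % (name, hit_rate_from_matrix(m)))
```

Every matrix sums to 180 test days, and the recomputed values agree with the hit rates reported by `score`.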
The model with the best hit rate is the Random Forest, at 60%. Because random_state is left unset, this figure can vary slightly between runs. We changed the min_samples_leaf parameter of the Random Forest Classifier from its default value of 1 to 30.
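This means that each leaf node holds at least 30 training observations: a split is only considered if it leaves at least 30 training samples in each of the left and right branches.
In the printed matrices the rows are predictions and the columns are actual outcomes, with down (-1) listed before up (+1), so the printed layout is [[DT, UF], [DF, UT]]. Read this way, all the models identify up periods much better than down periods: the share of actual up days predicted correctly, UT/(UT+UF), is significantly higher than the share of actual down days predicted correctly, DT/(DT+DF). For Logistic Regression, for example, these are 77/106 ≈ 73% and 28/74 ≈ 38%.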
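The per-class rates can be read off the printed matrices with a few lines of arithmetic. This sketch assumes the layout produced by the call `confusion_matrix(pred, y_test)` in the code above: scikit-learn sorts the labels, so index 0 is down (-1) and index 1 is up (+1), rows are predictions, and columns are actual outcomes. The function name `class_rates` is made up for illustration.

```python
def class_rates(m):
    """Share of actual down days and of actual up days predicted correctly.

    Assumes m[i][j] counts days predicted as class i whose actual class is j,
    with index 0 = down and index 1 = up.
    """
    actual_down = m[0][0] + m[1][0]   # first column: all actual down days
    actual_up = m[0][1] + m[1][1]     # second column: all actual up days
    return m[0][0] / actual_down, m[1][1] / actual_up


# Confusion matrices copied from the printed output.
matrices = {
    "LR":   [[28, 29], [46, 77]],
    "LDA":  [[28, 28], [46, 78]],
    "LSVC": [[28, 29], [46, 77]],
    "RSVM": [[20, 23], [54, 83]],
    "RF":   [[34, 32], [40, 74]],
}

for name, m in matrices.items():
    down, up = class_rates(m)
    print("%s: down %0.3f, up %0.3f" % (name, down, up))
```

Under that layout assumption, the test set contains 74 down days and 106 up days, and every model identifies a larger share of the up days than of the down days.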