Feature Selection in Machine Learning
Feature selection is one of the core concepts in machine learning and has a large impact on model performance: irrelevant or only partially relevant features can degrade it.
The goal is to select the features that contribute most to predicting the target variable. To get an idea of which features could have more predictive power in a machine learning model, we will load Open, High, Low, Close, Volume (OHLCV) data for AMZ stock and create some new features using Python.
Afterwards, we will use data visualizations and other common approaches to choose among those features.
The example below has 5 main steps:
- Import the Python libraries that will be used.
- Calculate technical indicators with the get_technical_indicators() function.
- Plot scatterplots of each feature against the target variable AdjClose.
- Make a Heat Map to show the correlation between each of the features and the target variable.
- Fit a Random Forest model to extract the feature importance of the independent variables (we will explain this algorithm in later sections, but for now we will use some of the tools that Random Forest provides).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
The data for AMZ stock is loaded into the environment using the pandas read_csv() function.
# Adjust the path below so that it points to the location of AMZ.csv on your computer.
amz = pd.read_csv("C:/Users/Nicolas/Documents/Machine_Learning_Course/AMZ.csv")
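Before computing any indicators, it can help to confirm that the loaded dataframe has the columns the rest of the example relies on (High, Low and AdjClose). The sketch below is only an illustration: the helper missing_columns() is hypothetical, and the small in-memory CSV (including its Date column and its made-up numbers) is a stand-in for AMZ.csv, whose exact layout is not shown here.

```python
import io
import pandas as pd

# Columns that get_technical_indicators() reads from the dataframe.
REQUIRED_COLUMNS = ["High", "Low", "AdjClose"]

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]

# Made-up stand-in for AMZ.csv; the values and the Date column are assumptions.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,AdjClose,Volume\n"
    "2019-01-02,10.0,11.0,9.5,10.5,10.5,1000\n"
    "2019-01-03,10.5,12.0,10.0,11.5,11.5,1200\n"
)
sample = pd.read_csv(sample_csv)

print(missing_columns(sample))                              # nothing missing
print(missing_columns(sample.drop(columns=["AdjClose"])))   # AdjClose reported
```

With the real file, the same check would be missing_columns(amz); an empty list means the indicator function can find every column it needs.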
We pass the amz dataframe, which holds the OHLCV data for the AMZ ticker, to the get_technical_indicators() function. This function generates technical indicators to be used as features for a machine learning model.
def get_technical_indicators(dataset):
    '''
    params:
        dataset: OHLCV data for AMZ ticker from 1999-09-01 to 2019-09-20
    returns:
        features dataframe with the calculations of all technical indicators such as
        MACD, 20 period’s standard deviation, ROC, CCI, EMA
    '''
    # Sort rows by index, oldest first (assumes the rows are indexed in date order)
    dataset.sort_index(inplace=True)
    # Create the features dataframe to store only the features
    features = pd.DataFrame(index=dataset.index)
    # Create 7 and 21 days Moving Average
    features['ma7'] = dataset['AdjClose'].rolling(window=7).mean()
    features['ma21'] = dataset['AdjClose'].rolling(window=21).mean()
    # Create MACD (12-period EMA minus 26-period EMA)
    features['26ema'] = dataset['AdjClose'].ewm(span=26).mean()
    features['12ema'] = dataset['AdjClose'].ewm(span=12).mean()
    features['MACD'] = features['12ema'] - features['26ema']
    # Create Bollinger Bands (centred on ma21, width from the 20-period standard deviation)
    features['20sd'] = dataset['AdjClose'].rolling(20).std()
    features['upper_band'] = features['ma21'] + (features['20sd'] * 2)
    features['lower_band'] = features['ma21'] - (features['20sd'] * 2)
    # Create Exponential moving average
    features['ema'] = dataset['AdjClose'].ewm(span=20).mean()
    # ROC Rate of Change over 10 periods
    N = dataset['AdjClose'].diff(10)
    D = dataset['AdjClose'].shift(10)
    features['ROC'] = N / D
    # CCI Commodity Channel Index
    # Note: the standard CCI divides by the mean absolute deviation; this version uses the rolling standard deviation
    TP = (dataset['High'] + dataset['Low'] + dataset['AdjClose']) / 3
    features['CCI'] = (TP - TP.rolling(20).mean()) / (0.015 * TP.rolling(20).std())
    # Create Average True Range (simplified: TR is High - Low and ignores gaps from the previous close)
    features['TR'] = dataset['High'] - dataset['Low']
    features['ATR'] = features['TR'].ewm(span=10).mean()
    return features
# Store the output of get_technical_indicators() function in the features object
features = get_technical_indicators(amz)
The features object stores all the calculations of the get_technical_indicators() function.
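One detail of these calculations is worth seeing in isolation: rolling windows return NaN until the window is full, while ewm() with its default settings produces a value from the first row. This is why rows are dropped with dropna() before a model is fitted later on. The following minimal sketch uses 30 made-up prices (not the AMZ data) to count the missing values.

```python
import numpy as np
import pandas as pd

# 30 made-up prices: 1.0, 2.0, ..., 30.0 (illustration only, not AMZ data)
prices = pd.Series(np.arange(1.0, 31.0))

ma7 = prices.rolling(window=7).mean()    # first 6 rows have no full window
ma21 = prices.rolling(window=21).mean()  # first 20 rows have no full window
ema20 = prices.ewm(span=20).mean()       # defined from the first row by default

nan_counts = {
    "ma7": int(ma7.isna().sum()),
    "ma21": int(ma21.isna().sum()),
    "ema20": int(ema20.isna().sum()),
}
print(nan_counts)  # {'ma7': 6, 'ma21': 20, 'ema20': 0}
```

In the features dataframe, the longest rolling window among the selected columns (21 periods for ma21) therefore determines how many leading rows are lost once missing values are dropped.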
# Retrieve the features that will be used in the vars_ dataframe
vars_ = features[['MACD','20sd','TR','ma21','ROC','CCI']]
# The correlation() function draws a scatterplot of each feature against the target variable AdjClose
jet = plt.get_cmap('jet')
colors = iter(jet(np.linspace(0, 1, 10)))
def correlation(df, features, variables, n_rows, n_cols):
    fig = plt.figure(figsize=(8, 6))
    for i, var in enumerate(variables):
        ax = fig.add_subplot(n_rows, n_cols, i + 1)
        asset = features.loc[:, var]
        # Use color= so the RGBA value is read as a single colour for all points
        ax.scatter(df["AdjClose"], asset, color=next(colors))
        ax.set_xlabel("AdjClose")
        ax.set_ylabel(var)
        ax.set_title(var + " vs AdjClose")
    fig.tight_layout()
    plt.show()
columns = vars_.columns
correlation(amz, vars_, columns, 2, 3)
Figure 1: Feature Selection Scatterplots
The scatterplots show a very strong positive correlation between the ma21 variable and AdjClose. We can also see a positive correlation of both the 20sd and the TR variables with the target variable AdjClose.
On the other hand, the features CCI and MACD are only weakly correlated with the target variable AdjClose.
Heat Maps are another tool that we can use to explore the relevance of each feature with respect to the target variable. The following lines make a Heat Map between the features and the target variable:
# Copy the vars_ dataframe into a new dataframe called df.
# Add the target variable to df to build a correlation matrix among the features and the target variable.
# Finally, show a Heat Map with the AdjClose row of the correlation matrix.
df = vars_.copy()
df['AdjClose'] = amz['AdjClose']
colormap = plt.cm.inferno
plt.figure(figsize=(10,5))
corr = df.corr()
sns.heatmap(corr[corr.index == 'AdjClose'], linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
plt.show()
Figure 2: Heat Map between AdjClose variable and features
From the Heat Map we observe that ma21 has a near-perfect positive correlation with the AdjClose variable, which is expected because ma21 is an average of recent AdjClose values. 20sd and TR also correlate positively with AdjClose, while MACD and CCI don’t show significant correlation with it.
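The same information can be read as a ranked list instead of a picture. The sketch below is an illustration under stated assumptions: it uses a simulated price series (a seeded random walk, not the AMZ data) and rebuilds only three of the features, then sorts them by the absolute value of their correlation with AdjClose.

```python
import numpy as np
import pandas as pd

# Simulated prices (assumption: a seeded geometric random walk, not real AMZ data)
rng = np.random.default_rng(0)
log_returns = rng.normal(0.0005, 0.01, 2000)
price = pd.Series(100 * np.exp(log_returns.cumsum()), name="AdjClose")

demo = pd.DataFrame({
    "ma21": price.rolling(21).mean(),
    "20sd": price.rolling(20).std(),
    "ROC": price.diff(10) / price.shift(10),
    "AdjClose": price,
})

# Correlation of each feature with the target, strongest (by absolute value) first
corr_with_target = demo.corr()["AdjClose"].drop("AdjClose")
order = corr_with_target.abs().sort_values(ascending=False).index
ranking = corr_with_target.reindex(order)
print(ranking)
```

On the lesson's own data the equivalent would be df.corr()["AdjClose"], using the df dataframe built for the Heat Map. A moving average of the price lands at the top of such a ranking almost by construction, which is worth remembering before treating it as an informative feature.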
Lastly, we will use an attribute provided by the Random Forest algorithm that gives information about feature importance. First we need to fit the model; the fitted model object then exposes an attribute called feature_importances_ (these steps are explained in later sections, but here we only want to inspect feature_importances_). In the code below, we fit a Random Forest Regressor between the features and the target variable and read its feature_importances_ attribute. Note that ma21 is left out of the model inputs. Finally, we make a bar plot of the feature importance scores.
# Fit Random Forest Regressor and extract feature importance
prices = pd.DataFrame(amz['AdjClose'])
vars_model = prices.join(vars_)
vars_model = vars_model.dropna()
# Features passed to the model (ma21 is left out)
columns = ['MACD','20sd','TR','ROC','CCI']
X = vars_model[columns].values
y = vars_model['AdjClose'].values
forest = RandomForestRegressor(n_estimators=1000)
forest = forest.fit(X, y)
importances = forest.feature_importances_
# Pair each score with its feature name, in the same order as the columns of X
values = list(zip(columns, importances))
headers = ['feature','score']
values_df = pd.DataFrame(values, columns=headers)
# Plot the feature importance
nd = np.arange(len(columns))
bars = plt.bar(nd, values_df['score'].values, color=sns.color_palette("deep", 5))
plt.legend(bars, columns, loc='upper right', bbox_to_anchor=(1.1, 1), title="Feature Importance")
plt.show()
Figure 3: Feature Importance
The bar plot of the feature importance shows that 20sd has the greatest importance in the prediction of the target variable AdjClose. The TR feature comes second, with significantly lower importance.
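The scores read from feature_importances_ are impurity-based and always sum to 1, so they describe how the importance is shared among the features rather than how good the model is. As a cross-check that is not part of the original example, scikit-learn also offers permutation importance, which measures how much the model score drops when one feature is shuffled. The sketch below compares both on synthetic data (one informative column and two noise columns, all made up), and for brevity evaluates on the training data; in practice this is usually done on held-out data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data (assumption): only "signal" actually drives the target
rng = np.random.default_rng(42)
n = 300
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=n)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importance: one score per feature, summing to 1
impurity = pd.Series(model.feature_importances_, index=X_demo.columns)

# Permutation importance: mean drop in score when a feature is shuffled
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=0)
permutation = pd.Series(result.importances_mean, index=X_demo.columns)

comparison = pd.DataFrame({"impurity": impurity, "permutation": permutation})
print(comparison.sort_values("permutation", ascending=False))
```

When two features carry similar information, as 20sd and TR may do here since both measure price dispersion, the two measures can disagree, so it is reasonable to look at both before discarding a feature.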
In this section we provided some tools to analyze the predictive power of the features before starting to train a machine learning model. It is important to inspect the independent variables first in order to select the best features.
After inspecting the features, the next step is training and testing, for which it is necessary to split the data into train and test sets. This is explained in the following section.