Feature Selection in Machine Learning

Feature selection is one of the core concepts in machine learning and has a large effect on model performance. Irrelevant or only partially relevant features can degrade that performance.

In this process, we select the features that contribute most to predicting the target variable. To get an idea of which features may have more predictive power in a machine learning model, we will load Open, High, Low, Close, Volume (OHLCV) data for AMZ stock and create some new features using Python.

Afterwards, we will use data visualizations and other common approaches to decide which features to keep.

We will carry out this process in Python. The example below has 5 main steps:

  • Import the Python libraries that will be used.
  • Calculate Technical Indicators with the get_technical_indicators() function.
  • Plot scatterplots of the features against the target variable _AdjClose_.
  • Make a heat map to show the correlation between each of the features and the target variable.
  • Fit a Random Forest model to extract the feature importance of the independent variables (this algorithm is explained in later sections; for now we only use some of the tools that Random Forest provides).

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor

The data for AMZ stock is loaded into the environment using the pandas read_csv() function.

# To read the file, provide the path to the correct directory on your computer.
amz = pd.read_csv("C:/Users/Nicolas/Documents/Machine_Learning_Course/AMZ.csv")
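
To make the remaining steps concrete, here is a minimal sketch of how they might look. It is not the lesson's own code. It runs on synthetic OHLCV data rather than the AMZ file, the column names (including AdjClose) are assumed, and the body of get_technical_indicators() is a hypothetical stand-in that only adds a few rolling features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the OHLCV file (assumption: the real file has
# similar columns; its actual names and contents may differ).
rng = np.random.default_rng(0)
n = 300
close = 100 + np.cumsum(rng.normal(0, 1, n))
prices = pd.DataFrame({
    "Open": close + rng.normal(0, 0.5, n),
    "High": close + np.abs(rng.normal(0, 1, n)),
    "Low": close - np.abs(rng.normal(0, 1, n)),
    "Close": close,
    "AdjClose": close,  # identical to Close in this toy data
    "Volume": rng.integers(1_000, 10_000, n),
})


def get_technical_indicators(df):
    """Hypothetical version: add a few rolling features to a copy of df."""
    out = df.copy()
    out["MA7"] = out["Close"].rolling(7).mean()            # 7-day moving average
    out["MA21"] = out["Close"].rolling(21).mean()          # 21-day moving average
    out["Volatility21"] = out["Close"].rolling(21).std()   # 21-day rolling std
    out["Return1"] = out["Close"].pct_change()             # 1-day return
    # Rolling windows leave NaN in the first rows; drop them.
    return out.dropna()


features = get_technical_indicators(prices)

# Correlation of every feature with the target. The full matrix `corr`
# is what a heat map (for example seaborn's sns.heatmap(corr)) would display.
corr = features.corr()
target_corr = corr["AdjClose"].drop("AdjClose").sort_values(ascending=False)

# Random Forest feature importance. Close is dropped from the inputs
# because it equals the target in this synthetic data.
X = features.drop(columns=["AdjClose", "Close"])
y = features["AdjClose"]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(target_corr)
print(importances)
```

Features that rank high in both the correlation list and the importance list are reasonable candidates to keep. Note that the importances are relative shares that sum to 1, so they describe how the forest split on the features, not how well the model predicts.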
