- Machine Learning with Python
- What is Machine Learning?
- Data Preprocessing in Data Science and Machine Learning
- Feature Selection in Machine Learning
- Train-Test Datasets in Machine Learning
- Evaluate Model Performance - Loss Function
- Model Selection in Machine Learning
- Bias Variance Trade Off
- Supervised Learning Models
- Multiple Linear Regression
- Logistic Regression
- Logistic Regression in Python using scikit-learn Package
- Decision Trees in Machine Learning
- Random Forest Algorithm in Python
- Support Vector Machine Algorithm Explained
- Multivariate Linear Regression in Python with scikit-learn Library
- Classifier Model in Machine Learning Using Python
- Cross Validation to Avoid Overfitting in Machine Learning
- K-Fold Cross Validation Example Using Python scikit-learn
- Unsupervised Learning Models
- K-Means Algorithm Python Example
- Neural Networks Overview

# Decision Trees in Machine Learning

** Decision tree** is a popular Supervised learning algorithm which can handle

**and**

*classification***problems. For both problems, the algorithm breaks down a dataset into smaller subsets by using if-then-else decision rules within the features of the data.**

*regression*The general idea of a decision tree is that each of the features are evaluated by the algorithm and used to split the tree based on the capacity that they have to explain the target variable. The features could be categorical or continuous variables.

In the process of building the tree, the most important features are selected by the algorithm in a top-down approach creating the ** decision nodes** and

**of the tree and making predictions at points when the tree cannot be expanded any more.**

*branches*The tree is composed by a ** root node**,

**and**

*decision nodes***. The**

*leaf nodes***is the top most decision on the tree, and is the first time where the tree is split based on the best predictor of the dataset.**

*root node*The ** decision nodes** are the intermediate steps in the construction of the tree and are the nodes used to split the tree based on different values of the independent variables or features of the model. The

**represent the end points of the tree, and hold the prediction of the model.**

*leaf nodes***Decision Tree Structure**

The methods to split the tree are different depending if we are in a ** classification** or regression problem. We will provide an overview of the different measures that are used to split the tree in both types of problems.

**Classification Problem**

In this type of problem the algorithm decides which is the best feature to split the tree based on two measures that are ** Entropy** and

**.**

*Information Gain***Entropy**

** Entropy** is a coefficient that takes values between 0 and 1. It is a measure of randomness in data. If

**is equal to zero, it means that all the data belongs to the same class, and if entropy is equal to 1, it means that half of the data belong to one class and the half belong to the other class (in binary classification). In this last situation, the data is perfect randomness.**

*entropy*The mathematical formula of entropy is as follows:

The goal of a **decision tree** problem is to reduce the measure of **entropy** while building the tree. As we stated above, **entropy** is a measure of the randomness of the data, and this measure needs to be reduced. For this purpose, the entropy is combined with the **Information Gain** in order to select the feature that has higher value of **Information Gain**.

**Information Gain**

The ** Information gain** is based on the decrease in

**E****

*ntropy*

**after a dataset is split on an attribute or feature. The attribute that has the high value of**

*Information Gain*

**will be used to split the tree in nodes in a top-down approach.The mathematical formula of**

*Information Gain*** is the following:

In the equation above, the ** Information Gain** subtracts the

**of the target variable Y given X (E(Y/X)) from the entropy of the target variable Y (EY), to calculate the reduction of uncertainty of**

*entropy***given additional piece of information**

*Y***.**

*X*The aim of the ** decision tree** in this type of problem is to reduce the

**of the target variable. For this, the**

*entropy***algorithm would use the**

*decision tree***and the**

*Entropy***of each feature to decide what attribute will provide more information (or reduce the uncertainty of the target variable).**

*Information Gain***Decision Tree Regression Problem**

In the **regression** problem, the tree is composed (like in the classification problem) with ** nodes**,

**and**

*branches***, where each**

*leaf nodes***represents a feature or attribute, each**

*node***represents a decision rule and the**

*branch***represent an estimation of the target variable.**

*leaf nodes*Contrary to the ** classification** problem the estimation of the target variable in a

**problem is a continuous value and not a categorical value. The**

*regression*

*ID3***(Iterative Dichotomiser 3) can be used to construct a**

*algorithm***for**

*decision tree***which replace the**

*regression***metric with the**

*Information Gain***.**

*Standard Deviation Reduction (SDR)*The ** Standard Deviation Reduction** is based on the decrease in standard deviation after a dataset is split on an attribute. The tree will be split by the attribute with the higher

**of the target variable.**

*Standard Deviation Reduction*The standard deviation is used as a measure of homogeneity of the data. The tree is built top-down from the ** root node** and data is partitioned into subsets that contain homogeneous instances of the target variable. If the samples are homogeneous, its standard deviation is zero.

The mathematical formula of the ** Standard Deviation Reduction** is the following:

** Y** is target variable

** X** is a specific attribute of the dataset

SDRY,X is the reduction of the standard deviation of the target variable Y given the information of X

SY is the standard deviation of the target variable

** S(Y,X)** is the standard deviation of the target variable given the information of X

The calculation of ** S(Y,X)** can be broken down with the following formula that is important to understand the process:

Sc is the standard deviation of the target variable based on unique values of the independent variable.

Pc number of times (probability) of the value of the independent variables over the total values of the independent variable

So the ** S(Y,X)** is the sum of the probability of each of the values of the independent variable multiplied by the standard deviation of the target variable based on those values of the independent variable.

The Standard deviation ** S(Y,X)** is calculated for each feature or independent variable of the dataset. The feature with the great value of

**is the feature that has more capability to explain the variations in the target variable, therefore is the feature that the decision tree algorithm will use to split the tree.**

*S(Y,X)*The same idea is that the feature that will be used to split the tree is the one that has the greatest ** Standard Deviation Reduction. The overall process** of splitting a decision tree under the

**can be synthetized in the following steps:**

*Standard Deviation Reduction*- Calculate the standard deviation of the target variable
- Calculate the
for all the independent variables*Standard Deviation Reduction* - The feature with the higher value on the
is chosen for the decision node.*Standard Deviation Reduction* - In each branch, the
(*Coefficient Variation*) is calculated. This coefficient is a measure of homogeneity in the target variable. CV = Sx*100 (Standard deviation) /(mean of the target variable) * 100*CV* - If the
is less than certain threshold such as 10% that subset does not need further splitting and the node become a*CV*.*leaf node*

**Overfitting in Decision Tree**

One of the most common problems with ** decision tree** is that they tend to overfit a lot. There are situations where the tree is built with specific features of the training data and the predictive power for unseen data is reduced.

This is because the ** branches** of the tree follow strict and specific rules of the train dataset but for samples that are not part of the training set the accuracy of the model is not good. There are two main ways to address over-fitting in

**. One is by using the**

*Decision tree***algorithm and the second is by**

*Random Forest***the tree.**

*pruning***Random Forest**

Random Forest avoid overfitting as it is an ensemble of ** n decision trees** (not just one) with

**results in the end. The final result of**

*n***is the most frequent response variable among the**

*Random forest***results of**

*n***if the target variable is a categorical variable or the average of the target variable in the**

*the n Decision Trees***if the target variable is a continuous variable.**

*leaf node***Pruning**

Pruning is done after the initial training is complete, and involves the removal of ** nodes** and

**starting from the**

*branches***such that the overall accuracy is not affected. The pruning technique will reduce the size of the tree by removing nodes that provide little power to classify or predict new instances.**

*leaf nodes*