The Random Forest algorithm can be described in the following conceptual steps:
- Select k features at random from the m features in the dataset (k < m) and build a decision tree using only those features.
- Repeat this n times to obtain n decision trees, each built from a different random combination of k features.
- Pass the new record to each of the n decision trees and store each tree's prediction, giving n outcomes in total.
- If the target variable is categorical, each tree in the forest predicts the category to which the new record belongs, and the record is assigned to the category that receives the most votes.
- If the target variable is continuous, each tree in the forest predicts a value for the target variable, and the final value is the average of the values predicted by all trees in the forest.
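The steps above can be sketched in a few lines of Python. This is a minimal illustration of the simplified procedure as described here, not scikit-learn's own implementation: the helper names `fit_simple_forest` and `predict_majority`, the synthetic dataset, and the values chosen for n and k are all assumptions made for the example. Library implementations typically also train each tree on a bootstrap sample of the records and redraw the feature subset at every split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_simple_forest(X, y, n_trees=25, k=3, seed=0):
    """Build n_trees decision trees, each on k randomly chosen features (k < m)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    forest = []
    for _ in range(n_trees):
        # Random subset of k distinct feature indices for this tree.
        cols = rng.choice(m, size=k, replace=False)
        tree = DecisionTreeClassifier(random_state=0).fit(X[:, cols], y)
        forest.append((cols, tree))
    return forest


def predict_majority(forest, X_new):
    """Collect one prediction per tree and return the majority vote per record."""
    votes = np.stack([tree.predict(X_new[:, cols]) for cols, tree in forest])
    votes = votes.astype(int)  # shape: (n_trees, n_records)
    # Count the votes for each class and keep the most frequent one.
    return np.array([np.bincount(record_votes).argmax() for record_votes in votes.T])


# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
forest = fit_simple_forest(X, y, n_trees=25, k=3)
predictions = predict_majority(forest, X[:5])
```

For a continuous target, the same sketch would use `DecisionTreeRegressor` and replace the vote count with the mean of the per-tree predictions.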
Using the scikit-learn package for Python, a Random Forest model can be fitted and tuned through parameters that instruct the algorithm how to construct the trees in the forest.
The scikit-learn library exposes several important tree-construction parameters that can increase the predictive power of the model or make it faster.
Parameters to increase the predictive power:
- n_estimators: the number of trees in the forest. In general, a higher number of trees makes the predictions more stable and can improve performance, at the cost of longer training time. The default value of this parameter is 100 (it was 10 in scikit-learn versions before 0.22).
- max_features: the maximum number of features from the dataset that the Random Forest considers when looking for the best split in an individual tree. For RandomForestClassifier the default is sqrt(n_features), the square root of the number of features; for RandomForestRegressor the default uses all features.
- min_samples_leaf: the minimum number of observations required in each leaf node. A split is only considered if it leaves at least this many observations in each of the resulting branches. The default value of this parameter is 1.
- max_depth: the maximum depth of the tree. The depth of a decision tree is the length of the longest path from the root to a leaf. The default value of this parameter is None, in which case nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.
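As a rough illustration of how these four parameters are passed to the estimator, the sketch below fits a `RandomForestClassifier` on synthetic data. The dataset and the specific parameter values are assumptions chosen for the example, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=2,    # each leaf must keep at least 2 observations
    max_depth=8,           # no tree may grow deeper than 8 levels
    random_state=0,
)
model.fit(X_train, y_train)

# Mean accuracy on the held-out records.
accuracy = model.score(X_test, y_test)
```

Suitable values depend on the data, so in practice they are usually compared with cross-validation rather than fixed in advance.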
Parameters that affect model speed, reproducibility, and validation:
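- n_jobs: the number of jobs to run in parallel for fit and predict. The default, None, means a single job in most contexts; -1 uses all available processors.
- random_state: controls the randomness used when sampling records and features. Setting it to a fixed integer makes the model reproducible, as it outputs the same result every time it is run on the same data.
- oob_score: Boolean value that enables an out-of-bag estimate of model performance. Each tree is trained on a bootstrap sample of the data, and the records left out of that sample are used to evaluate it, so no separate test set is needed for this estimate. The default value is False.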
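The following sketch shows these three parameters together. The synthetic dataset, the helper name `fit_forest`, and the chosen values are assumptions for illustration only. After fitting with `oob_score=True`, the out-of-bag estimate is available in the `oob_score_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)


def fit_forest():
    """Fit a forest with a fixed seed, parallel training, and OOB scoring."""
    return RandomForestClassifier(
        n_estimators=100,
        n_jobs=-1,         # use all available processors
        random_state=42,   # fixed seed so repeated runs build the same trees
        oob_score=True,    # evaluate each tree on records left out of its bootstrap sample
    ).fit(X, y)


first = fit_forest()
second = fit_forest()

oob_accuracy = first.oob_score_  # out-of-bag accuracy estimate
# With the same seed and data, both runs should report the same estimate.
reproducible = first.oob_score_ == second.oob_score_
```

Note that `oob_score=True` relies on bootstrap sampling, which is enabled by default in scikit-learn's Random Forest estimators.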