Random Forest Algorithm in Python

The Random Forest algorithm can be described in the following conceptual steps:

  • Select k features randomly from the m features in the dataset (k < m) and build a decision tree from those features.
  • Repeat this n times in order to have n decision trees from different random combinations of k features.
  • Pass the new record to each of the n decision trees and store each tree's prediction, giving a total of n outcomes from n decision trees.
  • If the target variable is a categorical variable, each tree in the forest would predict the category to which the new record belongs and the new record is assigned to the category that has the majority vote. 
  • If the target variable is a continuous variable, each tree in the forest predicts a value for the target variable and the final value is calculated by taking the average of all the values predicted by the trees that are part of the forest.
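
The steps above can be sketched in a few lines. This is a minimal illustration of the simplified description only, assuming synthetic data from make_classification and hypothetical values for k and n_trees. scikit-learn's own Random Forest differs in detail: it also draws a bootstrap sample of the rows for each tree and selects candidate features at every split rather than once per tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only): 300 records, m = 8 features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_train, y_train = X[:250], y[:250]
X_new = X[250:]          # 50 "new" records to predict

rng = np.random.default_rng(0)
m = X_train.shape[1]     # total number of features
k = 3                    # features per tree, k < m (assumed value)
n_trees = 25             # n in the steps above (assumed value)

forest = []
for _ in range(n_trees):
    # Steps 1-2: pick k of the m features at random and fit one tree on them.
    feature_idx = rng.choice(m, size=k, replace=False)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[:, feature_idx], y_train)
    forest.append((feature_idx, tree))

# Step 3: every tree predicts every new record -> shape (n_trees, n_records).
votes = np.array([tree.predict(X_new[:, idx]) for idx, tree in forest])

# Step 4 (categorical target): the majority vote per record wins.
majority = np.array([np.bincount(column).argmax() for column in votes.T])

# Step 5 (continuous target) would average instead: votes.mean(axis=0)
print(majority[:10])
```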

Using the scikit-learn package for Python, it is possible to fit and tune a Random Forest model through parameters that instruct the algorithm how to construct the trees that make up the forest.
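
As a minimal sketch of what this looks like in practice, the example below fits a RandomForestClassifier with its default parameters. The synthetic dataset, the 75/25 split, and the seed values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a forest with default settings, then predict the held-out records.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# score() returns the mean accuracy for classifiers.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

For a continuous target, RandomForestRegressor from the same module is used in the same way.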

Hyperparameter Tuning

The scikit-learn library lets you tune several parameters of the tree construction that can increase the predictive power of the model or make the model faster.

Parameters to increase the predictive power:

  • n_estimators: the number of trees in the forest. In general, a higher number of trees makes the predictions more stable and tends to improve performance, at the cost of longer training and prediction times. The default value is 100 in scikit-learn 0.22 and later (it was 10 in earlier versions).
  • max_features: the maximum number of features the Random Forest considers when looking for the best split at each node of an individual tree. For classification the default is sqrt(n_features), the square root of the number of features; for regression the default is to consider all features.
  • min_samples_leaf: the minimum number of observations required in each leaf node. A split is only considered if it leaves at least this many observations in each of the resulting branches. The default value of this parameter is 1.
  • max_depth: the maximum depth of the tree. The depth of a decision tree is the length of the longest path from the root to a leaf. The default value of this parameter is None. In this case nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.
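
One common way to tune these four parameters is a cross-validated grid search. The sketch below uses GridSearchCV; the candidate values in param_grid and the synthetic dataset are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

# Small, hypothetical grid over the parameters described above.
param_grid = {
    "n_estimators": [25, 50],
    "max_features": ["sqrt", 0.5],   # 0.5 means half of the features
    "min_samples_leaf": [1, 5],
    "max_depth": [None, 5],
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```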

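Parameters for speed, reproducibility, and validation:

  • n_jobs: the number of jobs to run in parallel for fit and predict. The default of None means one job; -1 uses all available processors.
  • random_state: controls the randomness of the bootstrap sampling and of the feature selection. Setting it to a fixed integer makes the model reproducible, as it will output the same result every time the model is run on the same data. It does not affect speed.
  • oob_score: Boolean value (default False). When True, the model performance is estimated on the out-of-bag samples, that is, the observations left out of each tree's bootstrap sample, so no separate test set is needed for this estimate. It requires bootstrap sampling to be enabled, which is the default.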
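
The short sketch below shows these three parameters together. The helper fit_forest, the seed value, and the synthetic dataset are hypothetical and used for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data (illustrative only).
X, y = make_classification(n_samples=400, n_features=10, random_state=1)

def fit_forest(seed):
    """Hypothetical helper: fit a forest in parallel with OOB scoring."""
    model = RandomForestClassifier(
        n_estimators=100,
        n_jobs=-1,          # use all available processors
        random_state=seed,  # fixed seed -> reproducible forest
        oob_score=True,     # score on out-of-bag samples after fitting
    )
    return model.fit(X, y)

first = fit_forest(7)
second = fit_forest(7)

# The same seed and data give the same predictions on both runs.
same_predictions = np.array_equal(first.predict(X), second.predict(X))
print(f"Reproducible: {same_predictions}")

# Accuracy estimated from out-of-bag samples, without a separate test set.
print(f"OOB score: {first.oob_score_:.3f}")
```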