The Random Forest algorithm can be described in the following conceptual steps: select k features randomly from the dataset and build a decision tree from those features, where k < m (the total number of features). Repeat this n times in order to obtain n decision trees built from different random combinations of k features. Take each […]
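The steps above can be sketched with scikit-learn's `RandomForestClassifier`; the dataset and the values of n and k here are illustrative assumptions, not from the post. Note that scikit-learn samples the k candidate features at every split rather than once per tree, a common variant of the same idea.

```python
# Illustrative sketch of the Random Forest steps; data and parameters are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset with m = 8 features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# n = 25 trees; k = 3 features considered at each split (scikit-learn's variant)
forest = RandomForestClassifier(n_estimators=25, max_features=3, random_state=0)
forest.fit(X, y)

n_trees = len(forest.estimators_)   # number of decision trees built
accuracy = forest.score(X, y)       # training accuracy
```

Each fitted tree is available in `forest.estimators_`, and predictions are aggregated across all of them.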

# Data Science

## Decision Trees in Machine Learning

A decision tree is a popular supervised learning algorithm that can handle both classification and regression problems. In both cases, the algorithm breaks a dataset down into smaller subsets using if-then-else decision rules over the features of the data. The general idea of a decision tree is that each feature is evaluated by the […]
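A minimal sketch of fitting such a tree with scikit-learn; the synthetic dataset and the depth limit are assumptions for illustration.

```python
# Minimal decision-tree sketch; the data and max_depth value are assumptions.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# The tree learns a hierarchy of if-then-else rules over the features
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

depth = tree.get_depth()   # depth of the learned rule hierarchy
preds = tree.predict(X)
```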

## Logistic Regression in Python using scikit-learn Package

Using the scikit-learn package in Python, we can fit and evaluate a logistic regression algorithm with a few lines of code. For binary classification problems, the library also provides useful metrics to evaluate model performance, such as the confusion matrix, the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). Hyperparameter Tuning in Logistic […]
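A sketch of that workflow, assuming a synthetic binary dataset and default hyperparameters:

```python
# Fit and evaluate logistic regression with scikit-learn; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

cm = confusion_matrix(y_test, y_pred)      # 2x2 matrix for a binary problem
auc = roc_auc_score(y_test, y_prob)        # Area Under the ROC Curve
```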

## Logistic Regression

In machine learning, the Logistic Regression algorithm is used for classification problems. It provides an output that we can interpret as the probability that a new observation belongs to a certain class. Logistic regression is generally used to classify binary classes, but it works on multiclass and ordinal problems too. Logistic regression estimates a continuous quantity […]
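The probability interpretation comes from the logistic (sigmoid) function, which squashes a real-valued linear score into the interval (0, 1). The coefficient values below are hypothetical, purely for illustration.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A linear score z = b0 + b1*x squashed into a probability
# (hypothetical coefficients b0 = 0.5, b1 = 1.2, input x = 2.0)
p = sigmoid(0.5 + 1.2 * 2.0)   # ≈ 0.948
```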

## Multiple Linear Regression

The multiple linear regression algorithm states that a response y can be estimated with a set of input features X and an error term ε. The model can be expressed with the following mathematical equation: y = βᵀX + ε, where βᵀX is the matrix notation of the equation, β, X ∈ ℝ^(p+1) and ε ~ N(μ, σ²). βᵀ (the transpose of β) […]
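A sketch of estimating β by ordinary least squares with NumPy; the true coefficients, sample size, and noise scale are assumptions used to generate synthetic data.

```python
# Estimate beta in y = X beta + eps by ordinary least squares; data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
# Prepend a column of ones so beta[0] plays the role of the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)   # eps ~ N(0, 0.1^2)

# Least-squares estimate of beta
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With low noise and n = 100 observations, `beta_hat` recovers the true coefficients closely.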

## Supervised Learning Models

As we pointed out earlier, both classification and regression models belong to the field of supervised learning. These models are characterized by having a group of features, or independent variables, and a target variable, which is the variable that the model aims to predict. This target variable is called the labelled data and is the […]

## Bias Variance Trade Off

An interesting property of a machine learning model is its capacity to predict or categorize new, unseen data (data that was not used in training the model). For this reason, the important measure is the MSE on the test data, referred to as the test MSE. The goal is to choose a model where the […]
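A sketch of computing the test MSE on held-out data; the synthetic regression data and the choice of a linear model are assumptions for illustration.

```python
# Compute test MSE on data the model never saw during fitting; data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Test MSE: the error measured on the held-out observations
test_mse = mean_squared_error(y_test, model.predict(X_test))
```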

## Model Selection in Machine Learning

Model selection refers to choosing the best statistical machine learning model for a particular problem. For this task we need to compare the relative performance between models. Therefore the loss function, and the metric that represents it, become fundamental for selecting the right, non-overfitted model. We can state a machine learning supervised problem with […]
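One common way to compare the relative performance of candidate models is cross-validation; the excerpt does not name a specific method, so this sketch with scikit-learn's `cross_val_score` on synthetic data is an assumption.

```python
# Compare two candidate models on the same cross-validation folds; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

lr_score = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
dt_score = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

# Pick the model with the better average held-out score
best = "logistic regression" if lr_score >= dt_score else "decision tree"
```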

## Evaluate Model Performance – Loss Function

The predictions that the model returns are compared with the real observations of the variable to obtain a measure of model performance. In order to evaluate model performance, both classification and regression problems need to define a function, called the loss function, that computes the error of the model. This loss function is […]
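Two common loss functions, written out by hand so the comparison of predictions against real observations is explicit: mean squared error for regression and log loss (cross-entropy) for binary classification. The toy values are assumptions.

```python
import math

def mse(y_true, y_pred):
    """Squared-error loss averaged over the observations (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, y_prob):
    """Cross-entropy loss for binary classification (labels 0/1, probabilities)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob)) / len(y_true)

reg_loss = mse([3.0, 1.0], [2.5, 1.5])      # 0.25
clf_loss = log_loss([1, 0], [0.9, 0.2])     # ≈ 0.164
```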

## Train-Test Datasets in Machine Learning

Once we have cleaned the data and selected the features for building the model, the next step is to generate the train and test datasets. We will divide our data into two different datasets, namely the training and testing datasets. The model will be built using the training set and then […]
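That split can be sketched with scikit-learn's `train_test_split`; the synthetic data and the 80/20 split ratio are assumptions.

```python
# Split data into training and testing sets; data and split ratio are assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)

# 80% of rows for training the model, 20% held out for testing it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

sizes = (X_train.shape[0], X_test.shape[0])   # (80, 20)
```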