Unsupervised learning models work with features that are not associated with a response. These algorithms do not use labelled data, because their interest lies in the attributes of the features themselves.
In unsupervised learning there is no supervision during training, because the independent variables or features (x1, x2, x3, ..., xn) are not paired with a response (y). The goal is to model the underlying structure or distribution of the data in order to learn more about it.
Many machine learning problems rely on unsupervised techniques because labelling all the data can be very expensive or time consuming. In other cases the data is so high-dimensional that fitting a supervised model to it directly could lead to overfitting and poor test performance.
Some examples of unsupervised learning problems in quantitative finance are the following:
- Portfolio/asset clustering
- Market regime identification
- Trading signal generation with natural language processing
There are two important techniques for unsupervised learning models: Dimensionality Reduction and Cluster Analysis.
Some machine learning problems have a large number of correlated features in a high-dimensional space. Dimensionality reduction is the process of reducing the number of features under consideration by obtaining a smaller set of principal features.
The most common method for Dimensionality Reduction is Principal Component Analysis (PCA). This method reduces the dimensions of the data by transforming the original feature space into a set of linearly uncorrelated variables called principal components.
The principal components can be found as the eigenvectors of the covariance matrix of the data. The procedure keeps the components along which the data varies the most and discards those with little variation. (Each eigenvector gives a direction in feature space, and its eigenvalue gives the variance of the data along that direction. This is useful when the number of features is high and the dimension must be reduced to make the dataset manageable.)
In quantitative finance, this technique could be applied to a large number of correlated stocks in order to reduce their dimensionality by looking at a smaller set of uncorrelated factors.
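The eigenvector view of PCA described above can be sketched with NumPy. This is a minimal illustration, not a production implementation: the returns are synthetic, and the number of stocks, the two hidden factors, and every variable name are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 days of returns for 6 stocks driven by 2 common factors.
n_days, n_stocks, n_factors = 500, 6, 2
factors = rng.normal(size=(n_days, n_factors))
loadings = rng.normal(size=(n_factors, n_stocks))
noise = 0.1 * rng.normal(size=(n_days, n_stocks))
returns = factors @ loadings + noise

# Center each stock's returns, then estimate the covariance matrix.
centered = returns - returns.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order,
# so reorder to put the highest-variance directions first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the top k principal components and project the data onto them.
k = 2
components = eigenvectors[:, :k]
scores = centered @ components  # shape (n_days, k)

# Share of total variance retained, and the covariance of the projected data.
explained_ratio = eigenvalues[:k].sum() / eigenvalues.sum()
score_cov = np.cov(scores, rowvar=False)

print("explained variance ratio:", explained_ratio)
print("covariance of the scores:\n", score_cov)
```

The off-diagonal entries of `score_cov` should be zero up to floating-point error, which is what "linearly uncorrelated" means for the principal components. Because the synthetic returns are built from two factors plus small noise, two components are enough to retain most of the variance here; real stock returns will generally need more care in choosing `k`.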
Another important unsupervised learning technique is Cluster Analysis. Its goal is to assign a cluster label to each observation of a dataset so that observations with similar properties are grouped together. In quantitative finance, clustering is used to identify assets with similar characteristics, which is useful when constructing diversified portfolios.
A key element of cluster analysis is defining the number of clusters k into which the data will be partitioned. The scikit-learn library provides a straightforward way to partition the data using the K-means algorithm. First we describe the inner steps of the algorithm and then we use an example. The algorithm performs the following steps:
- Pick k cluster centers (centroids) randomly. The number of clusters is given by the k parameter.
- Assign each observation of the dataset to the nearest cluster by using the Euclidean Distance between the observation and each centroid.
- Compute the new centroid (cluster mean) of each cluster by taking the average of all the points assigned to that cluster.
- Repeat steps 2 and 3 until none of the cluster assignments change, that is, until the clusters remain stable.
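The four steps above can be sketched directly in NumPy. This is a minimal illustration under stated assumptions: the function name `k_means`, its parameters, the choice to draw initial centroids from the observations themselves, and the two synthetic blobs are all made up for the example, and the sketch omits refinements that library implementations add (such as smarter initialization and multiple restarts).

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain K-means on an (n_samples, n_features) float array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every observation to every centroid,
        # then assign each observation to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once no assignment changes between iterations.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of the points assigned to it.
        # A centroid that ends up with no points is left where it was.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical data: two synthetic blobs in two dimensions.
data_rng = np.random.default_rng(1)
X = np.vstack([
    data_rng.normal(0.0, 0.5, size=(50, 2)),
    data_rng.normal(5.0, 0.5, size=(50, 2)),
])
labels, centroids = k_means(X, k=2)
print("centroids:\n", centroids)
```

Because the starting centroids are random, different seeds can converge to different partitions; K-means only guarantees a local optimum, which is why it is commonly run several times with different initializations.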
This algorithm can be used for more than one purpose. One good reason to use it is to identify different clusters of stocks, which can support a portfolio diversification strategy that spreads holdings across stocks in different clusters.
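As a rough sketch of the stock-clustering idea, the example below runs scikit-learn's `KMeans` on two features per stock. Everything about the data is an assumption made for illustration: the 30 stocks are synthetic, the three volatility groups are made up, and the choice of annualized mean return and volatility as features, three clusters, and 252 trading days per year are illustrative rather than recommended.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical universe: 30 stocks in three made-up volatility groups.
n_days = 250
daily_vols = np.repeat([0.005, 0.015, 0.03], 10)
returns = rng.normal(0.0, daily_vols, size=(n_days, 30))

# One row per stock: annualized mean return and annualized volatility.
features = np.column_stack([
    returns.mean(axis=0) * 252,
    returns.std(axis=0) * np.sqrt(252),
])

# Standardize so that neither feature dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(features)

# n_init=10 reruns K-means from 10 random starts and keeps the best result.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = model.fit_predict(scaled)

# Illustrative diversification step: take the first stock found in each cluster.
picks = [int(np.flatnonzero(cluster_labels == c)[0]) for c in range(3)]
print("cluster labels:", cluster_labels)
print("one stock index per cluster:", picks)
```

The clusters found this way need not coincide with the volatility groups used to generate the data, since the sample mean return is a noisy feature. In practice the feature set and the number of clusters would have to be chosen and validated for the problem at hand.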