What is Principal Component Analysis (With Steps)

Imagine you are an investment analyst with access to a large number of investment options such as stocks, bonds, commodities, ETFs, etc. Your goal is to create a diversified portfolio with maximum returns and minimum risk.

While analyzing these investments, one problem you’ll face is that many of these assets are correlated with each other, and their risk and return are affected by common factors such as market conditions, economic indicators, and industry trends. To create the most diversified portfolio, you need to account for the correlation between these assets.

If you’re considering asset classes such as stocks, bonds, commodities, and ETFs, then some of the common factors affecting them could be the underlying market returns, interest rate sensitivity, liquidity risk, currency fluctuations, and underlying volatility. The actual factors will depend on your dataset. This is a dataset with high dimensionality, with each factor being one dimension. As humans, we can’t easily visualize more than three dimensions, so analyzing and visualizing data with this many dimensions is not easy. What we want to do is work with fewer dimensions, perhaps 2 or 3. This is where Principal Component Analysis comes in.

What is Principal Component Analysis

Principal Component Analysis (PCA) is a powerful unsupervised statistical technique that helps us reduce dimensionality and visualize multivariate data. PCA transforms our dataset into a new set of orthogonal* variables, which we call principal components.

* Orthogonality here means that the variables are uncorrelated: their covariance is zero. In simple terms, knowing the value of one variable does not help you linearly predict the value of another. Note that this is weaker than full statistical independence, because uncorrelated variables can still be related in nonlinear ways.

These principal components are uncorrelated with each other and are ranked by the amount of variance they explain in the dataset.

The first principal component accounts for the most possible variance in the data, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
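For readers who prefer notation, the ranking described above can be written as an optimization problem. This is a standard formulation rather than something stated in the article; the symbols $\Sigma$ (the covariance matrix of the data), $w_k$ (the weight vector of the k-th component), and $\lambda_k$ (the variance it explains) are introduced here for illustration.

```latex
w_1 = \arg\max_{\|w\| = 1} \; w^\top \Sigma\, w,
\qquad
w_k = \arg\max_{\substack{\|w\| = 1 \\ w \perp w_1, \dots, w_{k-1}}} \; w^\top \Sigma\, w,
\qquad
\lambda_k = w_k^\top \Sigma\, w_k .
```

The solutions $w_k$ turn out to be the eigenvectors of $\Sigma$, and the $\lambda_k$ are the matching eigenvalues, which is why PCA is built around an eigendecomposition.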

This technique is beneficial for a variety of financial analyses where many variables could be interdependent. Using PCA, we reduce the dimensions without losing much information. This makes the dataset easier to work with.

After performing Principal Component Analysis (PCA), it's common to select the first few components that explain the most variance in the dataset for further analysis. PCA also makes visualizing this complex data easier.

Steps to Perform PCA

Performing PCA generally involves the following steps:

Step 1: Normalize Data

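The first step is to normalize (more precisely, standardize) the data so that each variable has a mean of 0 and a standard deviation of 1. This is important because different variables are measured on different scales. Some may be in dollar values, while others may be in percentage terms. PCA is sensitive to the scale of the variables: without this step, variables with larger numeric ranges would dominate the principal components simply because their variance is larger.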
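As a minimal sketch of this step in Python with NumPy, the snippet below rescales each column to mean 0 and standard deviation 1. The data values and the helper name `standardize` are made up for illustration, and the sample standard deviation (`ddof=1`) is one reasonable choice among others.

```python
import numpy as np

# Hypothetical data: 6 observations of 3 variables on very different scales
# (for example a price in dollars, a return in %, and a traded volume).
X = np.array([
    [100.0, 1.2, 50000.0],
    [102.0, 0.8, 52000.0],
    [98.0, 1.5, 47000.0],
    [105.0, 0.5, 58000.0],
    [99.0, 1.1, 49000.0],
    [101.0, 0.9, 51000.0],
])


def standardize(data):
    """Center each column at 0 and scale it to unit standard deviation."""
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)  # sample standard deviation (assumption)
    return (data - mean) / std


Z = standardize(X)
print(Z.mean(axis=0))          # approximately 0 for every column
print(Z.std(axis=0, ddof=1))   # approximately 1 for every column
```

After this step every column contributes on the same footing, regardless of its original units.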

Step 2: Compute Covariance Matrix

The next step is to compute the covariance matrix to understand how these variables are related to each other.
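One way to sketch this step with NumPy is shown below. The data is synthetic random noise generated only so the example runs on its own, and the variable names are illustrative. Because the columns are standardized first, the resulting covariance matrix is also the correlation matrix of the original data.

```python
import numpy as np

# Synthetic data (assumption): 200 observations of 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Standardize each column to mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance matrix of the standardized data: Z^T Z / (n - 1).
n = Z.shape[0]
cov = Z.T @ Z / (n - 1)

print(cov.shape)     # one row and one column per variable
print(np.diag(cov))  # variances of standardized columns, all approximately 1
```

The same matrix can be obtained with `np.cov(Z, rowvar=False)`; the explicit formula is written out here only to show what is being computed.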

Step 3: Compute Eigenvectors and Eigenvalues

Once we have the covariance matrix, we compute its eigenvectors and eigenvalues. Each eigenvector represents a principal component: it defines a direction in the space of the original variables, expressed as a set of weights on those variables. Each eigenvector has a corresponding eigenvalue, which represents the variance explained by that component, that is, the variance present in the data along that direction. The eigenvector with the highest eigenvalue is the first principal component, the one with the second highest eigenvalue is the second principal component, and so on.
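The following sketch shows one way to do this with NumPy. The 3x3 covariance matrix is a made-up example, not data from the article. `np.linalg.eigh` is used because a covariance matrix is symmetric; it returns eigenvalues in ascending order, so they are re-sorted to put the first principal component first.

```python
import numpy as np

# Hypothetical covariance matrix of three standardized variables.
cov = np.array([
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.2],
    [0.3, 0.2, 1.0],
])

# eigh is intended for symmetric matrices and returns ascending eigenvalues.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Re-order so the largest eigenvalue (first principal component) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # each column is one eigenvector

print(eigenvalues)         # variance explained by each component
print(eigenvectors[:, 0])  # weights defining the first principal component
```

Each column of `eigenvectors` has unit length and is orthogonal to the others, and the eigenvalues add up to the total variance (here 3, one unit per standardized variable).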

Step 4: Select Principal Components

This step involves selecting the principal components for further analysis. Usually, the first few principal components will explain a large amount of the variance in the data. We want to keep the smallest number of components that together explain most of the variance. Various criteria are used to decide how many components to retain, such as Kaiser's rule (keep components whose eigenvalue is greater than 1) or the scree plot method (plot the eigenvalues in descending order and look for the point where the curve levels off).
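As a rough illustration of two selection criteria, the sketch below works from a made-up set of eigenvalues. It computes the share of variance each component explains, then picks the number of components using a cumulative-variance threshold and using Kaiser's rule. The 85% threshold is an arbitrary assumption for the example, not a recommendation.

```python
import numpy as np

# Hypothetical eigenvalues for 5 standardized variables, sorted descending.
eigenvalues = np.array([2.5, 1.4, 0.6, 0.3, 0.2])

# Share of total variance explained by each component, and the running total.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Criterion 1: smallest number of components reaching a variance threshold.
threshold = 0.85  # assumed cut-off for illustration
k_threshold = int(np.argmax(cumulative >= threshold)) + 1

# Criterion 2 (Kaiser's rule): keep components with eigenvalue above 1.
k_kaiser = int(np.sum(eigenvalues > 1.0))

print(explained_ratio)
print(k_threshold, k_kaiser)
```

The two criteria need not agree (here they give 3 and 2 components respectively), which is one reason the choice is usually checked against a scree plot and the needs of the analysis.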

Step 5: Transform the Data

The final step is to re-express the data from its original form in the new form defined by the principal components. The data in its new form is simpler to understand and analyze because it has fewer variables (the principal components). To perform this transformation, you multiply the normalized data by the matrix whose columns are the selected eigenvectors. This changes the original data points into a new set of values based on the selected principal components.

Note that this newly transformed data doesn’t have the same meaning as the original data. So, reading these values directly isn’t very useful. The value of this data lies in the structure and relationships with other data points.
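To make the transformation concrete, here is a small end-to-end sketch in NumPy. The data is synthetic (four variables driven by two hidden factors plus noise), and keeping `k = 2` components is an assumption made for the example. The final check illustrates the point above: the new columns are uncorrelated, and their variances are the eigenvalues.

```python
import numpy as np

# Synthetic correlated data (assumption): 300 observations of 4 variables
# generated from 2 hidden factors plus a little noise.
rng = np.random.default_rng(42)
factors = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 4))
X = factors @ loadings + 0.1 * rng.normal(size=(300, 4))

# Standardize, compute the covariance matrix, and decompose it.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Keep the first k eigenvectors and project the standardized data onto them.
k = 2
W = eigenvectors[:, :k]  # shape (4, k)
scores = Z @ W           # shape (300, k): the data in principal-component form

# The covariance of the scores is diagonal, with the eigenvalues on the diagonal.
print(np.round(np.cov(scores, rowvar=False), 6))
print(np.round(eigenvalues[:k], 6))
```

Each row of `scores` is one original observation described by 2 numbers instead of 4.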

How to Perform PCA

Principal component analysis can be performed using most statistical tools and programming languages such as R, Python, MATLAB, SAS, SPSS, Stata, Julia, and Excel.
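In Python, for instance, a common route is the scikit-learn library, which wraps the steps above in a few calls. The sketch below assumes scikit-learn and NumPy are installed and uses synthetic data; the choice of 2 components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 250 observations of 5 correlated variables.
rng = np.random.default_rng(0)
factors = rng.normal(size=(250, 2))
X = factors @ rng.normal(size=(2, 5)) + 0.2 * rng.normal(size=(250, 5))

# Standardize the variables, then fit PCA and transform the data.
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(Z)

print(scores.shape)                   # (250, 2)
print(pca.explained_variance_ratio_)  # share of variance per component
print(pca.components_.shape)          # (2, 5): one row of weights per component
```

Here `explained_variance_ratio_` plays the role of the eigenvalues expressed as shares of total variance, and `components_` holds the eigenvectors as rows.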
