Handling Categorical Data and Unique Values using pandas
Categorical variables represent a qualitative property of a data element, such as a company's sector or industry in a financial context. Understanding how these non-numerical values are distributed is important, because they often reveal the structure and segmentation of your dataset.
To analyze categorical data, we will load a new dataset (sp_500_constituents.csv). It lists the S&P 500 companies along with each company's sector and industry.
    import pandas as pd

    # Load S&P 500 data from a CSV file
    sp_df = pd.read_csv('../data/sp_500_constituents.csv')
    sp_df.head()
As you can see in this dataframe, we have a lot of categorical data. We’re interested in two columns: ‘GICS Sector’ and ‘GICS Sub-Industry’. Let’s do some analysis.
Analyzing Sector Distribution
The .value_counts() method in pandas counts how many times each unique value appears in a column and returns the results sorted from most to least frequent. For financial datasets, where companies are categorized by sector, this provides a snapshot of the dataset's composition and shows how well each sector is represented in your analysis. Keep in mind that the count measures how many companies fall in each sector. It says nothing by itself about market capitalization or trading activity, although a sector with many constituents may be worth checking on those measures as well.
    # Display the frequency of each sector
    sector_counts = sp_df['GICS Sector'].value_counts()
    print(sector_counts)
From the output, analysts can quickly see which sectors dominate the dataset. The accompanying Jupyter notebook also includes sample code that adds a new column containing the percentage for each sector.
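The notebook's code is not reproduced here, but one way to get percentages alongside the counts is sketched below. It assumes a small made-up DataFrame (sample_df, with hypothetical ticker symbols) in place of sp_df, and the names sector_summary, count, and percent are illustrative choices rather than anything from the original notebook.

```python
import pandas as pd

# Hypothetical sample standing in for sp_df (the real file has one row per company).
sample_df = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "GICS Sector": [
        "Industrials",
        "Information Technology",
        "Industrials",
        "Energy",
        "Industrials",
    ],
})

# Absolute number of companies per sector.
counts = sample_df["GICS Sector"].value_counts()

# normalize=True returns each sector's share of all rows (fractions that sum to 1).
shares = sample_df["GICS Sector"].value_counts(normalize=True)

# Combine both into one table; pandas aligns the two Series on the sector label.
sector_summary = pd.DataFrame({
    "count": counts,
    "percent": (shares * 100).round(1),
})
print(sector_summary)
```

With the real data, replacing sample_df with sp_df should give the same kind of table for all sectors in the file.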
Unique Industry Types
While the sector tells us the broad category a company falls into, the industry is far more specific. Using .unique(), we can extract an array of all the distinct industry types present in the DataFrame, in the order they first appear. This shows the breadth of your dataset and confirms that granular analysis, such as a sector-to-industry breakdown, is possible.
    # Get unique industries
    unique_industries = sp_df['GICS Sub-Industry'].unique()
    print(unique_industries)
Seeing the array of industries, an analyst can begin to ask targeted questions: Are certain industries underrepresented? Does the dataset reflect the current economic landscape, where perhaps renewable energy companies are on the rise?
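A question such as whether some industries are underrepresented can be approached by counting distinct industries and grouping them by sector. The following is a minimal sketch under stated assumptions: sample_df is a small made-up table using the same two column names as sp_df, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the column names match those used in sp_df.
sample_df = pd.DataFrame({
    "GICS Sector": ["Energy", "Energy", "Energy", "Utilities", "Utilities"],
    "GICS Sub-Industry": [
        "Integrated Oil & Gas",
        "Oil & Gas Exploration & Production",
        "Integrated Oil & Gas",
        "Electric Utilities",
        "Electric Utilities",
    ],
})

# Total number of distinct industries in the table.
n_industries = sample_df["GICS Sub-Industry"].nunique()

# Number of distinct industries within each sector.
per_sector = sample_df.groupby("GICS Sector")["GICS Sub-Industry"].nunique()

# Number of companies in each (sector, industry) pair.
industry_sizes = sample_df.groupby(["GICS Sector", "GICS Sub-Industry"]).size()

print(n_industries)
print(per_sector)
print(industry_sizes)
```

Industries with very small group sizes in industry_sizes are the ones to look at first when asking about underrepresentation.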
Practical Applications
Comparative Analysis: Using .value_counts(), we can compare the relative frequency of companies across sectors, potentially identifying overrepresented or underrepresented sectors in our financial analysis.
Sector-Specific Strategies: By understanding the unique industries within each sector, we can tailor our analysis to identify sector-specific trends, challenges, and opportunities.
Data Quality Checks: The list of unique values can also serve as a data quality check, ensuring that all expected industries are present and correctly labeled without typographical errors, which are common in large datasets.
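The data quality check described above can be sketched as follows. This is an illustration, not a complete validation routine: the labels Series contains deliberately messy made-up values, and the expected set is a hypothetical list of valid labels that you would supply yourself.

```python
import pandas as pd

# Made-up labels with deliberate inconsistencies (case, trailing space, typo).
labels = pd.Series(
    ["Health Care", "health care", "Health Care ", "Financials", "Finacials"],
    name="GICS Sector",
)

# Hypothetical set of labels we expect to see.
expected = {"Health Care", "Financials"}

# Normalize whitespace and case so that trivially different spellings collapse.
normalized = labels.str.strip().str.lower()

# For each normalized label, count how many distinct raw spellings map to it.
variant_counts = labels.groupby(normalized).nunique()
inconsistent = variant_counts[variant_counts > 1]

# Raw labels that are not in the expected set at all.
unexpected = sorted(set(labels.unique()) - expected)

print(inconsistent)
print(unexpected)
```

Normalization only catches differences in case and whitespace. A genuine misspelling such as "Finacials" shows up only in the comparison against the expected set, which is why the sketch includes both checks.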