Handling Categorical Data and Unique Values using pandas

Categorical variables are those that represent a qualitative property of the data element, such as sector or industry in a financial context. Understanding how these non-numerical data are distributed is important, as they often hold key insights into the structure and segmentation of your dataset.

To analyze categorical data, we will load a new dataset (sp_500_constituents.csv). It lists the S&P 500 companies along with each company's sector and industry.

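# Import pandas (skip this if it was already imported earlier in the notebook)
import pandas as pd

# Load S&P 500 data from a CSV file
sp_df = pd.read_csv('../data/sp_500_constituents.csv')

# Preview the first five rows
sp_df.head()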

As you can see in this dataframe, we have a lot of categorical data. We’re interested in two columns: ‘GICS Sector’ and ‘GICS Sub-Industry’. Let’s do some analysis.

Analyzing Sector Distribution

The .value_counts() method in pandas is a quick way to count how often each unique value occurs in a column. For financial datasets, where companies are categorized by sector, this provides a snapshot of the dataset's composition and shows how well each sector is represented in your analysis. Keep in mind that these counts measure the number of companies only: a sector with many constituents is not necessarily the largest by market capitalization or the most actively traded.

# Display the frequency of each sector
sector_counts = sp_df['GICS Sector'].value_counts()
print(sector_counts)

From the output, analysts can quickly ascertain which sectors dominate the dataset. In the accompanying Jupyter notebook, you will also find sample code that adds a new column containing the percentage for each sector.
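One possible way to compute such percentages is sketched below. It is not necessarily the approach used in the notebook. The small sample_df is made-up data standing in for sp_df, and the column names Count and Percent are illustrative choices.

```python
import pandas as pd

# Made-up sample standing in for sp_df (assumption: same 'GICS Sector' column name)
sample_df = pd.DataFrame({
    "GICS Sector": [
        "Information Technology",
        "Information Technology",
        "Financials",
        "Financials",
        "Energy",
    ],
})

# Absolute number of companies per sector
sector_counts = sample_df["GICS Sector"].value_counts()

# normalize=True returns shares between 0 and 1; multiply by 100 for percentages
sector_share = sample_df["GICS Sector"].value_counts(normalize=True) * 100

# Combine both into one table; the assignment aligns on the sector labels
sector_summary = sector_counts.to_frame(name="Count")
sector_summary["Percent"] = sector_share.round(1)

print(sector_summary)
```

On the real data, replacing sample_df with sp_df should give the same kind of table for all sectors.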

Unique Industry Types

While the sector tells us the broad category a company falls into, the industry can be far more specific. Using .unique(), we can extract an array of all the unique industry types present in a DataFrame column. This is essential for recognizing the breadth of your dataset and for ensuring that granular analysis, such as a sector-to-industry breakdown, is possible.

# Get unique industries
unique_industries = sp_df['GICS Sub-Industry'].unique()
print(unique_industries)

Seeing the array of industries, an analyst can begin to ask targeted questions: Are certain industries underrepresented? Does the dataset reflect the current economic landscape, where perhaps renewable energy companies are on the rise?
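As a starting point for questions like these, the sketch below counts the distinct industries, sorts them for easier reading, and lists the industries represented by a single company. The sample_df rows are made-up illustrative data, not the real constituents file.

```python
import pandas as pd

# Made-up sample standing in for sp_df (assumption: same 'GICS Sub-Industry' column name)
sample_df = pd.DataFrame({
    "GICS Sub-Industry": [
        "Semiconductors",
        "Application Software",
        "Semiconductors",
        "Regional Banks",
        "Diversified Banks",
    ],
})

# Number of distinct industries (missing values are ignored by default)
n_industries = sample_df["GICS Sub-Industry"].nunique()

# Alphabetically sorted list, which is easier to scan than the raw array
sorted_industries = sorted(sample_df["GICS Sub-Industry"].dropna().unique())

# Industries that appear only once, i.e. candidates for being underrepresented
industry_counts = sample_df["GICS Sub-Industry"].value_counts()
single_company = industry_counts[industry_counts == 1]

print(n_industries)
print(sorted_industries)
print(single_company)
```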

Practical Applications

  • Comparative Analysis: Using .value_counts(), we can compare the relative frequency of companies across sectors, potentially identifying overrepresented or underrepresented sectors in our financial analysis.
  • Sector-Specific Strategies: By understanding the unique industries within each sector, we can tailor our analysis to identify sector-specific trends, challenges, and opportunities.
  • Data Quality Checks: The list of unique values can also serve as a data quality check, ensuring that all expected industries are present and correctly labeled without typographical errors, which are common in large datasets.
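The sector-specific and data quality ideas above can be sketched as follows. This is a minimal illustration on made-up data, in which one label is deliberately misspelled with different casing and a trailing space. The whitespace and case normalization is just one simple heuristic for spotting inconsistent labels and will not catch every kind of typo.

```python
import pandas as pd

# Made-up sample standing in for sp_df; "regional banks " is an intentional bad label
sample_df = pd.DataFrame({
    "GICS Sector": [
        "Information Technology",
        "Information Technology",
        "Financials",
        "Financials",
        "Financials",
    ],
    "GICS Sub-Industry": [
        "Semiconductors",
        "Application Software",
        "Regional Banks",
        "regional banks ",
        "Diversified Banks",
    ],
})

# Sector-to-industry breakdown: number of distinct industries within each sector
industries_per_sector = sample_df.groupby("GICS Sector")["GICS Sub-Industry"].nunique()

# Data quality check: labels that collide once whitespace and case are normalized
labels = pd.Series(sample_df["GICS Sub-Industry"].unique())
normalized = labels.str.strip().str.lower()
suspect_labels = labels[normalized.duplicated(keep=False)].tolist()

print(industries_per_sector)
print(suspect_labels)
```

Here the Financials count is inflated by the inconsistent label, which shows why checking the unique values before a breakdown is worthwhile.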
