Handling Categorical Data and Unique Values using pandas
Categorical variables are those that represent a qualitative property of the data element, such as sector or industry in a financial context. Understanding how these non-numerical data are distributed is important, as they often hold key insights into the structure and segmentation of your dataset.
To analyze categorical data, we will load a new dataset, sp_500_constituents.csv. It lists the S&P 500 companies along with each company’s sector and industry.
# Load S&P 500 data from a CSV file
import pandas as pd

sp_df = pd.read_csv('../data/sp_500_constituents.csv')
sp_df.head()
As you can see in this DataFrame, much of the data is categorical. We’re interested in two columns: ‘GICS Sector’ and ‘GICS Sub-Industry’. Let’s do some analysis.
Analyzing Sector Distribution
The .value_counts() method in pandas is a quick way to count how often each unique value appears in a column, sorted from most to least frequent by default. For financial datasets, where companies are categorized by sector, this provides a snapshot of the dataset's composition and shows how heavily each sector is represented in your analysis. Keep in mind that the count reflects the number of companies in a sector; on its own it does not measure market cap or trading activity in that sector.
# Display the frequency of each sector
sector_counts = sp_df['GICS Sector'].value_counts()
print(sector_counts)
From the output, analysts can quickly ascertain which sectors dominate the dataset. In the accompanying Jupyter notebook, you will also find sample code that adds a new column containing the percentage for each sector.
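One possible way to compute sector percentages is sketched below. This is not necessarily the notebook's code: it uses a small made-up DataFrame (sample_df, with hypothetical ticker symbols) in place of sp_df so that it runs on its own, and relies on the normalize=True option of .value_counts().

```python
import pandas as pd

# Hypothetical stand-in for sp_df; the real data comes from sp_500_constituents.csv
sample_df = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "GICS Sector": ["Financials", "Energy", "Financials", "Health Care", "Financials"],
})

# Absolute number of companies per sector, most frequent first
sector_counts = sample_df["GICS Sector"].value_counts()

# normalize=True returns proportions instead of counts; scale to percentages
sector_pct = (sample_df["GICS Sector"].value_counts(normalize=True) * 100).round(1)

# Combine counts and percentages into one summary table, aligned on sector name
sector_summary = pd.DataFrame({"count": sector_counts, "percent": sector_pct})
print(sector_summary)
```

With the real data, replacing sample_df with sp_df should give the same kind of table for all sectors.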
Unique Industry Types
While the sector tells us the broad category a company falls into, the industry can be far more specific. Using .unique(), we can extract an array of all the unique industry types present in the DataFrame, in order of first appearance. This is essential for recognizing the breadth of your dataset and for confirming that granular analysis, such as a sector-to-industry breakdown, is possible.
# Get unique industries
unique_industries = sp_df['GICS Sub-Industry'].unique()
print(unique_industries)
Seeing the array of industries, an analyst can begin to ask targeted questions: Are certain industries underrepresented? Does the dataset reflect the current economic landscape, where perhaps renewable energy companies are on the rise?
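To make the sector-to-industry breakdown concrete, here is a minimal sketch. It assumes a small illustrative DataFrame (sample_df) with the same two column names as sp_df; the rows are made up for the example. It counts distinct industries with .nunique() and then counts them per sector with groupby().

```python
import pandas as pd

# Illustrative rows only; column names match the S&P 500 constituents file
sample_df = pd.DataFrame({
    "GICS Sector": ["Financials", "Financials", "Energy", "Energy", "Financials"],
    "GICS Sub-Industry": [
        "Diversified Banks",
        "Regional Banks",
        "Integrated Oil & Gas",
        "Integrated Oil & Gas",
        "Diversified Banks",
    ],
})

# Number of distinct industries in the whole dataset
n_industries = sample_df["GICS Sub-Industry"].nunique()

# Alphabetically sorted list of distinct industries (easier to scan than raw order)
sorted_industries = sorted(sample_df["GICS Sub-Industry"].dropna().unique())

# Number of distinct industries within each sector, largest first
industries_per_sector = (
    sample_df.groupby("GICS Sector")["GICS Sub-Industry"]
    .nunique()
    .sort_values(ascending=False)
)

print(n_industries)
print(sorted_industries)
print(industries_per_sector)
```

A sector with only one or two distinct industries may be too narrow to support a meaningful within-sector comparison.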
Practical Applications
- Comparative Analysis: Using .value_counts(), we can compare the relative frequency of companies across sectors, potentially identifying overrepresented or underrepresented sectors in our financial analysis.
- Sector-Specific Strategies: By understanding the unique industries within each sector, we can tailor our analysis to identify sector-specific trends, challenges, and opportunities.
- Data Quality Checks: The list of unique values can also serve as a data quality check, ensuring that all expected industries are present and correctly labeled without typographical errors, which are common in large datasets.
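The data quality check in the last point could look something like the sketch below. The messy labels and the expected set are hypothetical, invented purely to illustrate the idea: normalize whitespace and capitalization first, then compare the remaining unique labels against the labels you expect to see.

```python
import pandas as pd

# Hypothetical sector labels containing a stray space, a case mismatch, and a typo
raw = pd.Series(
    ["Energy", "energy ", "Financials", "Financials", "Finacials"],
    name="GICS Sector",
)

# Normalize whitespace and capitalization so trivial variants collapse together.
# Note: title-casing is a simplification and may alter labels with acronyms.
cleaned = raw.str.strip().str.title()

# Hypothetical set of labels we expect to find in this column
expected = {"Energy", "Financials"}

# Any label left outside the expected set is a candidate typo to inspect by hand
unexpected = sorted(set(cleaned.dropna().unique()) - expected)

print(raw.nunique(), cleaned.nunique())
print(unexpected)
```

Comparing the number of unique values before and after normalization gives a quick signal of how much label noise a column contains.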