Basic Data Structures in Pandas

Pandas is known for its powerful data structures, which are specifically designed for efficient data manipulation and analysis. The two primary structures provided by Pandas are the Series and DataFrame. They form the backbone of the library, allowing users to work with data in a way that is both intuitive and aligned with how data is structured in many real-world situations.

Members can download the data and code files from the sidebar or the course page.

Jupyter Notebook: Navigate to the 'notebooks' folder and launch the 'data_structures.ipynb' notebook within your Jupyter session.

If you're new to Jupyter, I recommend using this guide to familiarize yourself with its features and interface: JupyterLab User Guide.

Note: In a Jupyter notebook, although you have the flexibility to run code sections in any order, it's recommended to execute them from top to bottom. This ensures that all libraries are loaded and data transformations are applied in the correct sequence, helping to avoid errors and maintain the logical flow of your analysis.

Pandas Series

The Series is the simplest data structure in Pandas, representing a one-dimensional labeled array. You can think of it as a single column of data, along with an index that allows for both position-based and label-based access to elements. The flexibility of a Series makes it well-suited for representing individual data variables, especially when dealing with time series or any ordered dataset.

For example, when analyzing financial data, a Series can be used to represent the prices of a particular stock over a series of dates:

import pandas as pd

# A Series representing stock prices over a series of dates
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
stock_prices = pd.Series([150, 152, 155, 157, 156], index=dates, name='Stock Price')

In this example, the dates provide a DatetimeIndex for the Series, which is well suited to time-series data. The name attribute is an optional way to give the Series a label, which is particularly useful when it becomes part of a DataFrame or when plotting the data.
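As a minimal sketch of the two access styles mentioned earlier, the block below rebuilds the same Series so it can run on its own, then reads values by position with .iloc and by label with .loc. The variable names first_price, jan_3_price, first_three, and prices_frame are illustrative and not part of the course notebook.

```python
import pandas as pd

# Rebuild the Series from the example above so this block is self-contained
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
stock_prices = pd.Series([150, 152, 155, 157, 156], index=dates, name='Stock Price')

# Position-based access: the first element, regardless of its label
first_price = stock_prices.iloc[0]

# Label-based access: the element stored under a specific date
jan_3_price = stock_prices.loc[pd.Timestamp('2023-01-03')]

# Label-based slices include both endpoints, unlike positional slices
first_three = stock_prices.loc['2023-01-01':'2023-01-03']

# The Series name becomes the column name when converted to a DataFrame
prices_frame = stock_prices.to_frame()

print(first_price)
print(jan_3_price)
print(first_three)
print(prices_frame.columns)
```

Using .iloc and .loc explicitly, rather than plain square brackets, keeps it clear whether a lookup is by position or by label.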

We can now display the stock_prices series.

*Note on Display in Jupyter: In a Jupyter notebook, we can display pandas objects such as Series and DataFrames directly by writing their names as the last line of a cell, which leverages Jupyter's rich display format for better readability. However, in other environments like Python scripts, you should use print(dataframe) to display your DataFrames, as this method works outside of Jupyter's interactive environment. For this book, since we’re using Jupyter notebooks, we will use the direct method without calling print().*

Pandas DataFrame

The second data structure in pandas is the DataFrame, a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure. It's akin to a spreadsheet or SQL table and is arguably the most important data structure in pandas. A DataFrame consists of an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.), and it has both a row index and a column index. Each row represents one record.
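To make the idea of heterogeneous columns concrete, here is a small sketch. The records DataFrame and its Traded column are hypothetical and exist only for this illustration; the point is that each column carries its own dtype while all columns share one row index.

```python
import pandas as pd

# Hypothetical sample records, invented for illustration only
records = pd.DataFrame({
    'Ticker': ['AAPL', 'GOOG', 'MSFT'],     # strings
    'Close': [155.0, 2800.0, 230.0],        # floats
    'Volume': [1000000, 1200000, 750000],   # integers
    'Traded': [True, True, False],          # booleans
})

# Each column has its own dtype
print(records.dtypes)

# The row index (default 0, 1, 2) and the column index
print(records.index)
print(records.columns)

# One row is one record, returned as a Series labeled by column name
print(records.iloc[0])
```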

The power of the DataFrame lies in its ability to represent real-world data in a way that's both easy to understand and highly accessible. For instance, a DataFrame can hold a complete dataset of stock prices with additional data like opening and closing prices, trading volume, and stock tickers:

import pandas as pd

# DataFrame representing a week (5 days) of stock data
data = {
    'Ticker': ['AAPL', 'GOOG', 'MSFT', 'AAPL', 'GOOG', 'MSFT', 
               'AAPL', 'GOOG', 'MSFT', 'AAPL', 'GOOG', 'MSFT',
               'AAPL', 'GOOG', 'MSFT'],
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-01',
                            '2023-01-02', '2023-01-02', '2023-01-02',
                            '2023-01-03', '2023-01-03', '2023-01-03',
                            '2023-01-04', '2023-01-04', '2023-01-04',
                            '2023-01-05', '2023-01-05', '2023-01-05']),
    'Open': [150, 2750, 225, 152, 2775, 230, 153, 2780, 235, 155, 2790, 240, 157, 2800, 245],
    'Close': [155, 2800, 230, 158, 2825, 235, 160, 2830, 240, 162, 2840, 245, 165, 2850, 250],
    'Volume': [1000000, 1200000, 750000, 1100000, 1250000, 800000, 1050000, 1300000, 770000, 1150000, 1350000, 820000, 1200000, 1400000, 850000]
}
stock_week = pd.DataFrame(data)
stock_week.set_index(['Ticker', 'Date'], inplace=True)

In the provided code, data is a Python dictionary containing the sample stock data, with keys representing column names and their corresponding values as lists of data entries. The pd.DataFrame constructor is then used to create a pandas DataFrame from this dictionary.

In the stock_week DataFrame example, stocks are tracked across multiple attributes over several days, allowing for complex and realistic data analysis. The .set_index() method is used to define a MultiIndex based on Ticker and Date, showcasing how Pandas can use hierarchical indexing to represent more complex data relationships.

We can now display the contents of stock_week.
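As a rough sketch of how a MultiIndex on Ticker and Date can be queried, the block below builds a two-day subset of the same data (kept small so the results are easy to check by eye) and selects rows by one or both index levels. The names subset, aapl, aapl_jan_2_close, and jan_1 are illustrative. Sorting the index first is an optional step added here; it is not in the original example.

```python
import pandas as pd

# A two-day subset of the stock data, indexed by Ticker and Date
subset = pd.DataFrame({
    'Ticker': ['AAPL', 'GOOG', 'MSFT', 'AAPL', 'GOOG', 'MSFT'],
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-01',
                            '2023-01-02', '2023-01-02', '2023-01-02']),
    'Close': [155, 2800, 230, 158, 2825, 235],
}).set_index(['Ticker', 'Date']).sort_index()

# All rows for one ticker (outer level); the result is indexed by Date
aapl = subset.loc['AAPL']

# A single value identified by both index levels and a column name
aapl_jan_2_close = subset.loc[('AAPL', pd.Timestamp('2023-01-02')), 'Close']

# All tickers on one date (inner level); the result is indexed by Ticker
jan_1 = subset.xs(pd.Timestamp('2023-01-01'), level='Date')

print(aapl)
print(aapl_jan_2_close)
print(jan_1)
```

Selecting on the outer level with .loc and on the inner level with .xs are two common patterns; which is more convenient depends on the order of the index levels.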

In the next section, we will learn how to manipulate these data structures.
