Basic Data Structures in Pandas
Pandas is known for its powerful data structures, which are specifically designed for efficient data manipulation and analysis. The two primary structures provided by Pandas are the Series and DataFrame. They form the backbone of the library, allowing users to work with data in a way that is both intuitive and aligned with how data is structured in many real-world situations.
Members can download the data and code files from the sidebar or the course page.
Jupyter Notebook: Navigate to the 'notebooks' folder and launch the 'data_structures.ipynb' notebook within your Jupyter session.
If you're new to Jupyter, I recommend using this guide to familiarize yourself with its features and interface: JupyterLab User Guide.
Note: In a Jupyter notebook, although you have the flexibility to run code sections in any order, it's recommended to execute them from top to bottom. This ensures that all libraries are loaded and data transformations are applied in the correct sequence, helping to avoid errors and maintain the logical flow of your analysis.
Pandas Series
The Series is the simplest data structure in Pandas, representing a one-dimensional labeled array. You can think of it as a single column of data, along with an index that allows for both position-based and label-based access to elements. The flexibility of a Series makes it well-suited for representing individual data variables, especially when dealing with time series or any ordered dataset.
For example, when analyzing financial data, a Series can be used to represent the prices of a particular stock over a series of dates:
import pandas as pd

# A Series representing stock prices over a series of dates
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
stock_prices = pd.Series([150, 152, 155, 157, 156], index=dates, name='Stock Price')
In this example, the dates provide a DatetimeIndex for the Series, which is well suited to time-series data. The name attribute is an optional label for the Series; it becomes the column name when the Series is added to a DataFrame and is also useful when plotting the data.
We can now display the stock_prices series.
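As a minimal sketch of the position-based and label-based access described above, the snippet below rebuilds the same stock_prices Series and reads values from it with .iloc and .loc. The variable names first_price, jan_third, and first_three are illustrative and do not come from the course notebook.

```python
import pandas as pd

# Rebuild the Series from the example above
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
stock_prices = pd.Series([150, 152, 155, 157, 156], index=dates, name='Stock Price')

# Position-based access: the first element, regardless of its label
first_price = stock_prices.iloc[0]

# Label-based access: the value stored under a specific date
jan_third = stock_prices.loc[pd.Timestamp('2023-01-03')]

# Label-based slicing includes both endpoints
first_three = stock_prices.loc['2023-01-01':'2023-01-03']

print(stock_prices)
print(first_price, jan_third)
print(first_three.tolist())
```

Note that a label-based slice with .loc includes the end label, unlike a position-based slice with .iloc, which excludes the end position.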
Note on DataFrame Display in Jupyter: In a Jupyter notebook, we can display pandas DataFrames (and Series) by writing their name on the last line of a cell, which uses Jupyter's rich display format for better readability. In other environments, such as Python scripts, use print(dataframe) instead, since this works everywhere. For this book, since we're using Jupyter notebooks, we will use the direct method without calling print().
Pandas DataFrame
The second core data structure in pandas is the DataFrame, a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure. It's akin to a spreadsheet or SQL table and is arguably the most important data structure in pandas. A DataFrame consists of an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.), and it has both a row index and a column index. Each row typically represents one record.
The power of the DataFrame lies in its ability to represent real-world data in a way that's both easy to understand and highly accessible. For instance, a DataFrame can hold a complete dataset of stock prices with additional data like opening and closing prices, trading volume, and stock tickers:
import pandas as pd

# DataFrame representing a week (5 days) of stock data
data = {
    'Ticker': ['AAPL', 'GOOG', 'MSFT', 'AAPL', 'GOOG', 'MSFT',
               'AAPL', 'GOOG', 'MSFT', 'AAPL', 'GOOG', 'MSFT',
               'AAPL', 'GOOG', 'MSFT'],
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-01',
                            '2023-01-02', '2023-01-02', '2023-01-02',
                            '2023-01-03', '2023-01-03', '2023-01-03',
                            '2023-01-04', '2023-01-04', '2023-01-04',
                            '2023-01-05', '2023-01-05', '2023-01-05']),
    'Open': [150, 2750, 225, 152, 2775, 230, 153, 2780, 235, 155, 2790, 240, 157, 2800, 245],
    'Close': [155, 2800, 230, 158, 2825, 235, 160, 2830, 240, 162, 2840, 245, 165, 2850, 250],
    'Volume': [1000000, 1200000, 750000, 1100000, 1250000, 800000, 1050000, 1300000, 770000, 1150000, 1350000, 820000, 1200000, 1400000, 850000]
}
stock_week = pd.DataFrame(data)
stock_week.set_index(['Ticker', 'Date'], inplace=True)
In the provided code, data is a Python dictionary containing the sample stock data, with keys representing column names and their corresponding values as lists of data entries. The pd.DataFrame constructor is then used to create a pandas DataFrame from this dictionary.
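To illustrate how dictionary keys become column names and how each column keeps its own value type, here is a small sketch that uses a three-row subset of the sample data. The names sample and column_names are hypothetical and used only for this illustration.

```python
import pandas as pd

# A three-row subset of the sample stock data (illustrative only)
sample = pd.DataFrame({
    'Ticker': ['AAPL', 'GOOG', 'MSFT'],
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-01']),
    'Open': [150, 2750, 225],
    'Close': [155, 2800, 230],
})

# Dictionary keys become the column labels, in insertion order
column_names = list(sample.columns)

# shape is (number of rows, number of columns)
n_rows, n_cols = sample.shape

# Each column carries its own dtype (text, datetime, integer)
print(sample.dtypes)
print(column_names, n_rows, n_cols)
```

Inspecting dtypes right after construction is a quick way to confirm that dates were parsed as datetimes and that numeric columns were not read as text.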
In the stock_week DataFrame example, stocks are tracked across multiple attributes over several days, allowing for more realistic data analysis. The .set_index() method is used to define a MultiIndex based on Ticker and Date, showing how pandas can use hierarchical indexing to represent more complex data relationships.
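The sketch below shows one way to select rows from a DataFrame indexed by Ticker and Date. It uses a four-row subset of the sample data so it can run on its own; stock_sample and the result variable names are assumptions made for this illustration.

```python
import pandas as pd

# A small subset of the sample data, indexed by Ticker and Date
stock_sample = pd.DataFrame({
    'Ticker': ['AAPL', 'GOOG', 'AAPL', 'GOOG'],
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01',
                            '2023-01-02', '2023-01-02']),
    'Close': [155, 2800, 158, 2825],
}).set_index(['Ticker', 'Date'])

# Select all rows for one ticker (first index level)
aapl = stock_sample.loc['AAPL']

# Select a single value with a full (Ticker, Date) key
aapl_jan2_close = stock_sample.loc[('AAPL', pd.Timestamp('2023-01-02')), 'Close']

# Cross-section on the second level: all tickers on one date
jan1 = stock_sample.xs(pd.Timestamp('2023-01-01'), level='Date')

print(aapl)
print(aapl_jan2_close)
print(jan1)
```

Selecting on the first level with .loc drops that level from the result, so aapl is indexed by Date alone; .xs does the same for whichever level is named.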
We can now display the contents of stock_week.
In the next section, we will learn how to manipulate these data structures.