Basic Data Structures in Pandas
Pandas is known for its powerful data structures, which are specifically designed for efficient data manipulation and analysis. The two primary structures provided by Pandas are the Series and DataFrame. They form the backbone of the library, allowing users to work with data in a way that is both intuitive and aligned with how data is structured in many real-world situations.
Members can download the data and code files from the sidebar or the course page.
Jupyter Notebook: Navigate to the 'notebooks' folder and launch the 'data_structures.ipynb' notebook within your Jupyter session.
If you're new to Jupyter, I recommend using this guide to familiarize yourself with its features and interface: JupyterLab User Guide.
Note: In a Jupyter notebook, although you have the flexibility to run code sections in any order, it's recommended to execute them from top to bottom. This ensures that all libraries are loaded and data transformations are applied in the correct sequence, helping to avoid errors and maintain the logical flow of your analysis.
Pandas Series
The Series is the simplest data structure in Pandas, representing a one-dimensional labeled array. You can think of it as a single column of data, along with an index that allows for both position-based and label-based access to elements. The flexibility of a Series makes it well-suited for representing individual data variables, especially when dealing with time series or any ordered dataset.
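To make the difference between position-based and label-based access concrete, here is a minimal sketch. The weekday labels and the variable names are hypothetical, chosen only for illustration; `.iloc` and `.loc` are the standard pandas accessors for the two access styles.

```python
import pandas as pd

# Hypothetical prices labeled by weekday (illustrative values only)
prices = pd.Series([150, 152, 155], index=['Mon', 'Tue', 'Wed'])

first_by_position = prices.iloc[0]    # position-based: the first element
tuesday_by_label = prices.loc['Tue']  # label-based: the element labeled 'Tue'

print(first_by_position)
print(tuesday_by_label)
```

Using `.iloc` and `.loc` explicitly, rather than plain square brackets, keeps it unambiguous whether a value is meant as a position or as a label.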
For example, when analyzing financial data, a Series can be used to represent the prices of a particular stock over a series of dates:
import pandas as pd
# A Series representing stock prices over a series of dates
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
stock_prices = pd.Series([150, 152, 155, 157, 156], index=dates, name='Stock Price')
In this example, the dates provide a DatetimeIndex for the Series, which is well suited to time-series data. The name attribute is an optional way to give the Series a label, which is particularly useful when it becomes part of a DataFrame or when plotting the data.
We can now display the stock_prices series.
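As a rough sketch of what this looks like outside a notebook, the block below rebuilds the same Series, prints it, and then checks the two points made above: the type of the index and the role of the name. The variable `prices_frame` is introduced here only for illustration.

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
stock_prices = pd.Series([150, 152, 155, 157, 156], index=dates, name='Stock Price')

# print() shows the index, the values, and the Series name
print(stock_prices)

# The index built by pd.date_range is a DatetimeIndex
print(type(stock_prices.index).__name__)

# The Series name becomes the column label when converted to a DataFrame
prices_frame = stock_prices.to_frame()
print(prices_frame.columns.tolist())
```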
*Note on Displaying pandas Objects in Jupyter: In a Jupyter notebook, we can display a Series or DataFrame by writing its name as the last line of a cell, which uses Jupyter's rich display format for better readability. In other environments, such as Python scripts, use print(dataframe) instead, since it works everywhere. For this book, since we’re using Jupyter notebooks, we will use the direct method without calling print.*
Pandas DataFrame
The second data structure in pandas is the DataFrame, a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure. It's akin to a spreadsheet or SQL table and is arguably the most important data structure in pandas. A DataFrame consists of an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.), and it has both a row index and a column index. Each row represents one record.
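A small sketch may help show what "heterogeneous" means in practice. The `records` DataFrame and its column names are hypothetical and exist only for this illustration; the point is that each column carries its own dtype while all columns share one row index.

```python
import pandas as pd

# Hypothetical records: each column holds a different value type
records = pd.DataFrame({
    'Ticker': ['AAPL', 'GOOG'],   # strings
    'Close': [155.0, 2800.0],     # floats
    'Traded': [True, False],      # booleans
})

print(records.dtypes)            # one dtype per column
print(records.index.tolist())    # row index (default integer labels)
print(records.columns.tolist())  # column index
```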
The power of the DataFrame lies in its ability to represent real-world data in a way that's both easy to understand and highly accessible. For instance, a DataFrame can hold a complete dataset of stock prices with additional data like opening and closing prices, trading volume, and stock tickers:
import pandas as pd
# DataFrame representing a week (5 days) of stock data
data = {
    'Ticker': ['AAPL', 'GOOG', 'MSFT', 'AAPL', 'GOOG', 'MSFT',
               'AAPL', 'GOOG', 'MSFT', 'AAPL', 'GOOG', 'MSFT',
               'AAPL', 'GOOG', 'MSFT'],
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-01',
                            '2023-01-02', '2023-01-02', '2023-01-02',
                            '2023-01-03', '2023-01-03', '2023-01-03',
                            '2023-01-04', '2023-01-04', '2023-01-04',
                            '2023-01-05', '2023-01-05', '2023-01-05']),
    'Open': [150, 2750, 225, 152, 2775, 230, 153, 2780, 235,
             155, 2790, 240, 157, 2800, 245],
    'Close': [155, 2800, 230, 158, 2825, 235, 160, 2830, 240,
              162, 2840, 245, 165, 2850, 250],
    'Volume': [1000000, 1200000, 750000, 1100000, 1250000, 800000,
               1050000, 1300000, 770000, 1150000, 1350000, 820000,
               1200000, 1400000, 850000]
}
stock_week = pd.DataFrame(data)
stock_week.set_index(['Ticker', 'Date'], inplace=True)
In the provided code, data is a Python dictionary containing the sample stock data, with keys representing column names and their corresponding values as lists of data entries. The pd.DataFrame constructor is then used to create a pandas DataFrame from this dictionary.
In the stock_week DataFrame example, three stocks are tracked across several attributes over five days. The .set_index() method is used to define a MultiIndex based on Ticker and Date; with inplace=True it modifies stock_week directly instead of returning a new DataFrame. This shows how pandas uses hierarchical indexing to represent more complex data relationships.
We can now display the contents of stock_week.
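The sketch below shows one way to display and query a DataFrame indexed by Ticker and Date. To stay short it rebuilds only the first two days of the data under a different name, `stock_days`, and it adds a `sort_index()` call that is not part of the original code, since lookups on a sorted MultiIndex avoid pandas performance warnings. The variable names `aapl`, `aapl_jan2`, and `jan1` are illustrative.

```python
import pandas as pd

# First two days of the sample data, to keep the sketch short
data = {
    'Ticker': ['AAPL', 'GOOG', 'MSFT', 'AAPL', 'GOOG', 'MSFT'],
    'Date': pd.to_datetime(['2023-01-01'] * 3 + ['2023-01-02'] * 3),
    'Open': [150, 2750, 225, 152, 2775, 230],
    'Close': [155, 2800, 230, 158, 2825, 235],
}
# sort_index() is an addition here, not part of the original example
stock_days = pd.DataFrame(data).set_index(['Ticker', 'Date']).sort_index()

print(stock_days)

# All rows for one ticker: select on the first index level
aapl = stock_days.loc['AAPL']

# A single row: give one label per index level as a tuple
aapl_jan2 = stock_days.loc[('AAPL', pd.Timestamp('2023-01-02'))]

# All tickers on one date: take a cross-section on the 'Date' level
jan1 = stock_days.xs(pd.Timestamp('2023-01-01'), level='Date')
```

Selecting on the first level with `.loc` drops that level from the result, so `aapl` is indexed by Date alone, while `jan1` is indexed by Ticker.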
In the next section, we will learn how to manipulate these data structures.