- Pandas - Install Python and Pandas
- Basic Data Structures in Pandas
- Loading and Saving Data using Pandas
- Exploring Data using pandas
- Correlation Analysis using pandas
- Handling Categorical Data and Unique Values using pandas
- Data Visualization using pandas
- Handling Missing Data in Python
- Strategies for Handling Missing Data
- Handling Missing Data - Example - Part 1
- Handling Missing Data - Example - Part 2
- Handling Missing Data - Example - Part 3 (Non-numeric Values)
- Handling Missing Data - Example - Part 4
- Data Transformation and Feature Engineering
- Converting Data Types in Python pandas
- Encoding Categorical Data in Python pandas
- Handling Date and Time Data in Python pandas
- Renaming Columns in Python pandas
- Filtering Rows in a DataFrame in Python
- Merging and Joining Datasets in Python pandas
- Sorting and Indexing Data for Efficient Analysis in Python
Loading and Saving Data using Pandas
Loading Data
In data analysis, the first step is often to load data from various sources. Pandas offers a range of functions to facilitate this process, supporting formats such as CSV, Excel, SQL, JSON, and more. For data analysis, data can often come from CSV files, APIs, or databases.
Reading CSV Files
CSV (Comma-Separated Values) files are one of the most common data formats in data analysis.
In our data folder, we have a file named 'stocks.csv'. It contains daily stock prices data for 5 different stocks. We will load it into our Jupyter session using pandas' pd.read_csv method.
# Load stock data from a CSV file
stocks_df = pd.read_csv('../data/stocks.csv', parse_dates=['Date'], index_col='Date')
In this line of code, stocks.csv is the name of the file we’re reading from. The parse_dates parameter is used to specify which column should be parsed as a date, while index_col is used to set the date column as the index of the DataFrame, providing a time-series structure to the data.
We can now display this dataframe. The head() method in pandas displays the first few rows of a DataFrame. By default, it shows the first five rows, but you can specify a different number if needed within the brackets, like .head(10).
Reading from Excel
Excel files are another common source of data, particularly in finance. Pandas can read data from Excel with the read_excel function. We have some sample data containing a company’s financials for the past 10 years. This data is in an excel file. The below code shows how to load this data as a dataframe.
# Load financial data from an Excel file
financials_df = pd.read_excel('../data/company_financials_10_years.xlsx', sheet_name='Financials', parse_dates=['Date'])
In this example, we are loading data from an Excel file named 'company_financials_10_years.xlsx'.
The specific sheet we're accessing in this file is titled 'Financials'. By using the parse_dates parameter with pandas, we instruct it to treat the 'Date' column as a series of date objects, ensuring proper handling of dates within our data. We can now see the loaded dataframe.
There are many advanced concepts while reading data from excel sheets, such as loading data from multiple sheets, loading data from sheets with complex data structures. We’ve provided some sample code below, however, we will practice these in the practice exercises.
Dealing with Multiple Sheets
Excel files often contain multiple sheets with related data. Pandas can handle this by either specifying a particular sheet name or by loading all sheets at once.
# Load a specific sheet
january_data = pd.read_excel('monthly_data.xlsx', sheet_name='January')
# Load all sheets into a dictionary of DataFrames
all_months_data = pd.read_excel('monthly_data.xlsx', sheet_name=None)
When sheet_name=None, pandas returns a dictionary where each key is a sheet name and each value is a DataFrame corresponding to a sheet.
Handling Complex Data Structures
Excel files can sometimes contain complex layouts, such as merged cells or headers spanning multiple rows. Pandas offers parameters like header, skiprows, and usecols to deal with such complexities. Some examples:
# Load a sheet with a header that spans two rows
complex_header_df = pd.read_excel('complex_layout.xlsx', header=[0, 1])
# Load a sheet while skipping the first two rows and using specific columns
specified_data = pd.read_excel('data_with_notes.xlsx', skiprows=2, usecols='A:C')
With these capabilities, you can customize how Pandas interprets the incoming data, ensuring that the DataFrame structure reflects the data's true layout and meaning. Note: The examples of multiple sheets and complex data structures are not presented in the Jupyter notebook. With our stock data and financial data loaded into Pandas, we'll now dive into the first steps of data analysis: exploration and summarization.
Related Downloads
Data Science in Finance: 9-Book Bundle
Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.
What's Included:
- Getting Started with R
- R Programming for Data Science
- Data Visualization with R
- Financial Time Series Analysis with R
- Quantitative Trading Strategies with R
- Derivatives with R
- Credit Risk Modelling With R
- Python for Data Science
- Machine Learning in Finance using Python
Each book includes PDFs, explanations, instructions, data files, and R code for all examples.
Get the Bundle for $29 (Regular $57)Free Guides - Getting Started with R and Python
Enter your name and email address below and we will email you the guides for R programming and Python.