Loading and Saving Data using Pandas

Loading Data

In data analysis, the first step is often to load data from various sources. Pandas offers a range of functions to facilitate this process, supporting formats such as CSV, Excel, SQL, JSON, and more. For data analysis, data can often come from CSV files, APIs, or databases.

Reading CSV Files

CSV (Comma-Separated Values) files are one of the most common data formats in data analysis.

In our data folder, we have a file named 'stocks.csv'. It contains daily stock prices data for 5 different stocks. We will load it into our Jupyter session using pandas' pd.read_csv method.

# Load stock data from a CSV file
stocks_df = pd.read_csv('../data/stocks.csv', parse_dates=['Date'], index_col='Date')

In this line of code, stocks.csv is the name of the file we’re reading from. The parse_dates parameter is used to specify which column should be parsed as a date, while index_col is used to set the date column as the index of the DataFrame, providing a time-series structure to the data.

We can now display this dataframe. The head() method in pandas displays the first few rows of a DataFrame. By default, it shows the first five rows, but you can specify a different number if needed within the brackets, like .head(10).

Reading from Excel

Excel files are another common source of data, particularly in finance. Pandas can read data from Excel with the read_excel function. We have some sample data containing a company’s financials for the past 10 years. This data is in an excel file. The below code shows how to load this data as a dataframe.

# Load financial data from an Excel file
financials_df = pd.read_excel('../data/company_financials_10_years.xlsx', sheet_name='Financials', parse_dates=['Date'])

In this example, we are loading data from an Excel file named 'company_financials_10_years.xlsx'. The specific sheet we're accessing in this file is titled 'Financials'. By using the parse_dates parameter with pandas, we instruct it to treat the 'Date' column as a series of date objects, ensuring proper handling of dates within our data. We can now see the loaded dataframe.

There are many advanced concepts while reading data from excel sheets, such as loading data from multiple sheets, loading data from sheets with complex data structures. We’ve provided some sample code below, however, we will practice these in the practice exercises.

Dealing with Multiple Sheets

Excel files often contain multiple sheets with related data. Pandas can handle this by either specifying a particular sheet name or by loading all sheets at once.

# Load a specific sheet
january_data = pd.read_excel('monthly_data.xlsx', sheet_name='January')

# Load all sheets into a dictionary of DataFrames
all_months_data = pd.read_excel('monthly_data.xlsx', sheet_name=None)

When sheet_name=None, pandas returns a dictionary where each key is a sheet name and each value is a DataFrame corresponding to a sheet.

Handling Complex Data Structures

Excel files can sometimes contain complex layouts, such as merged cells or headers spanning multiple rows. Pandas offers parameters like header, skiprows, and usecols to deal with such complexities. Some examples:

# Load a sheet with a header that spans two rows
complex_header_df = pd.read_excel('complex_layout.xlsx', header=[0, 1])

# Load a sheet while skipping the first two rows and using specific columns
specified_data = pd.read_excel('data_with_notes.xlsx', skiprows=2, usecols='A:C')

With these capabilities, you can customize how Pandas interprets the incoming data, ensuring that the DataFrame structure reflects the data's true layout and meaning. Note: The examples of multiple sheets and complex data structures are not presented in the Jupyter notebook. With our stock data and financial data loaded into Pandas, we'll now dive into the first steps of data analysis: exploration and summarization.

Related Downloads

Data Science in Finance: 9-Book Bundle

Data Science in Finance Book Bundle

Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.

What's Included:

  • Getting Started with R
  • R Programming for Data Science
  • Data Visualization with R
  • Financial Time Series Analysis with R
  • Quantitative Trading Strategies with R
  • Derivatives with R
  • Credit Risk Modelling With R
  • Python for Data Science
  • Machine Learning in Finance using Python

Each book includes PDFs, explanations, instructions, data files, and R code for all examples.

Get the Bundle for $39 (Regular $57)
JOIN 30,000 DATA PROFESSIONALS

Free Guides - Getting Started with R and Python

Enter your name and email address below and we will email you the guides for R programming and Python.

Data Science in Finance: 9-Book Bundle

Data Science in Finance Book Bundle

Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.

What's Included:

  • Getting Started with R
  • R Programming for Data Science
  • Data Visualization with R
  • Financial Time Series Analysis with R
  • Quantitative Trading Strategies with R
  • Derivatives with R
  • Credit Risk Modelling With R
  • Python for Data Science
  • Machine Learning in Finance using Python

Each book comes with PDFs, detailed explanations, step-by-step instructions, data files, and complete downloadable R code for all examples.