In data analysis, the first step is often to load data from various sources, such as CSV files, Excel workbooks, JSON documents, SQL databases, and APIs. Pandas offers a range of functions to facilitate this process, supporting all of these formats and more.
Reading CSV Files
CSV (Comma-Separated Values) files are one of the most common data formats in data analysis.
In our data folder, we have a file named 'stocks.csv'. It contains daily stock price data for 5 different stocks. We will load it into our Jupyter session using pandas' pd.read_csv function.
```python
import pandas as pd

# Load stock data from a CSV file
stocks_df = pd.read_csv('../data/stocks.csv', parse_dates=['Date'], index_col='Date')
```
In this code, stocks.csv is the name of the file we’re reading from. The parse_dates parameter specifies which columns should be parsed as dates, while index_col sets the date column as the index of the DataFrame, giving the data a time-series structure.
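One practical benefit of the time-series structure is label-based date slicing. The sketch below mimics the load step with a small inline CSV (the values are hypothetical, standing in for stocks.csv) and then selects a month of data directly from the DatetimeIndex:

```python
import io
import pandas as pd

# A small inline CSV standing in for stocks.csv (hypothetical values)
csv_text = """Date,Close
2023-01-03,100.5
2023-01-04,101.2
2023-02-01,99.8
"""
stocks_df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

# With a DatetimeIndex, .loc supports partial string indexing,
# e.g. selecting every row from January 2023 in one step
january = stocks_df.loc['2023-01']
print(january)
```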
We can now display this dataframe. The head() method in pandas displays the first few rows of a DataFrame. By default, it shows the first five rows, but you can specify a different number if needed within the parentheses, like .head(10).
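A quick illustration of the default and the explicit row count (using a small throwaway DataFrame rather than the stock data):

```python
import pandas as pd

# A ten-row DataFrame just to demonstrate head()
df = pd.DataFrame({'A': range(10)})

print(df.head())    # first five rows by default
print(df.head(3))   # first three rows when a count is given
```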
Reading from Excel
Excel files are another common source of data, particularly in finance. Pandas can read data from Excel with the read_excel function. We have some sample data containing a company’s financials for the past 10 years. This data is in an Excel file. The code below shows how to load it as a DataFrame.
```python
# Load financial data from an Excel file
financials_df = pd.read_excel('../data/company_financials_10_years.xlsx', sheet_name='Financials', parse_dates=['Date'])
```
In this example, we are loading data from an Excel file named 'company_financials_10_years.xlsx'. The specific sheet we're accessing in this file is titled 'Financials'. By using the parse_dates parameter with pandas, we instruct it to treat the 'Date' column as a series of date objects, ensuring proper handling of dates within our data. We can now see the loaded dataframe.
There are several more advanced techniques for reading data from Excel sheets, such as loading data from multiple sheets or from sheets with complex layouts. We’ve provided some sample code below; we will practice these techniques in the practice exercises.
Dealing with Multiple Sheets
Excel files often contain multiple sheets with related data. Pandas can handle this by either specifying a particular sheet name or by loading all sheets at once.
```python
# Load a specific sheet
january_data = pd.read_excel('monthly_data.xlsx', sheet_name='January')

# Load all sheets into a dictionary of DataFrames
all_months_data = pd.read_excel('monthly_data.xlsx', sheet_name=None)
```
When sheet_name=None, pandas returns a dictionary where each key is a sheet name and each value is a DataFrame corresponding to a sheet.
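A common next step is combining that dictionary into a single DataFrame. The sketch below builds the dictionary in memory (with hypothetical values) to stand in for the result of pd.read_excel(..., sheet_name=None), then concatenates the sheets while tagging each row with its sheet name:

```python
import pandas as pd

# Simulate the dict that pd.read_excel('monthly_data.xlsx', sheet_name=None)
# would return: sheet name -> DataFrame (values here are hypothetical)
all_months_data = {
    'January': pd.DataFrame({'sales': [100, 200]}),
    'February': pd.DataFrame({'sales': [150, 250]}),
}

# pd.concat uses the dict keys as an outer index level; naming that level
# and resetting it turns the sheet name into a regular column
combined = pd.concat(all_months_data, names=['Month']).reset_index(level=0)
print(combined)
```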
Handling Complex Data Structures
Excel files can sometimes contain complex layouts, such as merged cells or headers spanning multiple rows. Pandas offers parameters like header, skiprows, and usecols to deal with such complexities. Some examples:
```python
# Load a sheet with a header that spans two rows
complex_header_df = pd.read_excel('complex_layout.xlsx', header=[0, 1])

# Load a sheet while skipping the first two rows and using specific columns
specified_data = pd.read_excel('data_with_notes.xlsx', skiprows=2, usecols='A:C')
```
With these capabilities, you can customize how Pandas interprets the incoming data, ensuring that the DataFrame structure reflects the data's true layout and meaning.
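To see what header=[0, 1] produces, here is a minimal sketch using inline CSV text (read_csv accepts the same header parameter as read_excel, and the data is hypothetical). The two header rows become a column MultiIndex, so individual columns are addressed by tuples:

```python
import io
import pandas as pd

# CSV text with a two-row header, mimicking a merged-header Excel sheet
csv_text = """Revenue,Revenue,Costs
Q1,Q2,Q1
100,120,80
"""
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Columns are now a MultiIndex; select one sub-column by tuple
print(df[('Revenue', 'Q1')])
```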
Note: The examples of multiple sheets and complex data structures are not presented in the Jupyter notebook.
With our stock data and financial data loaded into Pandas, we'll now dive into the first steps of data analysis: exploration and summarization.