In data analysis, often there is a need to combine data from different sources into a single dataset. This process, known as merging or joining, is crucial when you're working with relational data. Pandas provides several functions to efficiently merge and join datasets.
Merging
Merging in pandas is similar to database-style joins. The merge function is used to combine two (or more) DataFrames on a common set of keys.
The how parameter in the merge function specifies how the merge should be done. It can take values like 'left', 'right', 'inner', and 'outer', similar to SQL joins.
inner: Returns only the rows with matching keys in both DataFrames.
outer: Returns all rows from both DataFrames, fills in NaNs for missing matches.
left: Returns all rows from the left DataFrame and matched rows from the right DataFrame.
right: Returns all rows from the right DataFrame and matched rows from the left DataFrame.
You can merge DataFrames based on columns or indices.
Joining
Joining in pandas is another technique to combine DataFrames. The join function is used for combining DataFrames based on their indices. It provides a more straightforward way to combine DataFrames but offers less flexibility than merge.
Like merge, join also supports 'left', 'right', 'inner', and 'outer' joins.
Both merging and joining are essential tools in pandas, enabling you to bring together data in different formats and from various sources to create a comprehensive dataset for analysis.
To practice this, we have created two datasets: Customers dataset, and Transactions dataset.
Customers Dataset: Contains customer IDs, names, and some attributes.
Transactions Dataset: Contains transaction IDs, customer IDs (linking to the customers dataset), and transaction details.
Jupyter notebook: merge_join.ipynb
These datasets are provided as csv files. Let’s start with loading these two datasets in Python.
1import pandas as pd
23# Load your datasets4customers_df = pd.read_csv('../data/customers_data.csv')5transactions_df = pd.read_csv('../data/transactions_data.csv')67
We can preview the data in both dataframes using the head() function.
Merging Datasets
Let's assume you want to merge the customers and transactions datasets on the CustomerID field to create a comprehensive dataset that includes customer details along with their transaction details. This can be done using pandas' merge function.
1# Merge the datasets2merged_df = pd.merge(customers_df, transactions_df, on='CustomerID', how='inner')34# Display the merged DataFrame5merged_df
67
In this example, how='inner' is used, which means the merged DataFrame will contain only those records that have matching CustomerID values in both datasets. You can change the how parameter to 'left', 'right', or 'outer' for different types of joins.
In the context of this dataset, where the key relationship between customers and transactions is defined through the CustomerID field rather than the indices of the rows, a join operation based on indices (as typically used with the join method) is not the most appropriate choice.
Joining on indices is most effective when the datasets have a meaningful and matching index structure, like time-series data aligned on dates or a one-to-one relationship between the rows of each dataset. In this case, where the relationship is defined through a common field (CustomerID) and not the row indices, merging is indeed the more suitable method.
Unlock Premium Content
Upgrade your account to access the full article, downloads, and exercises.