In any data analysis workflow, cleaning and preparing the data is often one of the most crucial steps. This process involves handling missing values, correcting data types, dealing with duplicates, and potentially removing outliers.
Handling Missing Data
In any dataset, especially large ones, missing data is a common occurrence and can occur due to various reasons: errors during data collection, changes in the data source, or even by design (e.g., survey non-response). The way you handle missing data can significantly affect your subsequent analysis and results.
In Pandas, NaN (Not a Number) is the standard missing data marker.
Jupyter Notebook and Dataset
Jupyter Notebook: For this section, we will use a new Jupyter notebook. Navigate to the 'notebooks' folder and launch the ' data_manipulation.ipynb' notebook within your Jupyter session.
We will first load pandas library and then load a loans dataset that we’ve exclusively prepared for you to learn data cleaning techniques. The dataset contains lots of discrepancies that we will work on one by one. Filename:loan_data_with_issues.csv
1import pandas as pd
23# Load the dataset4file_path ='../data/loan_data_with_issues.csv'5loan_data = pd.read_csv(file_path)6loan_data.head()78
Let’s use the .info() method on this loan data to gain insights into the DataFrame's structure, including the number of entries, the data types of each column, and the presence of any missing values.
1# Get some info on the data2loan_data.info()34
There are 52 rows and 10 different columns, like the customer's name, loan amounts, and IDs. Not all columns are fully filled out — a few like 'LoanDurationDays' have some blanks. Most of the stuff is text (that's what 'object' means here), but a couple of columns are numbers ('float64').
Identifying Missing Data
Before we handle missing data, we need to identify where and how much data is missing. Pandas offers two methods, isnull() and notnull(), to identify missing and non-missing values, respectively.
1# Detecting missing values in the dataset 2missing_data = loan_data.isnull()3missing_data.head()45
The isnull() function returns a DataFrame where each cell is either True if missing or False if not. Use the following code, to get a summary count of missing values per column:
1# Counting the number of missing values per column 2missing_counts = loan_data.isnull().sum()3print(missing_counts)45
This is a bit more useful. We can see that our missing values are in Loan Amount, Loan Duration Days, Loan Amount Category, Total Loans, and Customer Loyalty. For each of these columns, we can decide how we want to fill those values. In some cases we may also decide to completely remove that row, or flag it so that it can be further investigated.
Unlock Premium Content
Upgrade your account to access the full article, downloads, and exercises.