Handling Missing Data in Python

In any data analysis workflow, cleaning and preparing the data is often one of the most crucial steps. This process involves handling missing values, correcting data types, dealing with duplicates, and potentially removing outliers.

Handling Missing Data

In any dataset, especially a large one, missing data is common. It can arise from errors during data collection, changes in the data source, or by design (e.g., survey non-response). How you handle missing data can significantly affect your subsequent analysis and results.

In Pandas, NaN (Not a Number) is the standard missing data marker.
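To make the NaN marker concrete, here is a minimal sketch on a small Series. The values are made up for illustration and are not taken from the loans dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from the loans dataset
amounts = pd.Series([1200.0, None, 850.0, np.nan])

# None is converted to NaN when the Series holds floats
print(amounts)

# NaN never compares equal to anything, including itself,
# so an equality check cannot find it
print((amounts == np.nan).sum())   # 0

# isna() (isnull() is an alias for it) is the reliable check
print(amounts.isna().sum())        # 2
```

The practical takeaway is to detect missing values with isnull() or isna() rather than with ==.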

Jupyter Notebook and Dataset

Jupyter Notebook: For this section, we will use a new Jupyter notebook. Navigate to the 'notebooks' folder and launch the 'data_manipulation.ipynb' notebook within your Jupyter session.

We will first import the pandas library and then load a loans dataset that we’ve prepared specifically for practicing data cleaning techniques. The dataset contains many discrepancies that we will work through one by one. Filename: loan_data_with_issues.csv

import pandas as pd

# Load the dataset
file_path = '../data/loan_data_with_issues.csv'
loan_data = pd.read_csv(file_path)

Let’s use the .info() method on this loan data to gain insights into the DataFrame's structure, including the number of entries, the data types of each column, and the presence of any missing values.

# Get some info on the data
loan_data.info()

There are 52 rows and 10 columns, including the customer's name, loan amounts, and IDs. Not all columns are fully populated: a few, such as 'LoanDurationDays', have some missing entries. Most columns hold text (that's what 'object' means here), while a couple are numeric ('float64').
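If you want the same facts as objects you can work with rather than as printed text, the sketch below rebuilds them from .shape, .dtypes, and .count(). It uses a toy DataFrame with hypothetical column names and values, not the real loans data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; not the real loans data
toy = pd.DataFrame({
    "CustomerName": ["Ana", "Ben", None],
    "LoanAmount": [5000.0, np.nan, 7500.0],
})

# The pieces that info() prints, as objects you can reuse
n_rows, n_cols = toy.shape     # number of entries and columns
dtypes = toy.dtypes            # data type of each column
non_null = toy.count()         # non-missing entries per column

# Combine them into one small table, one row per column
summary = pd.DataFrame({"dtype": dtypes, "non_null": non_null})
summary["missing"] = n_rows - summary["non_null"]
print(summary)
```

A column whose non-null count is lower than the number of rows contains missing values, which is how the gaps in 'LoanDurationDays' show up in the .info() output.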

Identifying Missing Data

Before we handle missing data, we need to identify where and how much data is missing. Pandas offers two methods, isnull() and notnull(), to identify missing and non-missing values, respectively.

# Detecting missing values in the dataset 
missing_data = loan_data.isnull() 

The isnull() method returns a DataFrame of the same shape in which each cell is True if the value is missing and False otherwise. Use the following code to get a summary count of missing values per column:

# Counting the number of missing values per column 
missing_counts = loan_data.isnull().sum() 

This summary is more useful. We can see that our missing values are in Loan Amount, Loan Duration Days, Loan Amount Category, Total Loans, and Customer Loyalty. For each of these columns, we can decide how to fill the missing values. In some cases we may instead remove the row entirely, or flag it for further investigation.
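The three options mentioned above (fill, remove, or flag) can be sketched as follows. This is an illustration on a toy DataFrame with made-up values, and filling with the median is just one possible strategy, not necessarily the right choice for the loans data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names are illustrative
toy = pd.DataFrame({
    "LoanAmount": [5000.0, np.nan, 7500.0, 3000.0],
    "LoanDurationDays": [30.0, 60.0, np.nan, 90.0],
})

# Share of missing values per column (mean of the True/False flags)
missing_share = toy.isnull().mean()

# Option 1: flag rows with any missing value for later review
flagged = toy.assign(NeedsReview=toy.isnull().any(axis=1))

# Option 2: fill a column, here with its median (an assumed strategy)
median_amount = toy["LoanAmount"].median()
filled = toy.assign(LoanAmount=toy["LoanAmount"].fillna(median_amount))

# Option 3: drop every row that has at least one missing value
dropped = toy.dropna()

print(missing_share)
print(flagged)
print(filled)
print(dropped)
```

Each operation returns a new DataFrame and leaves the original untouched, so the results can be compared before committing to one approach.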
