- Pandas - Install Python and Pandas
- Basic Data Structures in Pandas
- Loading and Saving Data using Pandas
- Exploring Data using pandas
- Correlation Analysis using pandas
- Handling Categorical Data and Unique Values using pandas
- Data Visualization using pandas
- Handling Missing Data in Python
- Strategies for Handling Missing Data
- Handling Missing Data - Example - Part 1
- Handling Missing Data - Example - Part 2
- Handling Missing Data - Example - Part 3 (Non-numeric Values)
- Handling Missing Data - Example - Part 4
- Data Transformation and Feature Engineering
- Converting Data Types in Python pandas
- Encoding Categorical Data in Python pandas
- Handling Date and Time Data in Python pandas
- Renaming Columns in Python pandas
- Filtering Rows in a DataFrame in Python
- Merging and Joining Datasets in Python pandas
- Sorting and Indexing Data for Efficient Analysis in Python
Handling Missing Data in Python
In any data analysis workflow, cleaning and preparing the data is often one of the most crucial steps. This process involves handling missing values, correcting data types, dealing with duplicates, and potentially removing outliers.
Handling Missing Data
In any dataset, especially large ones, missing data is a common occurrence and can occur due to various reasons: errors during data collection, changes in the data source, or even by design (e.g., survey non-response). The way you handle missing data can significantly affect your subsequent analysis and results.
In Pandas, NaN (Not a Number) is the standard missing data marker.
Jupyter Notebook and Dataset
Jupyter Notebook: For this section, we will use a new Jupyter notebook. Navigate to the 'notebooks' folder and launch the ' data_manipulation.ipynb' notebook within your Jupyter session.
We will first load pandas library and then load a loans dataset that we’ve exclusively prepared for you to learn data cleaning techniques. The dataset contains lots of discrepancies that we will work on one by one. Filename: loan_data_with_issues.csv
import pandas as pd
# Load the dataset
file_path = '../data/loan_data_with_issues.csv'
loan_data = pd.read_csv(file_path)
Let’s use the .info() method on this loan data to gain insights into the DataFrame's structure, including the number of entries, the data types of each column, and the presence of any missing values.
# Get some info on the data
There are 52 rows and 10 different columns, like the customer's name, loan amounts, and IDs. Not all columns are fully filled out — a few like 'LoanDurationDays' have some blanks. Most of the stuff is text (that's what 'object' means here), but a couple of columns are numbers ('float64').
Identifying Missing Data
Before we handle missing data, we need to identify where and how much data is missing. Pandas offers two methods, isnull() and notnull(), to identify missing and non-missing values, respectively.
# Detecting missing values in the dataset
missing_data = loan_data.isnull()
The isnull() function returns a DataFrame where each cell is either True if missing or False if not. Use the following code, to get a summary count of missing values per column:
# Counting the number of missing values per column
missing_counts = loan_data.isnull().sum()
This is a bit more useful. We can see that our missing values are in Loan Amount, Loan Duration Days, Loan Amount Category, Total Loans, and Customer Loyalty. For each of these columns, we can decide how we want to fill those values. In some cases we may also decide to completely remove that row, or flag it so that it can be further investigated.
Free Guides - Getting Started with R and Python
Enter your name and email address below and we will email you the guides for R programming and Python.