Handling Missing Data - Example - Part 1

In the previous lesson, we learned about various strategies to handle missing data. Let’s now work through our example to handle missing values. Here’s our missing data report:

Let’s work through each column containing missing data.

Loan Amount

This is a numerical field and an important one. We can delete all rows which have a loan amount missing. Alternatively, the missing values can be filled with either the mean or median of the existing loan amounts. The choice between mean and median depends on whether the data is skewed or not. If there are outliers (very high or very low loan amounts), median is a better choice as it is less affected by outliers. Below we show how you can delete rows with missing values or fill the missing values using Median. You can continue with either approach in your analysis.

# Remove rows where 'LoanAmount' is missing
loan_data_cleaned = loan_data.dropna(subset=['LoanAmount'])

# You can check the shape of the dataframe to confirm the rows were dropped
print(loan_data_cleaned.shape)
loan_data_cleaned

# Fill missing values with median

# Create a copy of the original DataFrame
loan_data_copy = loan_data.copy()

# Convert 'LoanAmount' to numeric in the copied DataFrame, non-numeric values will become NaN
loan_data_copy['LoanAmount'] = pd.to_numeric(loan_data_copy['LoanAmount'], errors='coerce')

# Calculate the median of the 'LoanAmount' column in the copied DataFrame
median_loan_amount = loan_data_copy['LoanAmount'].median()

# Fill missing values in 'LoanAmount' of the copied DataFrame with the calculated median
loan_data_copy['LoanAmount'].fillna(median_loan_amount, inplace=True)

# Optional: Check if there are any missing values left in 'LoanAmount' of the copied DataFrame
print(loan_data_copy['LoanAmount'].isnull().sum())

In the Jupyter notebook, we have only shown the first method, i.e., remove rows with missing LoanAmount values.

Similarly, let’s work with the other missing data also.

Loan Duration Days: Let’s fill it with the number of days between loan start date and loan end date.

Loan Amount Category: If the loan amount is 1000 and below, we will fill ‘Small’. If it is 2000 and below but more than 1000, its ‘Medium’. If it is above 2000, it is ‘Large’.

Total Loans By Customer: Let’s fill it with median.

Customer Loyalty: There are only two categories: Returning, or New. If a user has only 1 loan, fill New. If it has more than 1 loan, fill ‘Returning’.

Related Downloads

Data Science in Finance: 9-Book Bundle

Data Science in Finance Book Bundle

Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.

What's Included:

  • Getting Started with R
  • R Programming for Data Science
  • Data Visualization with R
  • Financial Time Series Analysis with R
  • Quantitative Trading Strategies with R
  • Derivatives with R
  • Credit Risk Modelling With R
  • Python for Data Science
  • Machine Learning in Finance using Python

Each book includes PDFs, explanations, instructions, data files, and R code for all examples.

Get the Bundle for $39 (Regular $57)
JOIN 30,000 DATA PROFESSIONALS

Free Guides - Getting Started with R and Python

Enter your name and email address below and we will email you the guides for R programming and Python.

Data Science in Finance: 9-Book Bundle

Data Science in Finance Book Bundle

Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.

What's Included:

  • Getting Started with R
  • R Programming for Data Science
  • Data Visualization with R
  • Financial Time Series Analysis with R
  • Quantitative Trading Strategies with R
  • Derivatives with R
  • Credit Risk Modelling With R
  • Python for Data Science
  • Machine Learning in Finance using Python

Each book comes with PDFs, detailed explanations, step-by-step instructions, data files, and complete downloadable R code for all examples.