Handling Missing Data - Example

LoanAmountCategory

Now, let’s work on the Loan Amount Category column, which also has a few missing values. We’ll fill them using this logic: if the Loan Amount is 1000 or below, fill ‘Small’; if it is above 1000 but no more than 2000, fill ‘Medium’; if it is above 2000, fill ‘Large’.

To fill the missing values in the Loan Amount Category column based on the value of Loan Amount, you can use the DataFrame’s apply method with a lambda function that checks each row and assigns a category only where one is missing. Here is the code to do that:

import pandas as pd  # needed for pd.isnull; skip if already imported

# Define a function to categorize 'LoanAmount'
def categorize_loan_amount(amount):
    if amount <= 1000:
        return 'Small'
    elif amount <= 2000:
        return 'Medium'
    else:  # This means the amount is above 2000
        return 'Large'

# Apply the function row by row, filling only the missing 'LoanAmountCategory' values
loan_data_cleaned['LoanAmountCategory'] = loan_data_cleaned.apply(
    lambda row: categorize_loan_amount(row['LoanAmount']) if pd.isnull(row['LoanAmountCategory']) else row['LoanAmountCategory'],
    axis=1
)
loan_data_cleaned.head()

If the code runs successfully, it will fill the missing categories as per our logic.
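As an alternative sketch, the same thresholds can be applied without a row-wise apply by binning with pd.cut and filling only the gaps. The toy DataFrame below is hypothetical and only stands in for loan_data_cleaned; the column names are taken from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for loan_data_cleaned
toy = pd.DataFrame({
    "LoanAmount": [500, 1000, 1500, 2000, 2500],
    "LoanAmountCategory": ["Small", None, None, "Medium", None],
})

# pd.cut bins are right-inclusive by default:
# (-inf, 1000] -> Small, (1000, 2000] -> Medium, (2000, inf) -> Large
derived = pd.cut(
    toy["LoanAmount"],
    bins=[-np.inf, 1000, 2000, np.inf],
    labels=["Small", "Medium", "Large"],
).astype(object)

# fillna with a Series fills only the missing entries, aligned on the index
toy["LoanAmountCategory"] = toy["LoanAmountCategory"].fillna(derived)
print(toy)
```

One difference worth knowing: if a Loan Amount were itself missing, pd.cut would leave the category missing, whereas the if/elif/else function would fall through to ‘Large’ because comparisons with NaN are always False.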

We have two more columns to work on – Total Loans By Customer and Customer Loyalty.

Total Loans By Customer

This numeric field represents the number of loans taken by the customer. Missing values can be filled with the mean or median, but if the distribution is skewed or a significant number of customers have only one loan, the median or even a default value like 1 may be more appropriate. Let’s fill it with the median.

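# Fill missing 'TotalLoansByCustomer' with the median
median_total_loans = loan_data_cleaned['TotalLoansByCustomer'].median()
# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which triggers a chained-assignment warning in recent pandas versions
loan_data_cleaned['TotalLoansByCustomer'] = loan_data_cleaned['TotalLoansByCustomer'].fillna(median_total_loans)
loan_data_cleaned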
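To see why the median can be the safer choice for a skewed count, here is a small sketch on made-up numbers (the values are hypothetical, not taken from the loan dataset). A single heavy borrower pulls the mean well above what a typical customer looks like, while the median stays put.

```python
import pandas as pd

# Hypothetical loan counts: most customers have 1 loan, one has 12, two are missing
loans = pd.Series([1, 1, 1, 2, None, 1, 12, None], name="TotalLoansByCustomer")

mean_value = loans.mean()      # pulled upward by the customer with 12 loans
median_value = loans.median()  # stays at the typical value
skewness = loans.skew()        # a positive value indicates a long right tail

# Fill the gaps with the median and keep the result
filled = loans.fillna(median_value)

print(f"mean={mean_value}, median={median_value}, skew={skewness:.2f}")
print(filled.tolist())
```

Comparing the mean and median, or checking skew(), before choosing a fill value is a quick sanity check rather than a strict rule.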

Customer Loyalty

There are only two categories: Returning or New. If a customer has only 1 loan, fill ‘New’. If they have more than 1 loan, fill ‘Returning’.

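# Fill missing 'CustomerLoyalty' based on the customer's loan count
loan_data_cleaned['CustomerLoyalty'] = loan_data_cleaned.apply(
    # The parentheses matter: without them, existing values would be
    # overwritten with 'New' whenever the loan count is 1
    lambda row: ('New' if row['TotalLoansByCustomer'] == 1 else 'Returning') if pd.isnull(row['CustomerLoyalty']) else row['CustomerLoyalty'],
    axis=1
)
loan_data_cleaned.head()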
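The same rule can also be sketched in vectorized form, which makes it easy to confirm that only missing entries are touched. The toy DataFrame is hypothetical; two of its rows deliberately carry labels that disagree with the rule, to show that existing values are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for loan_data_cleaned
toy = pd.DataFrame({
    "TotalLoansByCustomer": [1, 3, 1, 2],
    "CustomerLoyalty": [None, None, "Returning", "New"],
})

# Label every row by the rule: exactly 1 loan -> 'New', otherwise 'Returning'
derived = pd.Series(
    np.where(toy["TotalLoansByCustomer"] == 1, "New", "Returning"),
    index=toy.index,
)

# Use the derived label only where 'CustomerLoyalty' is missing
toy["CustomerLoyalty"] = toy["CustomerLoyalty"].fillna(derived)
print(toy)
```

The first two rows receive ‘New’ and ‘Returning’ from the rule, while the last two keep their original labels.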

With this, we’ve filled all the missing values, and along the way also performed some other interesting transformations.

There’s one more small thing I would like to fix before proceeding. The customer names have inconsistent capitalization: some are lowercase, some are uppercase, and others are in title case. Let’s convert all of them to title case.

To convert all customer names in your DataFrame to title case, you can use the str.title() method available on pandas Series objects. Here's the code to do that:

# Convert 'CustomerName' to title case
loan_data_cleaned['CustomerName'] = loan_data_cleaned['CustomerName'].str.title()
loan_data_cleaned.head()

This fixes it.
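A quick sketch on made-up names (hypothetical, not from the dataset) shows what str.title() does, including two edge cases that may matter for real customer names.

```python
import pandas as pd

# Hypothetical names with mixed capitalization and one missing value
names = pd.Series(["john smith", "JANE DOE", "Alice Brown", "o'neil", "mcdonald", None])

titled = names.str.title()
print(titled.tolist())
```

Each word is capitalized and the missing value stays missing. Note that str.title() capitalizes any letter that follows a non-letter, so "o'neil" becomes "O'Neil", while "mcdonald" becomes "Mcdonald" rather than "McDonald". Names like these may need manual review.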

We will now head to the next important part – Data Transformation and Feature Engineering.
