Handling Missing Data - Example - Part 4


Now, let’s work on the Loan Amount Category column. This also has a few missing values. We’re going to fill these missing values using this logic: If the Loan Amount is 1000 and below, we will fill ‘Small’. If it is 2000 and below but more than 1000, its ‘Medium’. If it is above 2000, it is ‘Large’.

To fill the missing values in the Loan Amount Category column based on the value of Loan Amount, you can use the apply function along with a lambda function to check the conditions and assign the appropriate category. Here is the code to do that:

# Define a function to categorize 'LoanAmount'
def categorize_loan_amount(amount):
    if amount <= 1000:
        return 'Small'
    elif amount <= 2000:
        return 'Medium'
    else:  # This means the amount is above 2000
        return 'Large'

# Apply the function to fill missing 'LoanAmountCategory'
loan_data_cleaned['LoanAmountCategory'] = loan_data_cleaned.apply(
    lambda row: categorize_loan_amount(row['LoanAmount']) if pd.isnull(row['LoanAmountCategory']) else row['LoanAmountCategory'],

If the code runs successfully, it will fill the missing categories as per our logic.

We have two more columns to work on – Total Loans By Customer and Customer Loyalty.

Total Loans By Customer

This numeric field represents the number of loans taken by the customer. Missing values can be filled with the mean or median, but if the distribution is skewed or if a significant number of customers have only one loan, using the median or even a default value like 1 might be more appropriate. Let’s fill it with median.

# Fill missing 'TotalLoansByCustomer' with the median
median_total_loans = loan_data_cleaned['TotalLoansByCustomer'].median()
loan_data_cleaned['TotalLoansByCustomer'].fillna(median_total_loans, inplace=True)

Customer Loyalty

There are only two categories: Returning, or New. If a user has only 1 loan, fill ‘New’. If it has more than 1 loan, fill ‘Returning’.

#Fill missing 'CustomerLoyalty'
loan_data_cleaned['CustomerLoyalty'] = loan_data_cleaned.apply(
    lambda row: 'New' if row['TotalLoansByCustomer'] == 1 else 'Returning' if pd.isnull(row['CustomerLoyalty']) else row['CustomerLoyalty'],

With this, we’ve filled all the missing values, and along the way also performed some other interesting transformations.

There’s one more small thing I see that I would like to fix before proceeding. The customer names have different capitalizations. Some are small case, some are uppercase, while the others are in title case. Let’s convert all of them to title case.

To convert all customer names in your DataFrame to title case, you can use the str.title() method available on pandas Series objects. Here's the code to do that:

# Convert 'CustomerName' to title case
loan_data_cleaned['CustomerName'] = loan_data_cleaned['CustomerName'].str.title()

This fixes it.

We will now head to the next important part – Data Transformation and Feature Engineering.

Related Downloads

Learn the skills required to excel in data science and data analytics covering R, Python, machine learning, and AI.

Free Guides - Getting Started with R and Python

Enter your name and email address below and we will email you the guides for R programming and Python.

Saylient AI Logo

Take the Next Step in Your Data Career

Join our membership for lifetime unlimited access to all our data analytics and data science learning content and resources.