Handling Missing Data - Example - Part 4

LoanAmountCategory

Now, let’s work on the Loan Amount Category column. This also has a few missing values. We’re going to fill these missing values using this logic: If the Loan Amount is 1000 and below, we will fill ‘Small’. If it is 2000 and below but more than 1000, its ‘Medium’. If it is above 2000, it is ‘Large’.

To fill the missing values in the Loan Amount Category column based on the value of Loan Amount, you can use the apply function along with a lambda function to check the conditions and assign the appropriate category. Here is the code to do that:

# Define a function to categorize 'LoanAmount'
def categorize_loan_amount(amount):
    if amount <= 1000:
        return 'Small'
    elif amount <= 2000:
        return 'Medium'
    else:  # This means the amount is above 2000
        return 'Large'

# Apply the function to fill missing 'LoanAmountCategory'
loan_data_cleaned['LoanAmountCategory'] = loan_data_cleaned.apply(
    lambda row: categorize_loan_amount(row['LoanAmount']) if pd.isnull(row['LoanAmountCategory']) else row['LoanAmountCategory'],
    axis=1
)
loan_data_cleaned.head()

If the code runs successfully, it will fill the missing categories as per our logic.

We have two more columns to work on – Total Loans By Customer and Customer Loyalty.

Total Loans By Customer

This numeric field represents the number of loans taken by the customer. Missing values can be filled with the mean or median, but if the distribution is skewed or if a significant number of customers have only one loan, using the median or even a default value like 1 might be more appropriate. Let’s fill it with median.

# Fill missing 'TotalLoansByCustomer' with the median
median_total_loans = loan_data_cleaned['TotalLoansByCustomer'].median()
loan_data_cleaned['TotalLoansByCustomer'].fillna(median_total_loans, inplace=True)
loan_data_cleaned

Customer Loyalty

There are only two categories: Returning, or New. If a user has only 1 loan, fill ‘New’. If it has more than 1 loan, fill ‘Returning’.

#Fill missing 'CustomerLoyalty'
loan_data_cleaned['CustomerLoyalty'] = loan_data_cleaned.apply(
    lambda row: 'New' if row['TotalLoansByCustomer'] == 1 else 'Returning' if pd.isnull(row['CustomerLoyalty']) else row['CustomerLoyalty'],
    axis=1
)
loan_data_cleaned.head()

With this, we’ve filled all the missing values, and along the way also performed some other interesting transformations.

There’s one more small thing I see that I would like to fix before proceeding. The customer names have different capitalizations. Some are small case, some are uppercase, while the others are in title case. Let’s convert all of them to title case.

To convert all customer names in your DataFrame to title case, you can use the str.title() method available on pandas Series objects. Here's the code to do that:

# Convert 'CustomerName' to title case
loan_data_cleaned['CustomerName'] = loan_data_cleaned['CustomerName'].str.title()
loan_data_cleaned.head()

This fixes it.

We will now head to the next important part – Data Transformation and Feature Engineering.

Related Downloads

Data Science in Finance: 9-Book Bundle

Data Science in Finance Book Bundle

Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.

What's Included:

  • Getting Started with R
  • R Programming for Data Science
  • Data Visualization with R
  • Financial Time Series Analysis with R
  • Quantitative Trading Strategies with R
  • Derivatives with R
  • Credit Risk Modelling With R
  • Python for Data Science
  • Machine Learning in Finance using Python

Each book includes PDFs, explanations, instructions, data files, and R code for all examples.

Get the Bundle for $29 (Regular $57)
JOIN 30,000 DATA PROFESSIONALS

Free Guides - Getting Started with R and Python

Enter your name and email address below and we will email you the guides for R programming and Python.

Data Science in Finance: 9-Book Bundle

Data Science in Finance Book Bundle

Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.

What's Included:

  • Getting Started with R
  • R Programming for Data Science
  • Data Visualization with R
  • Financial Time Series Analysis with R
  • Quantitative Trading Strategies with R
  • Derivatives with R
  • Credit Risk Modelling With R
  • Python for Data Science
  • Machine Learning in Finance using Python

Each book comes with PDFs, detailed explanations, step-by-step instructions, data files, and complete downloadable R code for all examples.