- Pandas - Install Python and Pandas
- Basic Data Structures in Pandas
- Loading and Saving Data using Pandas
- Exploring Data using pandas
- Correlation Analysis using pandas
- Handling Categorical Data and Unique Values using pandas
- Data Visualization using pandas
- Handling Missing Data in Python
- Strategies for Handling Missing Data
- Handling Missing Data - Example - Part 1
- Handling Missing Data - Example - Part 2
- Handling Missing Data - Example - Part 3 (Non-numeric Values)
- Handling Missing Data - Example - Part 4
- Data Transformation and Feature Engineering
- Converting Data Types in Python pandas
- Encoding Categorical Data in Python pandas
- Handling Date and Time Data in Python pandas
- Renaming Columns in Python pandas
- Filtering Rows in a DataFrame in Python
- Merging and Joining Datasets in Python pandas
- Sorting and Indexing Data for Efficient Analysis in Python
Handling Missing Data - Example - Part 4
LoanAmountCategory
Now, let’s work on the Loan Amount Category column. This also has a few missing values. We’re going to fill these missing values using this logic: If the Loan Amount is 1000 and below, we will fill ‘Small’. If it is 2000 and below but more than 1000, its ‘Medium’. If it is above 2000, it is ‘Large’.
To fill the missing values in the Loan Amount Category column based on the value of Loan Amount, you can use the apply function along with a lambda function to check the conditions and assign the appropriate category. Here is the code to do that:
# Define a function to categorize 'LoanAmount'
def categorize_loan_amount(amount):
if amount <= 1000:
return 'Small'
elif amount <= 2000:
return 'Medium'
else: # This means the amount is above 2000
return 'Large'
# Apply the function to fill missing 'LoanAmountCategory'
loan_data_cleaned['LoanAmountCategory'] = loan_data_cleaned.apply(
lambda row: categorize_loan_amount(row['LoanAmount']) if pd.isnull(row['LoanAmountCategory']) else row['LoanAmountCategory'],
axis=1
)
loan_data_cleaned.head()
If the code runs successfully, it will fill the missing categories as per our logic.
We have two more columns to work on – Total Loans By Customer and Customer Loyalty.
Total Loans By Customer
This numeric field represents the number of loans taken by the customer. Missing values can be filled with the mean or median, but if the distribution is skewed or if a significant number of customers have only one loan, using the median or even a default value like 1 might be more appropriate. Let’s fill it with median.
# Fill missing 'TotalLoansByCustomer' with the median
median_total_loans = loan_data_cleaned['TotalLoansByCustomer'].median()
loan_data_cleaned['TotalLoansByCustomer'].fillna(median_total_loans, inplace=True)
loan_data_cleaned
Customer Loyalty
There are only two categories: Returning, or New. If a user has only 1 loan, fill ‘New’. If it has more than 1 loan, fill ‘Returning’.
#Fill missing 'CustomerLoyalty'
loan_data_cleaned['CustomerLoyalty'] = loan_data_cleaned.apply(
lambda row: 'New' if row['TotalLoansByCustomer'] == 1 else 'Returning' if pd.isnull(row['CustomerLoyalty']) else row['CustomerLoyalty'],
axis=1
)
loan_data_cleaned.head()
With this, we’ve filled all the missing values, and along the way also performed some other interesting transformations.
There’s one more small thing I see that I would like to fix before proceeding. The customer names have different capitalizations. Some are small case, some are uppercase, while the others are in title case. Let’s convert all of them to title case.
To convert all customer names in your DataFrame to title case, you can use the str.title() method available on pandas Series objects. Here's the code to do that:
# Convert 'CustomerName' to title case
loan_data_cleaned['CustomerName'] = loan_data_cleaned['CustomerName'].str.title()
loan_data_cleaned.head()
This fixes it.
We will now head to the next important part – Data Transformation and Feature Engineering.
Related Downloads
Data Science in Finance: 9-Book Bundle
Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.
What's Included:
- Getting Started with R
- R Programming for Data Science
- Data Visualization with R
- Financial Time Series Analysis with R
- Quantitative Trading Strategies with R
- Derivatives with R
- Credit Risk Modelling With R
- Python for Data Science
- Machine Learning in Finance using Python
Each book includes PDFs, explanations, instructions, data files, and R code for all examples.
Get the Bundle for $29 (Regular $57)Free Guides - Getting Started with R and Python
Enter your name and email address below and we will email you the guides for R programming and Python.