- Pandas - Install Python and Pandas
- Basic Data Structures in Pandas
- Loading and Saving Data using Pandas
- Exploring Data using pandas
- Correlation Analysis using pandas
- Handling Categorical Data and Unique Values using pandas
- Data Visualization using pandas
- Handling Missing Data in Python
- Strategies for Handling Missing Data
- Handling Missing Data - Example - Part 1
- Handling Missing Data - Example - Part 2
- Handling Missing Data - Example - Part 3 (Non-numeric Values)
- Handling Missing Data - Example - Part 4
- Data Transformation and Feature Engineering
- Converting Data Types in Python pandas
- Encoding Categorical Data in Python pandas
- Handling Date and Time Data in Python pandas
- Renaming Columns in Python pandas
- Filtering Rows in a DataFrame in Python
- Merging and Joining Datasets in Python pandas
- Sorting and Indexing Data for Efficient Analysis in Python
Handling Missing Data - Example - Part 2
Let’s start with LoanDurationDays. Since we are filling missing values with the number of days between LoanStartDate and LoanEndDate, it’s important that we first ensure that the dates are in correct format.
This code snippet cleans and standardizes date data in a DataFrame, specifically in the 'LoanStartDate' column. It defines a function, parse_dates, to convert dates into a consistent format, handling two common formats and marking unparseable dates as missing (pd.NaT). This function is applied to the 'LoanStartDate' column of loan_data_cleaned.
# Fix issues with start date and end date
loan_data_cleaned = loan_data_cleaned.copy()
from dateutil.parser import parse
def parse_dates(date):
for fmt in ("%d/%m/%y", "%Y-%m-%d"): # Add more formats here if needed
try:
return pd.to_datetime(date, format=fmt)
except ValueError:
continue
return pd.NaT # Return NaT for unparseable dates
# Apply the custom parse function to the 'LoanStartDate' column
loan_data_cleaned['LoanStartDate'] = loan_data_cleaned['LoanStartDate'].apply(parse_dates)
# Check for any NaT values indicating invalid or unparseable dates
invalid_dates = loan_data_cleaned[loan_data_cleaned['LoanStartDate'].isna()]
print("Invalid or unparseable dates:\n", invalid_dates[['CustomerName', 'LoanStartDate']])
We will apply the same date transformation to LoanEndDate also.
# Apply the custom parse function to the 'LoanStartDate' column
loan_data_cleaned['LoanEndDate'] = loan_data_cleaned['LoanEndDate'].apply(parse_dates)
loan_data_cleaned.head()
At this stage all the dates have been formatted to a common format, and invalid dates have been marked is NoT. We should decide how to handle these NaT values before calculating the LoanDurationDays. Typically, if you do not have a reliable way to determine what the correct start date should be for the entries with NaT, you might consider:
- Removing those rows, especially if they are a small fraction of the data.
- Imputing them, if you have a business rule or logic for imputation.
In our case we will remove the rows containing NaT values.
# Remove rows where 'LoanStartDate' is NaT
loan_data_cleaned = loan_data_cleaned.dropna(subset=['LoanStartDate'])
# Check the result
loan_data_cleaned['LoanStartDate'].isna().sum(), loan_data_cleaned.shape
loan_data_cleaned.head()
Once you've handled the NaT values, here's how you could fill in the missing LoanDurationDays with the number of days between the start and end date:
# Assuming 'LoanStartDate' and 'LoanEndDate' have been converted to datetime and
# any NoT or incorrect dates have been handled appropriately.
# Calculate the duration only for the rows where 'LoanDurationDays' is missing
loan_data_cleaned['LoanDurationDays'] = loan_data_cleaned.apply(
lambda row: (row['LoanEndDate'] - row['LoanStartDate']).days
if pd.isnull(row['LoanDurationDays']) else row['LoanDurationDays'],
axis=1
)
Make sure both LoanStartDate and LoanEndDate are in the datetime format before performing this operation. If they are not, convert them using pd.to_datetime()
. The lambda function in the apply method calculates the duration only for the rows with missing LoanDurationDays. If LoanDurationDays is not missing, the original value is retained.
Related Downloads
Data Science in Finance: 9-Book Bundle
Master R and Python for financial data science with our comprehensive bundle of 9 ebooks.
What's Included:
- Getting Started with R
- R Programming for Data Science
- Data Visualization with R
- Financial Time Series Analysis with R
- Quantitative Trading Strategies with R
- Derivatives with R
- Credit Risk Modelling With R
- Python for Data Science
- Machine Learning in Finance using Python
Each book includes PDFs, explanations, instructions, data files, and R code for all examples.
Get the Bundle for $39 (Regular $57)Free Guides - Getting Started with R and Python
Enter your name and email address below and we will email you the guides for R programming and Python.