Let’s start with LoanDurationDays. Since we are filling missing values with the number of days between LoanStartDate and LoanEndDate, it’s important that we first ensure that the dates are in correct format.
This code snippet cleans and standardizes date data in a DataFrame, specifically in the 'LoanStartDate' column. It defines a function, parse_dates, to convert dates into a consistent format, handling two common formats and marking unparseable dates as missing (pd.NaT). This function is applied to the 'LoanStartDate' column of loan_data_cleaned.
1# Fix issues with start date and end date2loan_data_cleaned = loan_data_cleaned.copy()3from dateutil.parser import parse
45defparse_dates(date):6for fmt in("%d/%m/%y","%Y-%m-%d"):# Add more formats here if needed7try:8return pd.to_datetime(date,format=fmt)9except ValueError:10continue11return pd.NaT # Return NaT for unparseable dates1213# Apply the custom parse function to the 'LoanStartDate' column14loan_data_cleaned['LoanStartDate']= loan_data_cleaned['LoanStartDate'].apply(parse_dates)1516# Check for any NaT values indicating invalid or unparseable dates17invalid_dates = loan_data_cleaned[loan_data_cleaned['LoanStartDate'].isna()]18print("Invalid or unparseable dates:\n", invalid_dates[['CustomerName','LoanStartDate']])1920
We will apply the same date transformation to LoanEndDate also.
1# Apply the custom parse function to the 'LoanStartDate' column2loan_data_cleaned['LoanEndDate']= loan_data_cleaned['LoanEndDate'].apply(parse_dates)3loan_data_cleaned.head()45
At this stage all the dates have been formatted to a common format, and invalid dates have been marked is NoT. We should decide how to handle these NaT values before calculating the LoanDurationDays. Typically, if you do not have a reliable way to determine what the correct start date should be for the entries with NaT, you might consider:
Removing those rows, especially if they are a small fraction of the
data.
Imputing them, if you have a business rule or logic for imputation.
In our case we will remove the rows containing NaT values.
1# Remove rows where 'LoanStartDate' is NaT2loan_data_cleaned = loan_data_cleaned.dropna(subset=['LoanStartDate'])34# Check the result5loan_data_cleaned['LoanStartDate'].isna().sum(), loan_data_cleaned.shape
6loan_data_cleaned.head()78
Once you've handled the NaT values, here's how you could fill in the missing LoanDurationDays with the number of days between the start and end date:
1# Assuming 'LoanStartDate' and 'LoanEndDate' have been converted to datetime and2# any NoT or incorrect dates have been handled appropriately.34# Calculate the duration only for the rows where 'LoanDurationDays' is missing5loan_data_cleaned['LoanDurationDays']= loan_data_cleaned.apply(6lambda row:(row['LoanEndDate']- row['LoanStartDate']).days
7if pd.isnull(row['LoanDurationDays'])else row['LoanDurationDays'],8 axis=19)10
Make sure both LoanStartDate and LoanEndDate are in the datetime format before performing this operation. If they are not, convert them using pd.to_datetime(). The lambda function in the apply method calculates the duration only for the rows with missing LoanDurationDays. If LoanDurationDays is not missing, the original value is retained.
Unlock Premium Content
Upgrade your account to access the full article, downloads, and exercises.