Encoding Categorical Data in Python pandas


We have three columns with categorical data: LoanStatus, LoanAmountCategory, and CustomerLoyalty. To demonstrate encoding, we will apply it to the LoanStatus column. Since the values in LoanStatus are nominal without any intrinsic order, one-hot encoding is the appropriate technique. It avoids any ordinal implications that label encoding might introduce.
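To make the contrast concrete, here is a minimal sketch on a small made-up Series. The values and variable names are illustrative assumptions, not the article's dataset. It uses pandas category codes as a stand-in for label encoding and pd.get_dummies for one-hot encoding.

```python
import pandas as pd

# Hypothetical sample values standing in for the LoanStatus column
status = pd.Series(["pending", "approved", "rejected", "approved"], name="LoanStatus")

# Label encoding: each category becomes an integer code.
# The codes follow the sorted category order, which implies a ranking
# (approved < pending < rejected) that the data does not actually have.
label_codes = status.astype("category").cat.codes

# One-hot encoding: each category becomes its own 0/1 indicator column,
# so no order is implied between the categories.
one_hot = pd.get_dummies(status, prefix="Status", dtype=int)

print(label_codes.tolist())
print(one_hot)
```

Each row of the one-hot result has exactly one column set to 1, whereas the label codes put the categories on a single numeric scale that a model could misread as a ranking.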

Before we do this, let’s check the values in this column to make sure there are no discrepancies. The following code prints the unique values in the LoanStatus column.

# Print the unique values in the LoanStatus column
print(loan_data_cleaned['LoanStatus'].unique())

# Output:
# ['pending', 'rejected', 'completed', 'approved', 'overdue', 'apprved', 'rejcted']
# Categories (7, object): ['approved', 'apprved', 'completed', 'overdue', 'pending', 'rejcted', 'rejected']

We have a problem. Instead of the expected five categories, we have seven, because two values are misspelled: 'apprved' and 'rejcted'. We need to fix this.
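A check like this can also be automated. The sketch below assumes the five valid statuses are known in advance and uses a small made-up Series in place of the real column. The names `status` and `expected` are illustrative.

```python
import pandas as pd

# Hypothetical sample values standing in for the LoanStatus column
status = pd.Series(
    ["pending", "apprved", "approved", "rejcted", "rejected", "completed", "overdue"]
)

# Assumed set of valid categories
expected = {"approved", "completed", "overdue", "pending", "rejected"}

# Count each value, then keep only the values that are not in the expected set
counts = status.value_counts()
unexpected = counts[~counts.index.isin(expected)]

print(unexpected)
```

Listing the unexpected values together with their counts shows how many rows each misspelling affects before you decide how to correct it.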

To fix the spelling mistakes in the LoanStatus column of our loan_data_cleaned DataFrame, we can use the .replace() method in pandas. We want to correct 'apprved' to 'approved' and 'rejcted' to 'rejected'.

# Correct the spelling mistakes
loan_data_cleaned['LoanStatus'] = loan_data_cleaned['LoanStatus'].replace({'apprved': 'approved', 'rejcted': 'rejected'})

# Verify the changes by printing the unique values again
print(loan_data_cleaned['LoanStatus'].unique())

# Output:
# ['pending', 'rejected', 'completed', 'approved', 'overdue']
# Categories (5, object): ['approved', 'completed', 'overdue', 'pending', 'rejected']

Now, we can perform one-hot encoding on the LoanStatus column using pandas:

# Perform one-hot encoding on 'LoanStatus'
loan_status_encoded = pd.get_dummies(loan_data_cleaned['LoanStatus'], prefix='Status')

# Join the encoded columns to the original DataFrame
loan_data_cleaned = loan_data_cleaned.join(loan_status_encoded)

# Optionally, drop the original 'LoanStatus' column if you no longer need it
loan_data_cleaned = loan_data_cleaned.drop(columns='LoanStatus')

# Verify the changes
loan_data_cleaned.head()

Loan Data Encoded Loan Status
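The encode, join, and drop steps can also be collapsed into a single call by passing the whole DataFrame to pd.get_dummies with the columns argument. The sketch below uses a small made-up DataFrame. The LoanID column and its values are hypothetical and only there to show that other columns are left untouched.

```python
import pandas as pd

# Hypothetical miniature version of the loan data
loans = pd.DataFrame(
    {
        "LoanID": [101, 102, 103],
        "LoanStatus": ["approved", "pending", "rejected"],
    }
)

# Encode only 'LoanStatus'; the original column is replaced by the indicator columns
encoded = pd.get_dummies(loans, columns=["LoanStatus"], prefix="Status")

print(encoded.columns.tolist())
```

With this form, the original LoanStatus column is dropped automatically and the new Status_ columns are appended after the remaining columns.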