Encoding Categorical Data in Python pandas

We have three columns with categorical data: LoanStatus, LoanAmountCategory, and CustomerLoyalty. To demonstrate encoding, we will apply it to the LoanStatus column. Since the values in LoanStatus are nominal without any intrinsic order, one-hot encoding is the appropriate technique. It avoids any ordinal implications that label encoding might introduce.
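To make the difference concrete, here is a minimal sketch that contrasts the two techniques. The toy Series below is hypothetical and only borrows a few LoanStatus values from this lesson; it is not the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration, not the lesson's DataFrame)
status = pd.Series(['pending', 'approved', 'rejected', 'approved'], name='LoanStatus')

# Label encoding: each category becomes an integer code.
# Categories are sorted alphabetically, so approved=0, pending=1, rejected=2.
label_codes = status.astype('category').cat.codes
print(label_codes.tolist())

# One-hot encoding: one indicator column per category, with no implied order
one_hot = pd.get_dummies(status, prefix='Status', dtype=int)
print(one_hot)
```

With label encoding, a model could read 'rejected' (2) as "greater than" 'pending' (1), which has no meaning for nominal data. The one-hot columns carry no such ranking.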

Before we do this, let's check the unique values in this column to make sure there are no discrepancies. The following code returns the unique values in the LoanStatus column.

# print unique values in LoanStatus column
print(loan_data_cleaned['LoanStatus'].unique())

# Output:
# ['pending', 'rejected', 'completed', 'approved', 'overdue', 'apprved', 'rejcted']
# Categories (7, object): ['approved', 'apprved', 'completed', 'overdue', 'pending', 'rejcted', 'rejected']

We have a problem. Instead of the expected five categories, we have seven, because two values are misspelled: 'apprved' and 'rejcted'. We need to fix these before encoding; otherwise each misspelling would get its own one-hot column.

To fix the spelling mistakes in the 'LoanStatus' column of our loan_data_cleaned DataFrame, we can use the .replace() method in pandas. We want to correct 'apprved' to 'approved' and 'rejcted' to 'rejected'.

# Correcting the spelling mistakes
loan_data_cleaned['LoanStatus'] = loan_data_cleaned['LoanStatus'].replace({'apprved': 'approved', 'rejcted': 'rejected'})

# Verify the changes by printing unique values again
print(loan_data_cleaned['LoanStatus'].unique())

# Output:
# ['pending', 'rejected', 'completed', 'approved', 'overdue']
# Categories (5, object): ['approved', 'completed', 'overdue', 'pending', 'rejected']
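When a column has many distinct values, spotting misspellings by eye gets harder. As a rough sketch, and not the approach used in this lesson, the standard-library difflib module can suggest a correction for each unexpected value. The toy Series, the list of valid labels, and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib
import pandas as pd

# Hypothetical toy column containing the two misspellings seen above
status = pd.Series(['pending', 'rejected', 'completed', 'approved',
                    'overdue', 'apprved', 'rejcted'])

# Assumed list of valid labels (the five expected categories)
valid = ['approved', 'completed', 'overdue', 'pending', 'rejected']

# For each unexpected value, look up the closest valid label
corrections = {}
for value in status.unique():
    if value not in valid:
        match = difflib.get_close_matches(value, valid, n=1, cutoff=0.8)
        if match:
            corrections[value] = match[0]

print(corrections)

# Apply the suggested corrections and confirm only valid labels remain
status_fixed = status.replace(corrections)
print(sorted(status_fixed.unique()))
```

Suggested corrections should be reviewed before they are applied, since a fuzzy match can pair up two labels that are both legitimate.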

Now, we can perform one-hot encoding on the LoanStatus column using pandas:

# Perform one-hot encoding on 'LoanStatus'
loan_status_encoded = pd.get_dummies(loan_data_cleaned['LoanStatus'], prefix='Status')

# Join the encoded columns to the original DataFrame (join aligns rows on the index)
loan_data_cleaned = loan_data_cleaned.join(loan_status_encoded)

# Optionally, you can drop the original 'LoanStatus' column if you no longer need it
loan_data_cleaned.drop('LoanStatus', axis=1, inplace=True)

# Verify the changes
loan_data_cleaned.head()
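As an alternative, pd.get_dummies can encode a column and drop the original in one call when it is given the whole DataFrame and a columns argument. The sketch below uses a hypothetical two-column DataFrame (the LoanID column and its values are invented for illustration) and passes dtype=int so the indicators are 0/1 integers instead of booleans.

```python
import pandas as pd

# Hypothetical toy DataFrame (an assumption for illustration, not the lesson's data)
loans = pd.DataFrame({
    'LoanID': [101, 102, 103],
    'LoanStatus': ['approved', 'pending', 'approved'],
})

# Encode 'LoanStatus' in place of the original column in a single step
encoded = pd.get_dummies(loans, columns=['LoanStatus'], prefix='Status', dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Whether to use this form or the explicit join-and-drop steps above is mostly a matter of preference. The explicit version is useful when you want to keep the original column alongside the encoded ones.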
