Loan Approval Prediction Using Random Forest Classifier
This post walks through a machine learning project for loan approval prediction, covering data preprocessing, feature engineering, and classification with a Random Forest model.
💼 Loan Approval Prediction using Random Forest
📌 Project Overview
Automating loan eligibility is a critical task for financial institutions.
In this project, I developed a machine learning pipeline to predict loan approvals based on applicant profiles. The model uses a Random Forest Classifier to evaluate factors such as:
- Credit history
- Income
- Employment status
This helps determine whether a loan should be approved or not.
🛠️ Technical Stack
The project was built using the Python data science ecosystem:
- Data Manipulation: pandas
- Data Visualization: matplotlib, seaborn
- Machine Learning: scikit-learn
📊 Data Processing Workflow
🔹 1. Data Cleaning & Imputation
Real-world data often contains missing values. These were handled using:
- Categorical Features: Mode (gender, married, dependents, self_employed)
- Numerical Features: Median (loanamount, loan_amount_term)
- Credit History: Mode (most frequent value)
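The imputation rules above can be sketched on a tiny, made-up frame. The column names follow the post, but the values are invented for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are made up for illustration only.
toy = pd.DataFrame({
    "gender": ["Male", "Female", None, "Male"],
    "loanamount": [120.0, np.nan, 200.0, 100.0],
    "credit_history": [1.0, 1.0, np.nan, 0.0],
})

# Categorical and binary columns: fill with the most frequent value (mode).
for col in ["gender", "credit_history"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Numerical columns: fill with the median, which is less sensitive to outliers than the mean.
toy["loanamount"] = toy["loanamount"].fillna(toy["loanamount"].median())

print(toy)
```

The median is a common choice for loan amounts because such columns tend to be skewed, so a few very large loans would pull the mean upward.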
🔹 2. Feature Engineering
To prepare the dataset for modeling:
- Removed Irrelevant Data:
  - Dropped loan_id (no predictive value)
- Label Encoding:
  - Converted loan_status into numeric format
- One-Hot Encoding:
  - Applied pd.get_dummies() to categorical variables
🤖 Model Development
A Random Forest Classifier was selected due to:
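- Strong accuracy on tabular data with little tuning
- Less prone to overfitting than a single decision tree, because predictions are averaged over many trees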
- Ability to handle non-linear relationships
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
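Once fitted, a random forest also exposes a feature_importances_ attribute, which gives a rough view of which inputs the trees relied on. The sketch below uses synthetic data from make_classification as a stand-in, so the feature names are hypothetical and the printed ranking says nothing about the real loan dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the loan data; feature names are hypothetical.
X_arr, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Impurity-based importances: non-negative and summing to 1 across features.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

On the real data, the same two lines after rf.fit(X_train, y_train) would show whether features such as credit history dominate the model's decisions.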
💻 Implementation (Python Code)
Below is the complete implementation of the project:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
df = pd.read_csv("Loan Approval Prediction dataset.csv")
# Basic info
print(df.head())
df.info()  # info() prints its summary directly and returns None
# Handling missing values
# Assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas
for col in ['gender', 'married', 'dependents', 'self_employed']:
    df[col] = df[col].fillna(df[col].mode()[0])
df['loanamount'] = df['loanamount'].fillna(df['loanamount'].median())
df['loan_amount_term'] = df['loan_amount_term'].fillna(df['loan_amount_term'].median())
df['credit_history'] = df['credit_history'].fillna(df['credit_history'].mode()[0])
# Drop unnecessary column
df.drop(columns=['loan_id'], inplace=True)
# Visualization
plt.hist(df["applicantincome"])
plt.title("Applicant Income Distribution")
plt.xlabel("Income")
plt.ylabel("Count")
plt.show()
sns.countplot(data=df, x="loan_status")
plt.title("Loan Status Distribution")
plt.show()
# Encoding
le = LabelEncoder()
df['loan_status'] = le.fit_transform(df['loan_status'])
df = pd.get_dummies(df, drop_first=True)
# Splitting data
X = df.drop("loan_status", axis=1)
y = df["loan_status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Model training
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Predictions
y_pred = rf.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Random Forest Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
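The script above computes the imputation values on the full dataset before splitting, so the test rows influence the fill values. One possible alternative, not the approach used in this project, is to wrap the preprocessing in a scikit-learn Pipeline so it is fitted on the training split only. The sketch below runs on synthetic data; the target rule and all values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the loan data (values are made up for illustration).
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "married": rng.choice(["Yes", "No"], size=n),
    "self_employed": rng.choice(["Yes", "No"], size=n),
    "loanamount": rng.normal(150, 40, size=n),
    "credit_history": rng.choice([0.0, 1.0], size=n, p=[0.2, 0.8]),
})

# Hypothetical target: follows credit history, with about 10% of labels flipped.
base = demo["credit_history"].to_numpy().astype(int)
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - base, base)

# Knock out some values so the imputers have work to do.
for col in ["married", "loanamount", "credit_history"]:
    demo.loc[rng.choice(n, size=15, replace=False), col] = np.nan

# Mode for categorical and binary columns, median for numeric ones, as in the post.
preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["married", "self_employed"]),
    ("num", SimpleImputer(strategy="median"), ["loanamount"]),
    ("bin", SimpleImputer(strategy="most_frequent"), ["credit_history"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    demo, y, test_size=0.2, random_state=42
)

# Imputation values and encoder categories are learned from X_train only.
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print("Accuracy on synthetic data:", acc)
```

With this layout the same fitted object can be applied to new applicants through model.predict(), without repeating the cleaning steps by hand.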
---
📈 Results
- Achieved an accuracy of 78% on the held-out test set (20% of the data)
- Model performed well in predicting loan approvals
- Confusion matrix shows balanced classification performance
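An accuracy figure is easier to interpret next to a trivial baseline, since a model that always predicts the majority class can already score well on imbalanced data. The following sketch shows one way to make that comparison with scikit-learn's DummyClassifier. It uses synthetic data with an assumed 30/70 class split, so the numbers it prints are not this project's results.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (assumed 30/70 class split).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.3, 0.7], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
rf_acc = accuracy_score(y_test, rf.predict(X_test))

print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
print(f"Random Forest accuracy:           {rf_acc:.3f}")
```

Running the same comparison on the real train/test split would show how much of the 78% comes from the model rather than from the class distribution.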
✅ Conclusion
This project demonstrates how machine learning can automate loan approval decisions efficiently.
The Random Forest model provided reliable predictions and handled the dataset effectively.