Consumer Loan Cost Prediction with Lasso Regression in ML


Banks issue thousands of personal and retail loans every month, yet pricing those loans in a way that benefits both borrower and lender remains challenging. In this project, we treat a loan’s effective annual interest rate (the borrower’s cost of credit) as the numeric target and build a Lasso‑regularised linear model that:

  • Predicts the expected interest rate for a new loan application using applicant demographics, credit history, requested amount, term, and purpose.
  • Reveals which borrower attributes drive pricing, because Lasso’s ℓ1 penalty zeroes out weak predictors and keeps the model explainable for risk officers.
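
The sparsity claim in the second bullet can be checked on synthetic data. The sketch below is not the loan dataset: the five random features, the true coefficients, and the value alpha=0.2 are illustrative assumptions. scikit-learn's Lasso minimises (1/(2n))·||y − Xw||² + α·||w||₁, so coefficients whose signal is weaker than the penalty are set exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 500

# Five standard-normal features; only the first two actually drive the target.
X = rng.standard_normal((n, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# alpha=0.2 is an arbitrary illustrative penalty strength, not a tuned value.
model = Lasso(alpha=0.2).fit(X, y)
coef = model.coef_

# The two informative coefficients are shrunk towards zero but survive;
# the three noise features should come out as exact zeros.
print(np.round(coef, 2))
```

The informative coefficients come out slightly smaller in magnitude than the true 3 and −2. That shrinkage is the price of the ℓ1 penalty and is why α is tuned by cross-validation later on.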

Libraries Required

Purpose        Library
Data handling  pandas, numpy
Visuals        matplotlib, seaborn
ML pipeline    scikit‑learn: Lasso, Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, GridSearchCV
Metrics        scikit‑learn: mean_squared_error, r2_score

Dataset Link

Loan Default Dataset

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

2. Download and load the dataset

The dataset contains historical consumer‑loan records: borrower profile, requested amount, repayment term, purpose, credit score, employment length, debt‑to‑income ratio, and the interest_rate actually charged.

# One‑time shell command (needs Kaggle API & key):
# kaggle datasets download -d yasserh/loan-default-dataset -p data --unzip

loans = pd.read_csv("data/loan_default_train.csv")   # adjust filename if needed
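
Because the later steps refer to specific column names, it can help to confirm that they exist in the downloaded file before going further. The following is a minimal sketch: `missing_columns` is a hypothetical helper, the expected names are simply those used in this article's code, and the one-row frame stands in for the real CSV.

```python
import pandas as pd

# Column names that the code in this walkthrough refers to.
EXPECTED = ["interest_rate", "loan_amnt", "annual_inc", "loan_status"]

def missing_columns(df, expected=EXPECTED):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [c for c in expected if c not in df.columns]

# Toy frame standing in for the real CSV; it deliberately lacks 'annual_inc'.
toy = pd.DataFrame({
    "interest_rate": [11.5],
    "loan_amnt": [8000],
    "loan_status": ["paid"],
})

missing = missing_columns(toy)
print("Missing columns:", missing)
```

If the list is not empty for the real file, renaming the columns with `DataFrame.rename` right after loading keeps the rest of the code unchanged.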

3. Quick EDA

print(loans.head())
print(loans[['interest_rate','loan_amnt','annual_inc']].describe())
sns.histplot(loans['interest_rate'], kde=True); plt.title('Interest‑Rate Distribution'); plt.show()

4. Define target and features

y = loans['interest_rate']           # borrower’s cost in %
X = loans.drop(columns=['interest_rate'])

5. Pre‑processing recipe

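Pre-processing one-hot encodes the categorical variables and standardises the numeric ones to zero mean and unit variance. Because the Lasso penalty acts on the size of each coefficient, putting the features on a comparable scale keeps the penalty from favouring a feature merely because of its units.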

cat_cols = X.select_dtypes('object').columns
num_cols = X.select_dtypes(exclude='object').columns

preprocess = ColumnTransformer([
    # sparse_output replaces the older 'sparse' argument (scikit-learn >= 1.2)
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
    ('num', StandardScaler(), num_cols)
])

6. Train/test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=loans['loan_status'])

7. Build and tune the Lasso pipeline

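The hyper‑parameter search scans 30 α values on a log scale. A small α keeps many predictors, while a large α enforces sparsity. Five-fold cross-validation selects the α with the lowest validation RMSE, which guards against over‑fitting to the training data.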

pipe = Pipeline([
    ('prep', preprocess),
    ('model', Lasso(max_iter=10_000, random_state=42))
])

param_grid = {'model__alpha': np.logspace(-3, 1, 30)}   # 0.001–10
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error', n_jobs=-1)
search.fit(X_train, y_train)

print("Best α:", search.best_params_['model__alpha'])
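
Beyond the single best α, the fitted search object also stores the cross-validated score for every candidate. The sketch below shows one way to tabulate them. It runs on synthetic data with a bare Lasso rather than the full pipeline, so the names `X_demo`, `y_demo`, and `demo_search` are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)

# Synthetic stand-in: six features, only the first two are informative.
X_demo = rng.standard_normal((300, 6))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

alphas = np.logspace(-3, 1, 30)   # same grid as in the article: 0.001 to 10
demo_search = GridSearchCV(
    Lasso(max_iter=10_000),
    {"alpha": alphas},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

# The scorer is negated RMSE, so flip the sign to get RMSE back.
cv_table = pd.DataFrame({
    "alpha": [p["alpha"] for p in demo_search.cv_results_["params"]],
    "cv_rmse": -demo_search.cv_results_["mean_test_score"],
}).sort_values("cv_rmse")
print(cv_table.head())

# How many coefficients the best model keeps.
n_kept = int(np.sum(demo_search.best_estimator_.coef_ != 0))
print("Best alpha:", demo_search.best_params_["alpha"], "| non-zero coefficients:", n_kept)
```

With the article's pipeline, the same table can be built from `search.cv_results_`, where the parameter key is `model__alpha` instead of `alpha`.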

8. Evaluate on the hold‑out set

RMSE reports the typical pricing error in interest‑rate percentage points, a unit understood by credit teams. R² shows what share of pricing variance is captured.

y_pred = search.predict(X_test)
# Take the square root explicitly; the 'squared' argument is gone from recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f}% | R²: {r2:.3f}")
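
As a worked illustration of the two metrics, the following sketch computes them by hand with NumPy. The four actual and predicted rates are made-up numbers, not results from the loan data.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 14.0, 16.0])   # actual rates, in %
y_hat = np.array([11.0, 12.0, 13.0, 17.0])    # predicted rates, in %

residuals = y_true - y_hat                     # [-1, 0, 1, -1]

# RMSE: square root of the mean squared residual, sqrt(3 / 4).
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus residual sum of squares over total sum of squares, 1 - 3 / 20.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.3f} | R²: {r2_manual:.3f}")   # RMSE: 0.866 | R²: 0.850
```

Here an RMSE of about 0.87 means predictions miss the actual rate by a little under one percentage point on a root-mean-square basis, and an R² of 0.85 means 85% of the variance in the rates is explained.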

9. Interpret coefficients

Non‑zero coefficients spot the levers that matter—e.g. credit score bands, loan purpose “debt‑consolidation”, term length—while zeroed ones can be dropped from data‑collection to cut costs.

ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])

coefs = search.best_estimator_.named_steps['model'].coef_
imp = (pd.Series(coefs, index=feature_names)
         .sort_values(key=abs, ascending=False))

plt.figure(figsize=(9,6))
imp.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Loan Pricing (Lasso Coefficients)')
plt.xlabel('Coefficient')
plt.show()
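
The chart shows the strongest drivers, but the list of features that Lasso removed is just as useful. Here is a minimal sketch of how the two groups could be separated. The coefficient values and feature names below are hypothetical placeholders, not output from the fitted model.

```python
import pandas as pd

# Hypothetical coefficients, standing in for a Series like 'imp' above.
demo_coefs = pd.Series({
    "num__credit_score": -1.8,
    "cat__purpose_debt_consolidation": 0.6,
    "num__term": 0.9,
    "num__emp_length": 0.0,
    "cat__purpose_car": 0.0,
})

# Features the model kept, ordered by absolute effect size.
kept = demo_coefs[demo_coefs != 0].sort_values(key=abs, ascending=False)

# Features whose coefficients were shrunk exactly to zero.
dropped = sorted(demo_coefs.index[demo_coefs == 0])

print("Kept:")
print(kept)
print("Zeroed out:", dropped)
```

Because the numeric inputs were standardised, each kept numeric coefficient is the change in predicted rate per one standard deviation of that feature, which is what makes the magnitudes comparable.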

Summary

Using Lasso regression, we built an interpretable pipeline that predicts consumer‑loan interest rates, reports its hold‑out error in percentage points through RMSE, and highlights the most influential borrower traits. Such transparency supports compliance with fair-lending regulations and gives pricing teams actionable insights. The finished notebook can be refreshed each quarter with new origination data, letting risk managers track pricing drift and re-tune α without rewriting code.


ProjectGurukul Team
