Consumer Loan Cost Prediction with Lasso Regression in ML
Banks issue thousands of personal and retail loans every month, yet pricing those loans in a way that benefits both borrower and lender remains challenging. In this project, we treat a loan’s effective annual interest rate (the borrower’s cost of credit) as the numeric target and build a Lasso‑regularised linear model that:
- Predicts the expected interest rate for a new loan application using applicant demographics, credit history, requested amount, term, and purpose.
- Reveals which borrower attributes drive pricing, because Lasso’s ℓ1 penalty zeroes out weak predictors and keeps the model explainable for risk officers.
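To make the ℓ1 effect concrete, here is a minimal sketch on synthetic data (not the loan dataset). It assumes that only two of eight made-up features truly influence the target. The names `X_demo` and `y_demo` and the value `alpha=0.2` are illustrative choices, not part of the project.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic stand-in data: 8 features, only the first two carry signal (assumption).
rng = np.random.default_rng(0)
n_samples, n_features = 200, 8
X_demo = rng.normal(size=(n_samples, n_features))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Ordinary least squares gives every feature a (small) non-zero weight.
ols = LinearRegression().fit(X_demo, y_demo)

# Lasso adds an l1 penalty, which can set weak coefficients exactly to zero.
lasso = Lasso(alpha=0.2).fit(X_demo, y_demo)

n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print(f"Exact zeros -> OLS: {n_zero_ols}, Lasso: {n_zero_lasso}")
```

The surviving Lasso coefficients are also shrunk toward zero relative to the least-squares ones, which is the price paid for the sparser, easier-to-explain model.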
Libraries Required
| Purpose | Library |
| --- | --- |
| Data handling | pandas, numpy |
| Visuals | matplotlib, seaborn |
| ML pipeline | scikit‑learn → Lasso, Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, GridSearchCV |
| Metrics | mean_squared_error, r2_score |
Dataset Link
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
2. Download and load the dataset
The dataset contains historical consumer‑loan records: borrower profile, requested amount, repayment term, purpose, credit score, employment length, debt‑to‑income ratio, and the interest_rate actually charged.
# One‑time shell command (needs Kaggle API & key):
# kaggle datasets download -d yasserh/loan-default-dataset -p data --unzip
loans = pd.read_csv("data/loan_default_train.csv") # adjust filename if needed
3. Quick EDA
print(loans.head())
print(loans[['interest_rate','loan_amnt','annual_inc']].describe())
sns.histplot(loans['interest_rate'], kde=True)
plt.title('Interest‑Rate Distribution')
plt.show()
4. Define target and features
y = loans['interest_rate']  # borrower’s cost in %
X = loans.drop(columns=['interest_rate'])
5. Pre‑processing recipe
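Pre-processing applies one-hot encoding to categorical variables and z-scales numeric ones. Because the Lasso penalty acts on coefficient magnitude, putting the numeric features on a common scale keeps the penalty from favouring features that merely have large units.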
cat_cols = X.select_dtypes('object').columns
num_cols = X.select_dtypes(exclude='object').columns
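preprocess = ColumnTransformer([
    # sparse_output replaced the old sparse argument in scikit-learn 1.2;
    # on older versions use sparse=False instead
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
    ('num', StandardScaler(), num_cols)
])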
6. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=loans['loan_status'])
7. Build and tune the Lasso pipeline
Hyper‑parameter search scans 30 α values on a log scale. Small α keeps many predictors, while large α enforces sparsity. Five-fold cross-validation guards against over‑fitting when choosing between them.
pipe = Pipeline([
('prep', preprocess),
('model', Lasso(max_iter=10_000, random_state=42))
])
param_grid = {'model__alpha': np.logspace(-3, 1, 30)} # 0.001–10
search = GridSearchCV(pipe, param_grid, cv=5,
scoring='neg_root_mean_squared_error', n_jobs=-1)
search.fit(X_train, y_train)
print("Best α:", search.best_params_['model__alpha'])
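The trade-off between α, sparsity, and cross-validated error can be seen in a small sweep. This is a sketch on synthetic data with made-up names (`X_syn`, `y_syn`) and only five α values, not a substitute for the grid search above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 8 features, two of which drive the target (assumption).
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(200, 8))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=200)

alphas = np.logspace(-3, 1, 5)  # 0.001, 0.01, 0.1, 1, 10
rows = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000)
    # scikit-learn returns negated RMSE so that "higher is better"
    scores = cross_val_score(model, X_syn, y_syn, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse = -scores.mean()
    # Refit on all rows to count how many coefficients survive the penalty
    n_nonzero = int(np.count_nonzero(model.fit(X_syn, y_syn).coef_))
    rows.append((alpha, cv_rmse, n_nonzero))
    print(f"alpha={alpha:<7g} CV RMSE={cv_rmse:.3f}  non-zero coefs={n_nonzero}")

best_alpha = min(rows, key=lambda r: r[1])[0]
print("Lowest CV RMSE at alpha =", best_alpha)
```

At the largest α the penalty removes every coefficient and the model falls back to predicting the mean, which is why the error rises sharply at that end of the grid.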
8. Evaluate on the hold‑out set
RMSE reports average pricing error in interest‑rate percentage points, a unit understood by credit teams. R² shows what share of pricing variance is captured.
y_pred = search.predict(X_test)
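# Take the square root explicitly; the squared=False shortcut is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))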
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f}% | R²: {r2:.3f}")
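For readers who want to see what the two metrics compute, here is a small worked example with four invented rate values (`y_true_toy`, `y_pred_toy`), checked against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted interest rates, in percent.
y_true_toy = np.array([10.0, 12.0, 15.0, 9.0])
y_pred_toy = np.array([11.0, 11.0, 14.0, 10.0])

residuals = y_true_toy - y_pred_toy

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: one minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The manual values agree with scikit-learn's implementations.
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true_toy, y_pred_toy)))
assert np.isclose(r2_manual, r2_score(y_true_toy, y_pred_toy))

print(f"RMSE: {rmse_manual:.2f} percentage points")  # RMSE: 1.00 percentage points
print(f"R²: {r2_manual:.3f}")                        # R²: 0.810
```

Every prediction here is off by exactly one percentage point, so the RMSE is 1.00, and the residual sum of squares (4) against the total (21) gives R² = 1 − 4/21.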
9. Interpret coefficients
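Non‑zero coefficients point to the levers that matter (for example credit score bands, the loan purpose “debt‑consolidation”, or term length). Features whose coefficients are zeroed contribute nothing to this model’s predictions, which makes them candidates for dropping from data collection to cut costs.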
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
coefs = search.best_estimator_.named_steps['model'].coef_
imp = (pd.Series(coefs, index=feature_names)
.sort_values(key=abs, ascending=False))
plt.figure(figsize=(9,6))
imp.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Loan Pricing (Lasso Coefficients)')
plt.xlabel('Coefficient')
plt.show()
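Alongside the chart, it can help to list explicitly which features were kept and which were zeroed. The helper below is a hypothetical sketch (the function name `summarize_coefficients` and the toy coefficient values are invented); in the notebook it could be called on the `imp` series instead.

```python
import pandas as pd

def summarize_coefficients(coefs: pd.Series, tol: float = 0.0):
    """Split a coefficient series into kept (|coef| > tol) and dropped features."""
    kept = coefs[coefs.abs() > tol].sort_values(key=abs, ascending=False)
    dropped = sorted(coefs.index[coefs.abs() <= tol])
    return kept, dropped

# Invented coefficients, named the way a ColumnTransformer would name them.
toy_coefs = pd.Series({
    "num__credit_score": -1.8,
    "cat__purpose_car": 0.0,
    "num__term": 0.6,
    "num__annual_inc": 0.0,
})

kept, dropped = summarize_coefficients(toy_coefs)
print("Kept (largest effect first):")
print(kept)
print("Zeroed by Lasso:", dropped)
```

Because numeric inputs were z-scaled, each numeric coefficient reads as the change in predicted rate for a one-standard-deviation change in that feature, so magnitudes are comparable across numeric features.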
Summary
Using Lasso regression, we built an interpretable pipeline that predicts consumer‑loan interest rates, reports its hold‑out error in percentage points, and highlights the most influential borrower traits. Such transparency supports compliance with fair-lending regulations and gives pricing teams actionable insights. The finished notebook can be refreshed each quarter with new origination data, letting risk managers track pricing drift and re-tune α without rewriting code.