Customer Lifetime Value Prediction with Lasso Regression in ML


Acquiring new buyers is costly—so modern firms obsess over customer‑lifetime value (CLV) to decide where to double‑down on retention or upsell. Yet many analysts still rely on rule‑of‑thumb formulas that ignore a shopper’s demographic, engagement, and purchase behaviour. This project builds a Lasso‑regularised linear model that:

  • Forecasts each customer’s future monetary contribution (USD) over a three‑year horizon, using variables such as age, policy type, premium, claim history, and interaction frequency.
  • Selects the handful of truly predictive traits by shrinking weak coefficients to zero—delivering a concise, manager‑friendly set of levers for targeted campaigns.

Lasso’s ℓ1 penalty supplies both built‑in feature selection and an interpretable model—essential when finance and CRM teams need transparent, audit‑ready insights.
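
To make the feature-selection claim concrete, here is a minimal sketch on synthetic data (not the insurance dataset). The names X_demo, y_demo and lasso_demo, the coefficient values 300 and -150, and alpha=5.0 are all assumptions chosen for illustration. Only two of the six columns drive the target, and the ℓ1 penalty should set the other four coefficients to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 500

# Six synthetic features; only the first two influence the target (illustrative assumption)
X_demo = rng.normal(size=(n_samples, 6))
y_demo = 300 * X_demo[:, 0] - 150 * X_demo[:, 1] + rng.normal(scale=10, size=n_samples)

# Scale first so the penalty treats every column alike
X_scaled = StandardScaler().fit_transform(X_demo)
lasso_demo = Lasso(alpha=5.0).fit(X_scaled, y_demo)

n_nonzero = int(np.sum(lasso_demo.coef_ != 0))
print("coefficients:", np.round(lasso_demo.coef_, 1))
print("non-zero coefficients:", n_nonzero)
```

The surviving coefficients come out slightly smaller in magnitude than the true 300 and -150, because the same penalty that removes weak features also shrinks the strong ones.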

Libraries Required

Purpose | Library
Data handling | pandas, numpy
Visualisation | matplotlib, seaborn
ML workflow | scikit-learn: Lasso, Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, GridSearchCV
Evaluation | scikit-learn metrics: mean_squared_error, r2_score

Dataset Link

Customer Life Time Value (Kaggle dataset: shibumohapatra/customer-life-time-value)

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

2. Download and load the dataset

The dataset contains auto‑insurance customers with demographics, policy details, coverage, premiums, claims, and a precalculated Customer_Lifetime_Value label for supervised learning.

# One‑time download (needs Kaggle API & key):
# kaggle datasets download -d shibumohapatra/customer-life-time-value -p data --unzip

df = pd.read_csv("data/customer_lifetime_value.csv")   # adjust name if different

3. Basic EDA

print(df.head())
print(df.isna().mean().sort_values(ascending=False).head(10))    # missing‑value audit
sns.histplot(df['Customer_Lifetime_Value'], kde=True); plt.title('CLV distribution'); plt.show()

4. Define target & features

y = df['Customer_Lifetime_Value']          # target in USD
X = df.drop(columns=['Customer_Lifetime_Value', 'Customer_ID'])  # drop leakage / IDs

5. Pre‑processing recipe

OneHotEncoder converts categorical fields (e.g., Gender, Coverage, Vehicle_Class) into dummy variables, dropping the first level to avoid multicollinearity. Numeric columns (e.g., Annual_Premium, Months_Since_Inception, Number_of_Open_Complaints) are z‑scaled so the Lasso penalty treats each on equal footing.

cat_cols = X.select_dtypes('object').columns
num_cols = X.select_dtypes(exclude='object').columns

preprocess = ColumnTransformer([
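    # sparse_output replaces the old `sparse` argument (scikit-learn >= 1.2; use sparse=False on older versions)
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),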
    ('num', StandardScaler(), num_cols)
])

6. Train/test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=df['Policy_Type'])

7. Build & tune Lasso pipeline

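Wrapping preprocessing and modelling in one Pipeline guards against data leakage: the encoder and scaler are re-fitted on the training portion of every CV fold, so validation rows never influence them. The log-spaced α sweep (25 values from 0.001 to 10) searches for the best bias-variance trade-off, and five-fold CV stabilises the choice.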

pipe = Pipeline([
    ('prep', preprocess),
    ('model', Lasso(max_iter=10_000, random_state=42))
])

param_grid = {'model__alpha': np.logspace(-3, 1, 25)}   # 0.001 → 10
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error', n_jobs=-1)
search.fit(X_train, y_train)

print("Best α:", search.best_params_['model__alpha'])
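
Beyond the single best α, it can help to look at the whole cross-validation curve. The sketch below does so on synthetic data from make_regression, so it runs without the CSV. The names X_syn, y_syn, syn_pipe and syn_search and the 9-point grid are assumptions for illustration. With the real pipeline, the same fields can be read from `search.cv_results_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer table (illustrative only)
X_syn, y_syn = make_regression(n_samples=300, n_features=10, n_informative=3,
                               noise=20.0, random_state=0)

syn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(max_iter=10_000)),
])
alphas = np.logspace(-3, 1, 9)
syn_search = GridSearchCV(syn_pipe, {"model__alpha": alphas}, cv=5,
                          scoring="neg_root_mean_squared_error")
syn_search.fit(X_syn, y_syn)

# Scores are negated RMSE, so flip the sign to read them as errors
cv_rmse = -syn_search.cv_results_["mean_test_score"]
for alpha, rmse in zip(alphas, cv_rmse):
    print(f"alpha={alpha:8.3f}  CV RMSE={rmse:8.2f}")
print("best alpha:", syn_search.best_params_["model__alpha"])
```

If the best α lands on the largest or smallest value in the grid, the optimum may lie outside it, and widening the range is worth a try. That is plausible here, since CLV is measured in dollars and a penalty of 10 may be small relative to the coefficients.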

8. Evaluate on the hold‑out set

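RMSE is the root-mean-square dollar error of the CLV predictions; because errors are squared before averaging, large misses weigh more than they would in a plain average error. R² expresses the share of variance in CLV that the model captures.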

y_pred = search.predict(X_test)
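# Take the square root explicitly; the squared=False argument was deprecated in scikit-learn 1.4 and later removed
rmse = np.sqrt(mean_squared_error(y_test, y_pred))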
r2   = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
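
As a worked example of the two metrics, the sketch below computes them by hand for four made-up customers and checks the result against scikit-learn. The arrays y_true and y_hat are invented numbers, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted CLV values in USD
y_true = np.array([4000.0, 6000.0, 8000.0, 10000.0])
y_hat = np.array([4500.0, 5500.0, 9000.0, 9000.0])

errors = y_true - y_hat
rmse_manual = np.sqrt(np.mean(errors ** 2))      # root of the mean squared error
mae = np.mean(np.abs(errors))                    # plain average absolute error, for contrast

ss_res = np.sum(errors ** 2)                     # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.2f} | MAE: {mae:.2f} | R2: {r2_manual:.3f}")

# The manual numbers agree with scikit-learn's implementations
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
```

RMSE exceeds the average absolute error here because the two 1,000-dollar misses dominate once the errors are squared.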

9. Interpret feature importance

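Non-zero coefficients that survive Lasso's penalty point to high-impact levers, e.g., Coverage type “Premium”, Annual_Premium, Months_Since_Last_Claim. Because numeric columns were z-scaled, each numeric coefficient is the change in predicted CLV (USD) per one standard deviation of that feature, while each dummy coefficient is the difference from the dropped reference category. Coefficients at exactly zero mark features the model found negligible at the chosen α, which can help teams trim data-collection overhead.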

ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])

coef = search.best_estimator_.named_steps['model'].coef_
importance = (pd.Series(coef, index=feature_names)
                .sort_values(key=abs, ascending=False))

plt.figure(figsize=(9,6))
importance.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of CLV (Lasso Coefficients)')
plt.xlabel('Coefficient (Δ USD per 1 SD for numeric features, vs. reference level for dummies)')
plt.show()
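
Alongside the chart, a plain list of which features survived and which were zeroed is often easier to hand to a non-technical team. The helper below is a hypothetical sketch: `summarise_lasso_coefficients` is not part of scikit-learn, and the demo coefficients and feature names are invented. With the fitted model, `coef` and `feature_names` from the code above could be passed in instead.

```python
import numpy as np
import pandas as pd

def summarise_lasso_coefficients(coef, feature_names, tol=1e-8):
    """Split coefficients into kept (non-zero, ranked by magnitude) and dropped (zero)."""
    series = pd.Series(coef, index=feature_names)
    is_kept = series.abs() > tol
    kept = series[is_kept].sort_values(key=lambda s: s.abs(), ascending=False)
    dropped = series.index[~is_kept].tolist()
    return kept, dropped

# Invented values for illustration only
demo_coef = np.array([0.0, 1250.5, -310.2, 0.0, 42.0])
demo_names = ["Gender_M", "Annual_Premium", "Number_of_Open_Complaints",
              "Vehicle_Class_SUV", "Months_Since_Inception"]

kept, dropped = summarise_lasso_coefficients(demo_coef, demo_names)
print("Kept features:")
print(kept)
print("Dropped features:", dropped)
```

A dropped feature is not necessarily irrelevant: when two predictors are strongly correlated, Lasso tends to keep one and zero the other, so the list is best read as "not needed given the rest".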

Summary

This notebook demonstrates how a Lasso‑based pipeline can transform raw customer and policy data into a transparent, dollar‑level CLV forecast and a ranked list of value drivers. Marketing and retention teams can refresh the model quarterly with new cohorts, instantly see whether strategy shifts boost lifetime value, and focus budgets on the attributes that truly pay off—turning CLV from a rear‑view metric into a forward‑looking compass.


ProjectGurukul Team
