Customer Lifetime Value Prediction with Lasso Regression in ML


Acquiring new buyers is costly, so modern firms track customer-lifetime value (CLV) to decide where to double down on retention or upselling. Yet many analysts still rely on rule-of-thumb formulas that ignore a customer's demographics, engagement, and purchase behaviour. This project builds a Lasso-regularised linear model that:

Forecasts each customer’s future monetary contribution (USD) over a three‑year horizon, using variables such as age, policy type, premium, claim history, and interaction frequency.
Selects the handful of truly predictive traits by shrinking weak coefficients to zero—delivering a concise, manager‑friendly set of levers for targeted campaigns.

Lasso’s ℓ1 penalty supplies both built‑in feature selection and an interpretable model—essential when finance and CRM teams need transparent, audit‑ready insights.
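To make the feature-selection claim concrete, here is a minimal sketch on synthetic data (not the CLV dataset). The array names, the three "true" drivers, and alpha=0.5 are assumptions chosen only for illustration. It compares ordinary least squares with Lasso and counts how many coefficients each sets exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, but only the first 3 actually drive the target.
rng = np.random.default_rng(0)
n_samples, n_features = 500, 10
X_demo = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [5.0, -3.0, 2.0]
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=n_samples)

# Standardise so the L1 penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X_demo)

ols = LinearRegression().fit(X_scaled, y_demo)
lasso = Lasso(alpha=0.5).fit(X_scaled, y_demo)

# OLS keeps a small non-zero weight on every noise feature;
# Lasso sets the weights of the noise features exactly to zero.
print("OLS   zero coefficients:", int(np.sum(ols.coef_ == 0)))
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
```

The surviving Lasso coefficients are also shrunk towards zero compared with the true values, which is the price paid for the sparsity.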

Libraries Required

Purpose	Library
Data handling	pandas, numpy
Visualisation	matplotlib, seaborn
ML workflow	scikit‑learn – Lasso, Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, GridSearchCV
Evaluation	scikit‑learn (sklearn.metrics) – mean_squared_error, r2_score

Dataset Link

Customer Life Time Value

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

2. Download and load the dataset

The dataset contains auto‑insurance customers with demographics, policy details, coverage, premiums, claims, and a precalculated Customer_Lifetime_Value label for supervised learning.

# One‑time download (needs Kaggle API & key):
# kaggle datasets download -d shibumohapatra/customer-life-time-value -p data --unzip

df = pd.read_csv("data/customer_lifetime_value.csv")   # adjust name if different

3. Basic EDA

print(df.head())
print(df.isna().mean().sort_values(ascending=False).head(10))    # missing‑value audit
sns.histplot(df['Customer_Lifetime_Value'], kde=True)
plt.title('CLV distribution')
plt.show()

4. Define target & features

y = df['Customer_Lifetime_Value']          # target in USD
X = df.drop(columns=['Customer_Lifetime_Value', 'Customer_ID'])  # drop leakage / IDs

5. Pre‑processing recipe

OneHotEncoder converts categorical fields (e.g., Gender, Coverage, Vehicle_Class) into dummy variables, dropping the first level to avoid multicollinearity. Numeric columns (e.g., Annual_Premium, Months_Since_Inception, Number_of_Open_Complaints) are z‑scaled so the Lasso penalty treats each on equal footing.

cat_cols = X.select_dtypes('object').columns
num_cols = X.select_dtypes(exclude='object').columns

preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),  # `sparse=` was renamed `sparse_output=` in scikit-learn 1.2
    ('num', StandardScaler(), num_cols)
])

6. Train/test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=df['Policy_Type'])

7. Build & tune Lasso pipeline

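Wrapping preprocessing and modelling in one Pipeline guards against data leakage: in each CV split the encoder and scaler are fitted on the training folds only. The log-spaced α sweep (0.001–10) picks the value with the lowest cross-validated RMSE, and five-fold CV stabilises the choice. If the best α sits on either edge of the grid, widen the range and rerun.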

pipe = Pipeline([
    ('prep', preprocess),
    ('model', Lasso(max_iter=10_000, random_state=42))
])

param_grid = {'model__alpha': np.logspace(-3, 1, 25)}   # 0.001 → 10
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error', n_jobs=-1)
search.fit(X_train, y_train)

print("Best α:", search.best_params_['model__alpha'])
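Beyond the single best α, it can help to look at the whole cross-validation curve. The sketch below is self-contained and runs on synthetic data from `make_regression`. The names `demo_search`, `cv_table`, and `at_edge` are hypothetical, and the grid is shortened to 9 values to keep it quick. It tabulates the CV RMSE per α and flags whether the winner sits on the edge of the grid.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CLV data: 20 features, 5 of them informative.
X_syn, y_syn = make_regression(n_samples=300, n_features=20, n_informative=5,
                               noise=25.0, random_state=0)

alphas = np.logspace(-3, 1, 9)
demo_search = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("model", Lasso(max_iter=10_000))]),
    {"model__alpha": alphas},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_search.fit(X_syn, y_syn)

# Scores are negated RMSE, so flip the sign to read them as errors.
cv_table = (pd.DataFrame(demo_search.cv_results_)
            [["param_model__alpha", "mean_test_score", "std_test_score"]]
            .assign(cv_rmse=lambda d: -d["mean_test_score"])
            .sort_values("cv_rmse"))
print(cv_table.head())

# A winner on the boundary suggests the grid should be extended.
best_alpha = demo_search.best_params_["model__alpha"]
at_edge = bool(np.isclose(best_alpha, alphas.min()) or np.isclose(best_alpha, alphas.max()))
print(f"Best alpha: {best_alpha:.4g} | on grid edge: {at_edge}")
```

The same table can be built from the `search` object fitted above by swapping it in for `demo_search`.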

8. Evaluate on the hold‑out set

RMSE summarises the typical dollar error in CLV prediction (it weights large misses more heavily than a plain average of absolute errors would), while R² expresses the share of variance captured.

y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # `squared=False` is deprecated since scikit-learn 1.4
r2   = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
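To see how these metrics behave, here is a small worked example on four made-up predictions (the numbers are illustrative, not model output). Three predictions miss by $100 and one misses by $1,000, which shows how a single large error pulls RMSE well above the mean absolute error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values only: three small misses and one large one.
y_true_demo = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_hat_demo = np.array([1100.0, 1900.0, 3100.0, 3000.0])

mae_demo = mean_absolute_error(y_true_demo, y_hat_demo)            # (100+100+100+1000)/4
rmse_demo = np.sqrt(mean_squared_error(y_true_demo, y_hat_demo))   # sqrt(257500)
r2_demo = r2_score(y_true_demo, y_hat_demo)                        # 1 - 1.03e6/5e6

print(f"MAE: ${mae_demo:,.0f} | RMSE: ${rmse_demo:,.0f} | R²: {r2_demo:.3f}")
# MAE: $325 | RMSE: $507 | R²: 0.794
```

Reporting MAE next to RMSE on the hold-out set is one cheap way to tell whether the error is spread evenly or dominated by a few high-value customers.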

9. Interpret feature importance

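Non-zero coefficients that survive Lasso's penalty mark the features the model actually uses, e.g., Coverage type "Premium", Annual_Premium, Months_Since_Last_Claim. Because numeric columns are standardised, each numeric coefficient is the change in predicted CLV (USD) per one-standard-deviation increase, and each dummy coefficient is the difference from the dropped baseline level. Coefficients at zero signal features that add little once the others are accounted for, helping teams trim data-collection overhead.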

ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])

coef = search.best_estimator_.named_steps['model'].coef_
importance = (pd.Series(coef, index=feature_names)
                .sort_values(key=abs, ascending=False))

plt.figure(figsize=(9,6))
importance.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of CLV (Lasso Coefficients)')
plt.xlabel('Coefficient (Δ USD per 1 SD, or vs. baseline level)')
plt.show()
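The chart shows the largest coefficients, but the list of features Lasso dropped is often just as useful. The helper below is a hypothetical convenience function (not part of scikit-learn or pandas) that splits a coefficient Series into kept and dropped features. The demo coefficients are made up for illustration.

```python
import pandas as pd

def split_by_selection(coefs: pd.Series):
    """Return (kept, dropped): kept coefficients sorted by absolute size,
    and the names of features whose coefficient is exactly zero."""
    kept = coefs[coefs != 0]
    kept = kept.reindex(kept.abs().sort_values(ascending=False).index)
    dropped = coefs.index[coefs == 0].tolist()
    return kept, dropped

# Made-up coefficients, standing in for the `importance` Series built above.
demo_coefs = pd.Series({
    "feature_a": 120.5,
    "feature_b": 0.0,
    "feature_c": -310.0,
    "feature_d": 0.0,
    "feature_e": 15.2,
})

kept, dropped = split_by_selection(demo_coefs)
print(f"Kept {len(kept)} of {len(demo_coefs)} features")
print(kept)
print("Dropped:", dropped)
```

With the fitted pipeline, calling `split_by_selection(importance)` gives the same breakdown for the real model.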

Summary

This notebook demonstrates how a Lasso‑based pipeline can transform raw customer and policy data into a transparent, dollar‑level CLV forecast and a ranked list of value drivers. Marketing and retention teams can refresh the model quarterly with new cohorts, instantly see whether strategy shifts boost lifetime value, and focus budgets on the attributes that truly pay off—turning CLV from a rear‑view metric into a forward‑looking compass.
