Customer Lifetime Value Prediction with Lasso Regression in ML
Acquiring new buyers is costly, so modern firms track customer lifetime value (CLV) to decide where to invest in retention or upselling. Yet many analysts still rely on rule-of-thumb formulas that ignore a shopper's demographics, engagement, and purchase behaviour. This project builds a Lasso-regularised linear model that:
- Forecasts each customer’s future monetary contribution (USD) over a three‑year horizon, using variables such as age, policy type, premium, claim history, and interaction frequency.
- Selects the handful of truly predictive traits by shrinking weak coefficients to zero, giving managers a concise set of levers for targeted campaigns.
Lasso's ℓ1 penalty supplies both built-in feature selection and an interpretable model, which matters when finance and CRM teams need transparent, audit-ready insights.
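The selection effect is easy to see on synthetic data. The sketch below is not part of the project code: the data, the variable names, and the choice of alpha=0.5 are all assumptions made for illustration. It compares ordinary least squares with Lasso when only two of ten features actually drive the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (an assumption for illustration): 10 standardised features,
# of which only the first two actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=1.0, size=500)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# OLS keeps a small non-zero weight on every noise feature;
# the l1 penalty sets those weights to exactly zero.
print("Non-zero OLS coefficients:  ", np.count_nonzero(ols.coef_))
print("Non-zero Lasso coefficients:", np.count_nonzero(lasso.coef_))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
```

The two surviving Lasso coefficients are also pulled towards zero relative to the true values of 5 and -3. That shrinkage is the price paid for the sparsity.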
Libraries Required
| Purpose | Library |
| --- | --- |
| Data handling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| ML workflow | scikit‑learn – Lasso, Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, GridSearchCV |
| Evaluation | mean_squared_error, r2_score |
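Dataset Link
Kaggle dataset: shibumohapatra/customer-life-time-value (the same identifier used in the download command in step 2)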
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
2. Download and load the dataset
The dataset contains auto‑insurance customers with demographics, policy details, coverage, premiums, claims, and a precalculated Customer_Lifetime_Value label for supervised learning.
# One‑time download (needs Kaggle API & key):
# kaggle datasets download -d shibumohapatra/customer-life-time-value -p data --unzip
df = pd.read_csv("data/customer_lifetime_value.csv") # adjust name if different
3. Basic EDA
print(df.head())
print(df.isna().mean().sort_values(ascending=False).head(10)) # missing‑value audit
sns.histplot(df['Customer_Lifetime_Value'], kde=True)
plt.title('CLV distribution')
plt.show()
4. Define target & features
y = df['Customer_Lifetime_Value']  # target in USD
X = df.drop(columns=['Customer_Lifetime_Value', 'Customer_ID'])  # drop the target (leakage) and IDs
5. Pre‑processing recipe
OneHotEncoder converts categorical fields (e.g., Gender, Coverage, Vehicle_Class) into dummy variables, dropping the first level to avoid multicollinearity. Numeric columns (e.g., Annual_Premium, Months_Since_Inception, Number_of_Open_Complaints) are z‑scaled so the Lasso penalty treats each on equal footing.
cat_cols = X.select_dtypes('object').columns
num_cols = X.select_dtypes(exclude='object').columns
# sparse_output requires scikit-learn >= 1.2; older releases use sparse=False instead
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
    ('num', StandardScaler(), num_cols)
])
6. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=df['Policy_Type'])
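The stratify argument keeps the mix of policy types the same in both partitions. A minimal sketch with a hypothetical 70/20/10 policy mix (the labels and proportions are assumptions, not taken from the dataset) shows the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical policy mix: 70% Personal, 20% Corporate, 10% Special.
policy = pd.Series(["Personal"] * 70 + ["Corporate"] * 20 + ["Special"] * 10,
                   name="Policy_Type")
rows = pd.DataFrame({"row_id": range(100)})

_, test_rows, _, test_policy = train_test_split(
    rows, policy, test_size=0.2, random_state=42, stratify=policy)

# Each policy type keeps its share in the 20-row test set.
test_counts = test_policy.value_counts().to_dict()
print(test_counts)
```

Without stratification, a rare policy type could end up under-represented in the test set purely by chance, which would make the hold-out metrics less representative.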
7. Build & tune Lasso pipeline
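Wrapping preprocessing and modelling in one Pipeline protects against data leakage: within each cross-validation fold, the encoder and scaler are fitted on the training portion only. The log-spaced α sweep (25 values from 0.001 to 10) searches for the best bias-variance trade-off, and five-fold CV makes the choice less sensitive to any single split.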
pipe = Pipeline([
    ('prep', preprocess),
    ('model', Lasso(max_iter=10_000, random_state=42))
])
param_grid = {'model__alpha': np.logspace(-3, 1, 25)}  # 0.001 → 10
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error', n_jobs=-1)
search.fit(X_train, y_train)
print("Best α:", search.best_params_['model__alpha'])
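Beyond the single best value, it can help to look at how the cross-validated error moves across the whole grid. The sketch below uses a synthetic regression problem as a stand-in for the CLV data (an assumption, as are the names X_demo, demo_search, and cv_table) and tabulates the mean RMSE per α from cv_results_.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the CLV data (assumption): 10 features, 3 informative.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=10.0, random_state=0)

alphas = np.logspace(-3, 1, 9)
demo_search = GridSearchCV(Lasso(max_iter=10_000), {"alpha": alphas}, cv=5,
                           scoring="neg_root_mean_squared_error")
demo_search.fit(X_demo, y_demo)

# Scores are negated RMSE, so flip the sign to read them as errors.
cv_table = pd.DataFrame({
    "alpha": alphas,
    "mean_rmse": -demo_search.cv_results_["mean_test_score"],
    "std_rmse": demo_search.cv_results_["std_test_score"],
})
print(cv_table.round(3))
print("Best alpha:", demo_search.best_params_["alpha"])
```

If the best α sits at either end of the grid, that is usually a sign the sweep should be widened in that direction before trusting the result.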
8. Evaluate on the hold‑out set
RMSE summarises the typical dollar error in CLV prediction (large misses count more heavily than in a plain average), while R² expresses the share of variance in CLV that the model captures.
y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE; avoids the deprecated squared=False argument
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
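For readers who want to see what the two metrics actually compute, here is a worked example on four made-up CLV values (the numbers are hypothetical and chosen only to keep the arithmetic simple). It derives RMSE and R² by hand and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical CLV values in USD, chosen only to make the arithmetic easy to follow.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

residuals = y_true - y_hat
rmse_manual = np.sqrt(np.mean(residuals ** 2))      # sqrt(1000 / 4)
ss_res = np.sum(residuals ** 2)                     # 1000
ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # 50000
r2_manual = 1.0 - ss_res / ss_tot                   # 0.98

# The library functions should agree with the manual arithmetic.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)
print(f"RMSE: {rmse_manual:.2f} | R²: {r2_manual:.2f}")  # RMSE: 15.81 | R²: 0.98
```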
9. Interpret feature importance
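Non-zero coefficients that survive Lasso's penalty point to the high-impact levers; in this dataset, candidates might include the "Premium" coverage level, Annual_Premium, or Months_Since_Last_Claim. Because numeric inputs were z-scaled, each numeric coefficient is the USD change per one standard deviation of that feature, and each dummy coefficient is the USD difference from the dropped baseline level. Coefficients at exactly zero signal negligible drivers, helping teams trim data-collection overhead.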
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])
coef = search.best_estimator_.named_steps['model'].coef_
importance = (pd.Series(coef, index=feature_names)
              .sort_values(key=abs, ascending=False))
plt.figure(figsize=(9,6))
importance.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of CLV (Lasso Coefficients)')
plt.xlabel('Coefficient (Δ USD)')
plt.show()
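Because the numeric features are standardised before fitting, their coefficients are expressed per standard deviation rather than per raw unit. One way to translate them back, sketched here on synthetic data with hypothetical premium and tenure columns, is to divide each coefficient by the scale_ value that StandardScaler learned for that feature.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): CLV rises by 5 USD per premium dollar
# and by 20 USD per month of tenure.
rng = np.random.default_rng(0)
premium = rng.normal(loc=100.0, scale=30.0, size=300)
months = rng.integers(0, 60, size=300).astype(float)
X_num = np.column_stack([premium, months])
clv = 5.0 * premium + 20.0 * months + rng.normal(scale=10.0, size=300)

demo_pipe = Pipeline([("scale", StandardScaler()), ("model", Lasso(alpha=0.1))])
demo_pipe.fit(X_num, clv)

coef_per_sd = demo_pipe.named_steps["model"].coef_  # USD per 1 standard deviation
coef_per_unit = coef_per_sd / demo_pipe.named_steps["scale"].scale_  # USD per raw unit
print("Per SD:  ", np.round(coef_per_sd, 1))
print("Per unit:", np.round(coef_per_unit, 2))
```

The per-SD view is the better one for ranking drivers against each other, while the per-unit view is easier to explain to a business audience ("each extra premium dollar is worth roughly this much CLV").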
Summary
This notebook demonstrates how a Lasso-based pipeline can turn raw customer and policy data into a transparent, dollar-level CLV forecast and a ranked list of value drivers. Marketing and retention teams can refresh the model quarterly with new cohorts, check whether strategy shifts are raising lifetime value, and focus budgets on the attributes that pay off, turning CLV from a rear-view metric into a forward-looking one.