Customer Churn Cost Prediction with Lasso Regression in ML
Telecom providers know who is likely to churn, but the monetary impact of each departing customer often remains vague. Without a dollar tag, retention budgets cannot be prioritised efficiently. This project builds a Lasso‑regularised linear model that:
- Forecasts the potential revenue loss (USD) if a current subscriber churns tomorrow, using demographic, service‑usage, and billing attributes available today.
- Zeroes out weak predictors via Lasso’s ℓ₁ penalty, surfacing the handful of levers that most influence churn cost and deserve proactive incentives.
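For reference, the objective minimised by scikit-learn's Lasso can be written as below. The notation is chosen here for illustration and is not from the original text: n is the number of training rows, X the preprocessed feature matrix, w the coefficient vector, and α the regularisation strength tuned in step 7 (the intercept is omitted for brevity).

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\lVert w \rVert_1
```

The ℓ₁ term grows linearly with each coefficient's magnitude, which is why a large enough α drives weak coefficients to exactly zero rather than merely shrinking them.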
Libraries Required
| Purpose | Library |
| --- | --- |
| Data wrangling | pandas, numpy |
| Visualization | matplotlib, seaborn |
| ML workflow | scikit-learn: ColumnTransformer, OneHotEncoder, StandardScaler, Pipeline, Lasso, train_test_split, GridSearchCV |
| Metrics | scikit-learn: mean_squared_error, r2_score |
Dataset Link
Telco Customer Churn on Kaggle (dataset slug: blastchar/telco-customer-churn)
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
2. Download and load the dataset
The dataset covers 7,043 telecom subscribers, with demographics, service bundles, billing info, and a churn flag.
# one‑time shell command (Kaggle API key required):
# kaggle datasets download -d blastchar/telco-customer-churn -p data --unzip
# The Kaggle archive unpacks to WA_Fn-UseC_-Telco-Customer-Churn.csv
df = pd.read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")  # 7,043 rows, 21 columns
3. Engineer the ‘churn cost’ target
Assumption: If a subscriber quits today, the operator loses 20% of their monthly bill for every month remaining in a typical three-year lifetime (36 months).
AVG_LIFETIME = 36      # months
INCENTIVE_RATE = 0.20  # 20 % of monthly charges

# Clean tenure (months already served)
df['tenure'] = pd.to_numeric(df['tenure'], errors='coerce').fillna(0).astype(int)
df['remaining_months'] = (AVG_LIFETIME - df['tenure']).clip(lower=0)

# Monetary impact *if* customer churns now
df['churn_cost'] = df['MonthlyCharges'] * INCENTIVE_RATE * df['remaining_months']
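As a quick check of the arithmetic, the same rule can be applied to a single subscriber. The function name `churn_cost` and the sample figures are hypothetical; the constants mirror the assumptions stated above. A subscriber paying $70 a month with 12 months served has 24 months remaining, so the cost is 70 × 0.20 × 24 = $336.

```python
AVG_LIFETIME = 36      # months, same assumption as above
INCENTIVE_RATE = 0.20  # share of the monthly bill counted as lost


def churn_cost(monthly_charges: float, tenure_months: int) -> float:
    """Revenue at risk (USD) for one subscriber under the stated assumptions."""
    remaining = max(AVG_LIFETIME - tenure_months, 0)  # never negative
    return monthly_charges * INCENTIVE_RATE * remaining


print(churn_cost(70.0, 12))  # 70 * 0.20 * 24 remaining months
print(churn_cost(70.0, 48))  # tenure beyond 36 months, so nothing remains
```

Note that any subscriber with 36 or more months of tenure gets a cost of zero under this rule, which is worth keeping in mind when reading the model's output.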
4. Define features and target
churn_cost estimates revenue at risk if a customer leaves today: MonthlyCharges × INCENTIVE_RATE × remaining_months. Rate (20%) and lifecycle length (36 months) are tunable business assumptions.
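# TotalCharges is read as text because a few zero-tenure rows are blank; make it numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce').fillna(0)
y = df['churn_cost']  # continuous USD value
X = df.drop(columns=['churn_cost', 'customerID'])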
5. Pre‑processing recipe
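OneHotEncoder converts categorical variables into indicator columns, while StandardScaler centres and scales the numerics. Running both inside a Pipeline means they are fitted on the training folds only, which prevents validation data from leaking into the preprocessing during cross-validation.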
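cat_cols = X.select_dtypes('object').columns  # e.g. gender, Contract
num_cols = X.select_dtypes(exclude='object').columns  # tenure, charges …
preprocess = ColumnTransformer([
    # sparse_output replaces the old sparse argument (scikit-learn >= 1.2);
    # handle_unknown='ignore' avoids errors on categories unseen during fit
    ('cat', OneHotEncoder(drop='first', sparse_output=False,
                          handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])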
6. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=df['Churn'])
7. Build & tune Lasso pipeline
A log-spaced α search (0.001 to 10) balances sparsity against prediction error; five-fold cross-validation picks the α with the lowest RMSE.
pipe = Pipeline([
    ('prep', preprocess),
    ('model', Lasso(max_iter=10_000, random_state=42))
])
param_grid = {'model__alpha': np.logspace(-3, 1, 25)}  # 0.001 → 10
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error',
                      n_jobs=-1)
search.fit(X_train, y_train)
print("Optimal α:", search.best_params_['model__alpha'])
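The sparsity side of that trade-off can be illustrated on synthetic data. The sketch below is not part of the churn pipeline: it assumes a made-up regression problem from `make_regression` in which only 5 of 20 features carry signal, and counts how many coefficients survive as α grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which truly drive the target
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=0)

alphas = [0.01, 1.0, 10.0, 1000.0]
nonzero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_demo, y_demo)
    nonzero_counts.append(int(np.sum(model.coef_ != 0)))

for alpha, count in zip(alphas, nonzero_counts):
    print(f"alpha={alpha:>7}: {count} non-zero coefficients")
```

A very small α keeps nearly every feature, while a very large one zeroes them all; the grid search looks for the point in between that predicts best on held-out folds.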
8. Evaluate on the hold‑out set
RMSE expresses the typical prediction error in dollars; R² shows the share of variance explained.
y_pred = search.predict(X_test)
# squared=False was removed in scikit-learn 1.6; taking the root works on any version
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
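Both metrics are easy to verify by hand. The sketch below uses three hypothetical customers with invented true and predicted costs and computes RMSE and R² directly from their definitions.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])  # hypothetical churn costs (USD)
y_hat = np.array([110.0, 190.0, 320.0])   # hypothetical predictions (USD)

residuals = y_true - y_hat

# RMSE: square the errors, average them, take the root
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (unexplained variation / total variation around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.2f} | R²: {r2_manual:.3f}")  # RMSE: 14.14 | R²: 0.970
```

Because the errors are squared before averaging, the single $20 miss weighs more than the two $10 misses combined, which is why RMSE is sensitive to large individual errors.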
9. Interpret feature importance
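Non-zero coefficients highlight high-impact levers, e.g. a one-year or two-year contract (measured against the dropped month-to-month baseline) or high monthly charges. A zero coefficient means the feature adds no linear signal once the others are accounted for at the chosen α; it does not prove the feature is unrelated to lost revenue.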
# Retrieve one‑hot column names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
coefs = search.best_estimator_.named_steps['model'].coef_
imp = (pd.Series(coefs, index=feature_names)
       .sort_values(key=abs, ascending=False))
plt.figure(figsize=(9,6))
imp.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Churn Cost (Lasso Coefficients)')
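# Numeric features are standardised, so their coefficients are USD per standard deviation
plt.xlabel('Coefficient (Δ USD per SD, or vs. baseline category)')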
plt.show()
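Since the numeric features pass through StandardScaler, their coefficients are in dollars per standard deviation, not dollars per raw unit. A minimal sketch of converting back, on a hypothetical one-feature problem where the true effect is exactly 3 USD per unit (all names and values here are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=2.0, size=(500, 1))  # hypothetical numeric feature
y_toy = 3.0 * x[:, 0] + 5.0                          # exactly 3 USD per unit

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(alpha=0.001)),
]).fit(x, y_toy)

coef_per_sd = toy_pipe.named_steps["model"].coef_[0]  # USD per standard deviation
sd = toy_pipe.named_steps["scale"].scale_[0]          # the feature's fitted SD
coef_per_unit = coef_per_sd / sd                      # USD per raw unit
print(f"per SD: {coef_per_sd:.2f} | per unit: {coef_per_unit:.2f}")
```

The same division by the fitted `scale_` could be applied to the numeric entries of the churn model's coefficients when a per-dollar or per-month reading is easier to communicate than a per-SD one.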
Summary
This Lasso-based pipeline converts raw telecom data into a dollar-level forecast of churn impact and a ranked list of cost drivers. Retention teams can:
- Prioritise expensive‑to‑lose customers for proactive offers.
- Budget incentives based on expected ROI, not guesswork.
- Refresh the model quarterly: thanks to the all-in-one Pipeline, a new fit is just one line of code.
By tying churn directly to money, the organisation moves from generic “save them all” tactics to precision‑guided retention economics.