Customer Engagement Cost Prediction with Lasso Regression in ML
Digital‑marketing teams track likes, clicks, and video views, but they rarely know in advance how much each meaningful interaction will cost once the campaign launches. We aim to build a Lasso‑regularised linear model that:
- Predicts the engagement cost (USD spent per meaningful interaction) for a planned ad set, using only features available at planning time: channel, audience age-band, device, bid strategy, planned impressions, etc. Clicks and spend define the target, so they are not used as inputs.
- Selects the few variables that truly drive that cost, because Lasso's ℓ1 penalty shrinks the coefficients of weak predictors to exactly zero and keeps the model transparent for media planners.
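To make the selection effect concrete, here is a minimal sketch on synthetic data (not the campaign file). The six random columns, the true coefficients, and alpha=0.3 are all assumptions chosen for illustration; only the first two columns actually influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
X_syn = rng.normal(size=(n, 6))
# Only the first two columns influence the target; the other four are pure noise.
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=n)

# Standardise so the penalty treats every column on the same scale.
X_syn_scaled = StandardScaler().fit_transform(X_syn)
lasso_demo = Lasso(alpha=0.3).fit(X_syn_scaled, y_syn)

coef_demo = lasso_demo.coef_
n_zero = int(np.sum(coef_demo == 0))
print("coefficients:", np.round(coef_demo, 2))
print("coefficients set exactly to zero:", n_zero)
```

The two informative columns keep non-zero (slightly shrunken) coefficients, while the noise columns are dropped. A larger alpha shrinks harder; a smaller one keeps more features.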
Libraries Required
| Purpose | Library |
| --- | --- |
| Data handling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| ML workflow | scikit-learn: ColumnTransformer, OneHotEncoder, StandardScaler, train_test_split, Pipeline, Lasso, GridSearchCV |
| Evaluation | mean_squared_error, r2_score |
Dataset Link
Predict Conversion in Digital Marketing
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
2. Download and load the dataset
Dataset — the file logs campaign‑level metrics: demographics, impressions, clicks, conversions, total spend.
# one‑time shell command (Kaggle API token required):
# kaggle datasets download -d rabieelkharoua/predict-conversion-in-digital-marketing-dataset -p data --unzip
ads = pd.read_csv("data/marketing_campaign.csv") # adjust name if necessary
3. Target engineering — Engagement Cost
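Assume Total_Investment (USD) and Clicks are columns in the file; rename them if your copy uses different headers.
Engagement_Cost = Total_Investment / Clicks, i.e. dollars spent per click, with a click standing in for a meaningful interaction. It is a single dollar figure that both finance and media teams can read directly.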
ads = ads.query("Clicks > 0").copy() # avoid division by zero; copy() so the new column is added to an independent frame
ads['Engagement_Cost'] = ads['Total_Investment'] / ads['Clicks'] # $ / click
y = ads['Engagement_Cost']
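As a quick check of the arithmetic, the sketch below applies the same two steps to three made-up rows. The values are hypothetical; the column names simply follow the assumption stated above.

```python
import pandas as pd

# Toy rows with hypothetical values; column names follow the article's assumption.
toy = pd.DataFrame({
    "Total_Investment": [120.0, 80.0, 45.0],
    "Clicks": [60, 0, 30],
})

# Drop zero-click rows first; dividing by zero would otherwise produce inf.
toy = toy[toy["Clicks"] > 0].copy()
toy["Engagement_Cost"] = toy["Total_Investment"] / toy["Clicks"]
print(toy)
```

The zero-click row disappears, and the remaining rows cost 120 / 60 = 2.00 and 45 / 30 = 1.50 dollars per click.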
4. Feature matrix (X)
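# Keep planning-time columns only (channel, age, gender, impressions, etc.):
# drop the target, the two columns it was derived from, and the row identifier.
# Also drop any other post-launch outcome columns (such as conversions) that the file contains.
X = ads.drop(columns=['Engagement_Cost', 'Clicks', 'Total_Investment', 'ad_id'])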
5. Pre‑processing recipe
Categorical attributes (e.g., channel, age‑band, device) become one‑hot dummies; numeric columns (impressions, reach) are z‑scaled so Lasso’s penalty treats each feature fairly.
cat_cols = X.select_dtypes('object').columns
num_cols = X.select_dtypes(exclude='object').columns
preprocess = ColumnTransformer([
    # sparse_output replaces the older sparse argument from scikit-learn 1.2 onward
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
    ('num', StandardScaler(), num_cols)
])
6. Train/test split
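# Stratifying by campaign keeps every campaign represented in both splits;
# it requires each Campaign_Name value to appear in at least two rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=ads['Campaign_Name'])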
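If the campaign column has values that occur only once, or is missing from your copy of the file, one alternative is to stratify on quantile bins of the target instead. This is a swapped-in technique, not the split used in this tutorial; the synthetic data and the choice of five bins are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
# Synthetic, right-skewed costs and one hypothetical feature.
y_bins_demo = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=200))
X_bins_demo = pd.DataFrame({"impressions": rng.integers(1_000, 50_000, size=200)})

# Five equal-sized bins of the target act as stratification labels.
cost_bins = pd.qcut(y_bins_demo, q=5, labels=False)

Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_bins_demo, y_bins_demo, test_size=0.2, random_state=42, stratify=cost_bins)

# Each cost quintile should be about equally represented in the test set.
test_bin_counts = cost_bins.loc[yb_test.index].value_counts().sort_index()
print(test_bin_counts)
```

This keeps cheap and expensive ad sets in both splits, which helps when the cost distribution is skewed.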
7. Build & tune Lasso pipeline
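Pipeline & CV: wrapping the preprocessing and the model in one Pipeline means the encoder and scaler are refit on the training part of every fold, which prevents data leakage. A log-spaced sweep of 25 α values (0.001–10) with 5-fold cross-validation finds the best trade-off between fit and sparsity.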
pipe = Pipeline([
    ('prep', preprocess),
    ('model', Lasso(max_iter=10_000, random_state=42))
])
param_grid = {'model__alpha': np.logspace(-3, 1, 25)} # 0.001 → 10
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error', n_jobs=-1)
search.fit(X_train, y_train)
print("Best α:", search.best_params_['model__alpha'])
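Beyond the single best α, it can help to look at the whole cross-validation curve. The sketch below runs the same kind of search on synthetic data (an assumption, so that it runs on its own) and tabulates the mean CV RMSE per α from `cv_results_`. The same few lines should work on the fitted `search` object above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_cv_demo = rng.normal(size=(300, 5))
y_cv_demo = 2.0 * X_cv_demo[:, 0] + rng.normal(scale=0.5, size=300)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(max_iter=10_000)),
])
demo_search = GridSearchCV(
    demo_pipe,
    {"model__alpha": np.logspace(-3, 1, 9)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_search.fit(X_cv_demo, y_cv_demo)

# Scores are negated RMSE, so flip the sign to get RMSE back.
cv_table = (
    pd.DataFrame(demo_search.cv_results_)
    .assign(alpha=lambda d: d["param_model__alpha"].astype(float),
            cv_rmse=lambda d: -d["mean_test_score"])
    [["alpha", "cv_rmse", "std_test_score"]]
    .sort_values("alpha")
)
print(cv_table.round(3))
```

A flat region around the minimum suggests that a somewhat larger α, and therefore a sparser model, would cost little accuracy.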
8. Evaluate on the hold‑out set
RMSE expresses the typical prediction error in dollars, while R² indicates the share of variance explained.
y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred)) # the squared=False argument is deprecated in recent scikit-learn releases
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.2f} per engagement | R²: {r2:.3f}")
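An RMSE figure is easier to judge next to a naive baseline. The sketch below compares a model's predictions with a predictor that always returns the training mean, using scikit-learn's `DummyRegressor`. All the numbers are made up for illustration; in the tutorial you would pass `y_train`, `y_test` and `y_pred` instead.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

def rmse_score(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Hypothetical costs and model predictions (USD per click).
y_train_demo = np.array([1.0, 2.0, 3.0, 2.0])
y_test_demo = np.array([1.5, 2.5, 2.0])
model_pred_demo = np.array([1.4, 2.3, 2.1])

# The baseline ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(np.zeros((4, 1)), y_train_demo)
baseline_pred = baseline.predict(np.zeros((3, 1)))

baseline_rmse = rmse_score(y_test_demo, baseline_pred)
model_rmse = rmse_score(y_test_demo, model_pred_demo)
print(f"Baseline RMSE: ${baseline_rmse:.3f} | Model RMSE: ${model_rmse:.3f}")
```

If the Lasso model does not clearly beat this baseline on the hold-out set, the planning-time features carry little signal about cost.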
9. Interpret feature importance
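Interpretation: non-zero coefficients point to the levers that move cost, for example a high-CPM channel, a particular age band, or a mobile-device flag. Coefficients that Lasso set to zero mark features that add little beyond the others, so analysts can deprioritise them. Because numeric features were standardised, their coefficients are in USD per one standard deviation of the feature; one-hot coefficients are measured relative to the dropped baseline category.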
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
coef = search.best_estimator_.named_steps['model'].coef_
importance = (pd.Series(coef, index=feature_names)
.sort_values(key=abs, ascending=False))
plt.figure(figsize=(9,6))
importance.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Engagement Cost (Lasso Coefficients)')
plt.xlabel('Coefficient (Δ USD per engagement)')
plt.show()
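Coefficients of standardised numeric features can be converted back to original units by dividing by the scaler's standard deviation. The sketch below shows this on one synthetic feature. The impressions range, the true slope of 0.00002 USD per impression, and alpha=0.001 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
impressions_demo = rng.uniform(1_000, 50_000, size=400)  # hypothetical raw feature
cost_demo = 0.50 + 0.00002 * impressions_demo + rng.normal(scale=0.05, size=400)

unit_scaler = StandardScaler()
z_demo = unit_scaler.fit_transform(impressions_demo.reshape(-1, 1))
unit_model = Lasso(alpha=0.001).fit(z_demo, cost_demo)

coef_per_sd = unit_model.coef_[0]                     # USD per one standard deviation
coef_per_unit = coef_per_sd / unit_scaler.scale_[0]   # USD per single impression
print(f"per SD: {coef_per_sd:.4f} | per impression: {coef_per_unit:.7f}")
```

In the tutorial's pipeline, the fitted scaler should be reachable the same way as the encoder, through `named_transformers_['num']`.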
Summary
In a few dozen lines of code, we produced an interpretable, cross-validated Lasso model that:
- Forecasts the cost per engagement before a single ad runs.
- Ranks the strongest cost drivers, guiding budget reallocation toward efficient segments.
- Refreshes quickly: thanks to the unified Pipeline, ingesting a new month of data requires only another call to fit().
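The refresh step can be sketched as follows. The helper names `build_cost_pipeline` and `fake_month`, the synthetic monthly data, and the fixed alpha are all hypothetical. The point is only that one `fit()` call relearns the encoder, the scaler and the coefficients together.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_cost_pipeline(cat_cols, num_cols, alpha):
    """Hypothetical helper: preprocessing and Lasso in one estimator."""
    prep = ColumnTransformer(
        [
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
            ("num", StandardScaler(), num_cols),
        ],
        sparse_threshold=0,  # dense output
    )
    return Pipeline([("prep", prep), ("model", Lasso(alpha=alpha, max_iter=10_000))])

def fake_month(seed, n=100):
    """Hypothetical generator standing in for one month of campaign data."""
    rng = np.random.default_rng(seed)
    channel = rng.choice(["social", "search", "email"], size=n)
    impressions = rng.uniform(1_000, 50_000, size=n)
    cost = 1.0 + 0.5 * (channel == "search") + rng.normal(scale=0.1, size=n)
    return pd.DataFrame({"channel": channel, "impressions": impressions}), pd.Series(cost)

X_m1, y_m1 = fake_month(0)
X_m2, y_m2 = fake_month(1)

cost_pipe = build_cost_pipeline(["channel"], ["impressions"], alpha=0.01)
cost_pipe.fit(X_m1, y_m1)  # initial fit on month 1

# A new month arrives: append it and refit everything in one call.
X_all = pd.concat([X_m1, X_m2], ignore_index=True)
y_all = pd.concat([y_m1, y_m2], ignore_index=True)
cost_pipe.fit(X_all, y_all)

refreshed_preds = cost_pipe.predict(X_m2.head(3))
print(np.round(refreshed_preds, 2))
```

In practice, α would be re-tuned from time to time by rerunning the grid search rather than kept fixed.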
Deploying this tool empowers marketing teams to move from reactive “report‑and‑adjust” cycles to proactive, dollar‑precise campaign planning.