Customer Engagement Cost Prediction with Lasso Regression in ML

Digital‑marketing teams track likes, clicks, and video views, but they rarely know in advance how much each meaningful interaction will cost once the campaign launches. We aim to build a Lasso‑regularised linear model that:

  • Predicts the engagement cost (USD spent per meaningful interaction) for a planned ad set, using only features available at planning time: channel, audience age‑band, device, bid strategy, planned impressions, etc. Clicks are not a feature, because they are only known after launch and are used to build the target.
  • Selects the few variables that truly drive that cost, because Lasso’s ℓ₁ penalty shrinks the coefficients of weak predictors to exactly zero and keeps the model transparent for media planners.
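
To make the selection effect concrete, here is a minimal sketch on synthetic data (not the marketing dataset). It assumes ten random features of which only the first two influence the target; the variable names and the value alpha=0.3 are illustrative choices, not part of the project code.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X_demo = rng.normal(size=(n_samples, n_features))
# Only columns 0 and 1 influence the synthetic target; the rest are noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n_samples)

X_scaled = StandardScaler().fit_transform(X_demo)
lasso_demo = Lasso(alpha=0.3).fit(X_scaled, y_demo)

# Indices of the features whose coefficients survived the penalty.
kept = np.flatnonzero(lasso_demo.coef_)
print("Non-zero coefficient indices:", kept)
print("Coefficients:", np.round(lasso_demo.coef_, 2))
```

The two informative features keep non-zero (slightly shrunken) coefficients, while the noise columns are set to exactly zero, which is the behaviour the project relies on.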

Libraries Required

Purpose | Library
Data handling | pandas, numpy
Visualisation | matplotlib, seaborn
ML workflow | scikit‑learn: ColumnTransformer, OneHotEncoder, StandardScaler, Pipeline, Lasso, GridSearchCV
Evaluation | scikit‑learn metrics: mean_squared_error, r2_score

Dataset Link

Predict Conversion in Digital Marketing

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

2. Download and load the dataset

Dataset — the file logs campaign‑level metrics: demographics, impressions, clicks, conversions, total spend.

# one‑time shell command (Kaggle API token required):
# kaggle datasets download -d rabieelkharoua/predict-conversion-in-digital-marketing-dataset -p data --unzip

ads = pd.read_csv("data/marketing_campaign.csv")      # adjust name if necessary

3. Target engineering — Engagement Cost

Assume Total_Investment (USD) and Clicks are columns in the file.

Engagement_Cost = Total_Investment / Clicks, a precise dollar figure that resonates with finance and media teams.

ads = ads.query("Clicks > 0").copy()                  # avoid division by zero; copy() avoids SettingWithCopyWarning
ads['Engagement_Cost'] = ads['Total_Investment'] / ads['Clicks']   # $ / click
y = ads['Engagement_Cost']
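
As a quick worked example of the formula, the sketch below applies the same two steps to three made-up rows. The values are hypothetical; only the column names follow the assumption stated in this step.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({
    "Total_Investment": [120.0, 45.0, 80.0],
    "Clicks": [60, 0, 25],
})

toy = toy.query("Clicks > 0").copy()          # drops the zero-click row
toy["Engagement_Cost"] = toy["Total_Investment"] / toy["Clicks"]
print(toy)
```

The zero-click row disappears, and the remaining rows cost 120 / 60 = 2.00 USD and 80 / 25 = 3.20 USD per click.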

4. Feature matrix (X)

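# Keep planning‑time columns only (channel, age, gender, impressions, etc.):
# drop the target, the two columns it was built from, and the row identifier.
# Also drop any other post‑launch columns (e.g., conversions) if present, since they are unknown at planning time.
X = ads.drop(columns=['Engagement_Cost', 'Clicks', 'Total_Investment', 'ad_id'])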

5. Pre‑processing recipe

Categorical attributes (e.g., channel, age‑band, device) become one‑hot dummies; numeric columns (impressions, reach) are z‑scaled so Lasso’s penalty treats each feature fairly.

cat_cols = X.select_dtypes('object').columns
num_cols = X.select_dtypes(exclude='object').columns

preprocess = ColumnTransformer([
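        # sparse_output replaces the old `sparse` argument (scikit-learn >= 1.2);
        # handle_unknown='ignore' keeps CV folds from failing on unseen categories
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), cat_cols),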
        ('num', StandardScaler(), num_cols)
    ])

6. Train/test split

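# Plain random split: the target is continuous, and stratifying on a campaign name
# raises an error whenever a campaign appears in only one row.
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)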
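
For a continuous target such as cost, one common way to get a stratified split is to bin the target into quantiles and stratify on the bins. The sketch below shows the idea on synthetic, skewed costs; the data, the choice of five bins, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
y_syn = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=500))   # skewed synthetic costs
X_syn = pd.DataFrame({"impressions": rng.integers(1_000, 50_000, size=500)})

# Five quantile bins of the target act as strata.
cost_bins = pd.qcut(y_syn, q=5, labels=False)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=cost_bins)

print(f"Train median: {y_tr.median():.3f} | Test median: {y_te.median():.3f}")
```

Each cost quintile is then represented in the test set in the same proportion as in the training set, which keeps rare, expensive ad sets from landing entirely on one side of the split.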

7. Build & tune Lasso pipeline

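Pipeline & CV: wrapping the preprocessing and the model in one Pipeline means the encoder and scaler are refit on each training fold only, which prevents data leakage; a log‑spaced α sweep (0.001–10) with 5‑fold cross‑validation finds the best trade‑off between accuracy and sparsity.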

pipe = Pipeline([
        ('prep', preprocess),
        ('model', Lasso(max_iter=10_000, random_state=42))
    ])

param_grid = {'model__alpha': np.logspace(-3, 1, 25)}   # 0.001 → 10
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error', n_jobs=-1)
search.fit(X_train, y_train)

print("Best α:", search.best_params_['model__alpha'])
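
Beyond the single best α, it can help to look at the whole cross-validation curve. The following sketch runs a smaller sweep on synthetic data and tabulates the mean CV RMSE for each α; the data and the nine-point grid are illustrative assumptions, while `cv_results_` and `best_params_` are standard GridSearchCV attributes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X_syn = rng.normal(size=(150, 6))
y_syn = 2.0 * X_syn[:, 0] + rng.normal(scale=0.5, size=150)   # one real driver, five noise columns

grid = GridSearchCV(
    Lasso(max_iter=10_000),
    {"alpha": np.logspace(-3, 1, 9)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_syn, y_syn)

# Scores are negated RMSE, so flip the sign to read them as errors.
cv_table = pd.DataFrame({
    "alpha": [p["alpha"] for p in grid.cv_results_["params"]],
    "cv_rmse": -grid.cv_results_["mean_test_score"],
}).sort_values("alpha")
print(cv_table.round(3))
print("Best alpha:", grid.best_params_["alpha"])
```

If the best α sits at either end of the grid, the sweep range is probably too narrow and is worth widening.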

8. Evaluate on the hold‑out set

RMSE shows the typical prediction error in dollars, while R² indicates the share of variance explained.

y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # the `squared=False` argument is gone in recent scikit-learn
r2   = r2_score(y_test, y_pred)

print(f"Test RMSE: ${rmse:,.2f} per engagement | R²: {r2:.3f}")
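
An RMSE is easier to judge next to a naive baseline. The sketch below compares a Lasso model with a predictor that always returns the training mean, using synthetic data; the data-generating numbers and names such as `rmse_baseline` are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X_syn = rng.normal(size=(300, 5))
y_syn = 1.5 + 0.8 * X_syn[:, 0] + rng.normal(scale=0.3, size=300)  # synthetic "cost"

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)   # always predicts the training mean
model = Lasso(alpha=0.01).fit(X_tr, y_tr)

def rmse(y_true, y_hat):
    """Root mean squared error, in the units of the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_hat)))

rmse_baseline = rmse(y_te, baseline.predict(X_te))
rmse_model = rmse(y_te, model.predict(X_te))
print(f"Baseline RMSE: {rmse_baseline:.3f} | Lasso RMSE: {rmse_model:.3f}")
```

If the tuned model does not clearly beat the mean baseline on the hold-out set, the planning-time features carry little signal about cost.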

9. Interpret feature importance

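Interpretation: non‑zero coefficients spotlight the high‑impact levers, for example a high‑CPM channel aimed at females 25‑34, or a mobile‑device flag. Coefficients shrunk to zero mark features the model treats as noise, so analysts can skip them. Because numeric columns were z‑scaled, each numeric coefficient is the change in cost per one standard deviation of that feature; each one‑hot coefficient is the difference from the dropped reference category.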

ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])

coef = search.best_estimator_.named_steps['model'].coef_
importance = (pd.Series(coef, index=feature_names)
                .sort_values(key=abs, ascending=False))

plt.figure(figsize=(9,6))
importance.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Engagement Cost (Lasso Coefficients)')
plt.xlabel('Coefficient (Δ USD per engagement)')
plt.show()
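
Coefficients of z‑scaled numeric features are expressed per standard deviation. To report them per original unit, one option is to divide by the scaler's `scale_` attribute, as in this sketch on a single synthetic feature (the data and the true slope of 0.00002 USD per impression are made up for the example).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
impressions = rng.uniform(1_000, 50_000, size=400)          # synthetic feature
cost = 0.50 + 0.00002 * impressions + rng.normal(scale=0.05, size=400)

demo = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(alpha=0.001)),
]).fit(impressions.reshape(-1, 1), cost)

coef_per_sd = demo.named_steps["model"].coef_[0]   # USD per one standard deviation
sd = demo.named_steps["scale"].scale_[0]           # standard deviation learned by the scaler
coef_per_unit = coef_per_sd / sd                   # USD per single impression

print(f"Per standard deviation: {coef_per_sd:.4f} USD")
print(f"Per single impression:  {coef_per_unit:.7f} USD")
```

The per-unit figure lands close to the slope used to generate the data, slightly shrunk by the Lasso penalty.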

Summary

With well under a hundred lines of code, we produced an interpretable, cross‑validated Lasso model that:

  • Forecasts the cost per engagement before a single ad runs.
  • Ranks the features that raise or lower cost the most, guiding budget reallocation toward efficient segments.
  • Refreshes quickly: thanks to the unified Pipeline, ingesting a new month of data requires only another call to fit().
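
The forecast-and-refresh workflow can be sketched end to end on a tiny, hypothetical history table. Column names, values, and the fixed alpha=0.01 are assumptions for illustration; in the project, the tuned pipeline from step 7 would take the place of `pipe`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical history; column names and values are illustrative only.
history = pd.DataFrame({
    "channel": ["social", "search", "social", "email", "search", "email"],
    "impressions": [1000, 5000, 3000, 2000, 4000, 1500],
    "Engagement_Cost": [1.20, 2.10, 1.35, 0.60, 2.00, 0.55],
})
features = ["channel", "impressions"]

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
        ("num", StandardScaler(), ["impressions"]),
    ])),
    ("model", Lasso(alpha=0.01, max_iter=10_000)),
])
pipe.fit(history[features], history["Engagement_Cost"])

# Score a planned ad set before launch.
planned = pd.DataFrame({"channel": ["search"], "impressions": [3500]})
forecast = pipe.predict(planned)[0]
print(f"Forecast cost per engagement: ${forecast:.2f}")

# A new month arrives: append it and refit the same pipeline object.
new_month = pd.DataFrame({
    "channel": ["social", "email"],
    "impressions": [2500, 1800],
    "Engagement_Cost": [1.30, 0.58],
})
history = pd.concat([history, new_month], ignore_index=True)
pipe.fit(history[features], history["Engagement_Cost"])
```

Because encoding, scaling, and the model live in one object, the refit relearns category levels and scaling statistics along with the coefficients.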

Deploying this tool empowers marketing teams to move from reactive “report‑and‑adjust” cycles to proactive, dollar‑precise campaign planning.

ProjectGurukul Team
