Marketing ROI Prediction using Stepwise Regression in ML


Marketing teams invest in multiple channels—email, social, search, display—with each campaign generating different levels of revenue and cost. Forecasting Return on Investment (ROI) for future campaigns enables better budget allocation and strategy optimization.

In this marketing ROI prediction ML project, we will predict Campaign_ROI (revenue ÷ cost) from campaign inputs such as Impressions, Clicks, Conversions, Channel, Target_Audience_Size, and Duration_Days.

By applying stepwise regression, we’ll isolate the most influential predictors and build an interpretable linear model that balances simplicity with predictive power—helping CMOs maximize marketing efficiency.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

Marketing Campaign Performance Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load a rich campaign‐performance dataset—including spend, revenue, and key engagement metrics—and preview its structure and summary statistics.

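# Block 1: Load dataset
# Marketing Campaign Performance Dataset (Kaggle):
# https://www.kaggle.com/datasets/manishabhatt22/marketing-campaign-performance-dataset/download
# Kaggle downloads need a signed-in session, so pd.read_csv cannot fetch this URL directly.
# Download the CSV manually and point csv_path at the local file.
csv_path = "marketing_campaign_dataset.csv"  # assumed local filename; adjust to your download
df = pd.read_csv(csv_path)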

print(df.head())
df.info()  # prints its report directly; wrapping it in print() would also print "None"
print(df.describe())

Data Preprocessing

We calculate Campaign_ROI as Revenue_USD / Cost_USD, drop invalid or infinite ratios, and one‑hot encode the Channel categorical field. Predictors include raw counts (Impressions, Clicks, Conversions), audience size, duration, and channel dummies.

# Block 2: Clean & encode  
# Assume columns: 'Impressions','Clicks','Conversions','Cost_USD','Revenue_USD',
# 'Channel','Target_Audience_Size','Duration_Days'

# Compute ROI
df['Campaign_ROI'] = df['Revenue_USD'] / df['Cost_USD']

# Drop records with missing or non-finite values in the target or any numeric predictor
df = df.replace([np.inf, -np.inf], np.nan).dropna(
    subset=['Impressions','Clicks','Conversions','Cost_USD','Campaign_ROI',
            'Target_Audience_Size','Duration_Days']
)

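# One-hot encode campaign channel
# dtype=float keeps the dummies numeric; recent pandas versions default to bool,
# which statsmodels OLS rejects when mixed with numeric columns
df_enc = pd.get_dummies(df, columns=['Channel'], drop_first=True, dtype=float)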

# Define predictors and target
X = df_enc[[
    'Impressions','Clicks','Conversions',
    'Target_Audience_Size','Duration_Days'
] + [col for col in df_enc.columns if col.startswith('Channel_')]]
y = df_enc['Campaign_ROI']

# Train–test split (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
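To make the preprocessing concrete, here is a minimal sketch on a toy table. The numbers are made up for illustration and do not come from the Kaggle dataset; the column names follow the ones assumed in Block 2. It shows how a zero-cost campaign produces an infinite ROI that is then dropped, and how `drop_first=True` turns the first channel (alphabetically) into the baseline.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up numbers, not from the Kaggle dataset)
toy = pd.DataFrame({
    "Channel": ["Email", "Search", "Social", "Email"],
    "Cost_USD": [100.0, 0.0, 250.0, 400.0],
    "Revenue_USD": [300.0, 50.0, 500.0, 200.0],
})

# ROI = revenue / cost; the zero-cost Search row becomes inf
toy["Campaign_ROI"] = toy["Revenue_USD"] / toy["Cost_USD"]

# Convert inf to NaN and drop rows without a usable ROI
toy = toy.replace([np.inf, -np.inf], np.nan).dropna(subset=["Campaign_ROI"])

# One-hot encode Channel; Email is dropped and acts as the baseline
toy_enc = pd.get_dummies(toy, columns=["Channel"], drop_first=True, dtype=float)
print(toy_enc)
```

Because Email is the baseline, the coefficient on a channel dummy in the later regression reads as the ROI difference relative to Email campaigns, with the other predictors held fixed.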

Stepwise Regression Function

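Our stepwise_selection function alternates two steps. The forward step adds the excluded feature with the lowest p-value, provided that value is below 0.01. The backward step removes the included feature with the highest p-value, provided that value is above 0.05. The loop repeats until neither step changes the model, leaving a compact set of statistically significant predictors. Because the entry threshold (0.01) is stricter than the removal threshold (0.05), a feature that has just been added is not dropped again in the same pass.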

# Block 3: Forward–backward stepwise selection
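def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    """Return the list of column names in X chosen by p-value-based stepwise selection."""
    # Use None instead of a mutable default argument, and copy the caller's list
    included = list(initial_list) if initial_list is not None else []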
    while True:
        changed = False

        # Forward step: test each excluded predictor
        excluded = [col for col in X.columns if col not in included]
        pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best_pval = pvals.min()
        if best_pval < threshold_in:
            best_var = pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")

        # Backward step: test removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals_incl = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals_incl.max()
        if worst_pval > threshold_out:
            worst_var = pvals_incl.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

We fit an Ordinary Least Squares regression on the selected features with statsmodels. The .summary() output reports coefficient estimates (the change in ROI per unit change in a predictor, holding the others fixed), p-values (significance), R², and fit statistics such as AIC and the F-statistic.
Predictions on the held-out test set give the test R² (share of variance explained) and RMSE (root-mean-square error, in ROI units), which together measure out-of-sample predictive performance.

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

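# Fit final OLS model
# has_constant='add' forces an intercept column even if a selected feature
# happens to be constant within one split, so train and test columns always match
X_train_sel = sm.add_constant(X_train[selected_features], has_constant='add')
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')
y_pred = model.predict(X_test_sel)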

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
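To show what the two metrics measure, here is a small worked example that computes them by hand with NumPy. The four ROI values are made up for illustration; the formulas are the standard definitions that `r2_score` and the square root of `mean_squared_error` implement.

```python
import numpy as np

# Made-up actual and predicted ROI for four campaigns (illustrative only)
y_true = np.array([2.0, 3.0, 5.0, 6.0])
y_hat = np.array([2.5, 2.5, 5.5, 5.5])

errors = y_true - y_hat
sse = np.sum(errors ** 2)                    # sum of squared errors
sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean

rmse_manual = np.sqrt(sse / len(y_true))     # root-mean-square error, in ROI units
r2_manual = 1.0 - sse / sst                  # share of variance explained

print("RMSE:", rmse_manual)  # 0.5
print("R²:", r2_manual)      # 0.9
```

Every prediction here is off by 0.5 ROI points, so the RMSE is 0.5. The squared errors sum to 1.0 against a total sum of squares of 10.0, giving R² = 0.9.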

Residual Diagnostics

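A plot of residuals against predicted ROI helps reveal non-random patterns or heteroscedasticity (residual spread that changes with the prediction). A patternless band around zero is consistent with the core linear regression assumptions; a funnel or curve suggests the model is misspecified.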

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Campaign ROI")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted ROI")
plt.show()
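A rough numeric companion to the plot is to compare the residual spread in the lower and upper halves of the predictions. The sketch below uses synthetic stand-ins for `y_pred` and `residuals`, with spread deliberately made to grow with the prediction; it is an informal check under those assumptions, not a formal heteroscedasticity test.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for y_pred and residuals (illustrative only)
pred_demo = rng.uniform(1.0, 5.0, size=500)
resid_demo = rng.normal(size=500) * pred_demo  # spread grows with the prediction

# Split at the median prediction and compare residual spread in each half
cut = np.median(pred_demo)
low_sd = resid_demo[pred_demo <= cut].std(ddof=1)
high_sd = resid_demo[pred_demo > cut].std(ddof=1)
spread_ratio = high_sd / low_sd

print(f"Residual SD, lower half of predictions: {low_sd:.2f}")
print(f"Residual SD, upper half of predictions: {high_sd:.2f}")
print(f"Ratio (upper / lower): {spread_ratio:.2f}")
```

A ratio near 1 is what constant variance would look like. A ratio well above or below 1, as in this constructed case, corresponds to the funnel shape in the residual plot.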

Summary

By applying stepwise regression to campaign-level data, we keep the predictors most strongly associated with ROI and prune the less informative ones.

The resulting linear model trades some flexibility for interpretability (a few significant features); its predictive accuracy should be judged by the test R² and RMSE reported above. Used that way, it gives marketing leaders a transparent tool to forecast ROI and guide budget allocation across channels.
