Retail Promotion Cost Prediction using Stepwise Regression in ML
Retailers run promotions (discounts, coupons, multi‑buy offers) to boost sales, but these campaigns incur costs—both direct (rebate value, printing/distribution) and indirect (increased handling, spoilage). In this project, we will predict the total promotion cost for individual campaigns based on features such as promotional type, discount rate, expected uplift in units sold, product category, store location, and campaign duration.
By applying stepwise regression, we’ll isolate the most influential cost drivers and build an interpretable linear model that balances simplicity with predictive performance—helping marketing teams budget promotions more accurately and maximise ROI.
Libraries Required
import pandas as pd                                          # Data loading & manipulation
import numpy as np                                           # Numerical operations
import statsmodels.api as sm                                 # Ordinary Least Squares regression
from sklearn.model_selection import train_test_split         # Train/test split
from sklearn.metrics import r2_score, mean_squared_error     # Evaluation metrics
import matplotlib.pyplot as plt                              # Visualization
Dataset
Cost Prediction for Acquiring Customers
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load a Food Mart promotions dataset containing ~60,000 campaigns, each with product category, store, promotion type, discount rate, expected sales uplift, duration, and observed cost.
# Block 1: Load dataset
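# Media Campaign Cost Prediction – Food Mart (60K campaigns)
# Source: https://www.kaggle.com/datasets/ramjasmaurya/medias-cost-prediction-in-foodmart
# Kaggle downloads require a login, so pd.read_csv cannot read the download URL directly.
# Download the CSV manually first; the local file name below is an assumption.
csv_path = "media_campaign_cost.csv"
df = pd.read_csv(csv_path)
print(df.head())
df.info()  # info() prints its report itself and returns None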
print(df.describe())
Data Preprocessing
Rows missing any core field are removed. We one‑hot encode categorical predictors (Product_Category, Promo_Type, Store_ID) to prepare for regression. We separate predictors (X) from the target cost (y) and perform an 80/20 train/test split.
# Block 2: Clean & encode
# Assume columns include: 'Campaign_ID', 'Product_Category', 'Store_ID',
# 'Promo_Type', 'Discount_Rate', 'Expected_Uplift', 'Duration_Days', 'Cost_USD'
# Drop any rows with missing critical values
df = df.dropna(subset=[
'Product_Category','Store_ID','Promo_Type',
'Discount_Rate','Expected_Uplift','Duration_Days','Cost_USD'
])
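# One‑hot encode categorical columns
# dtype=float keeps the dummies numeric; recent pandas versions default to bool,
# which statsmodels OLS may reject when mixed with float columns
df_enc = pd.get_dummies(df,
                        columns=['Product_Category','Promo_Type','Store_ID'],
                        drop_first=True,
                        dtype=float)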
# Define predictors and target
X = df_enc.drop(['Campaign_ID','Cost_USD'], axis=1)
y = df_enc['Cost_USD']
# Split into training and testing sets (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
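To make the encoding step concrete, here is a minimal sketch on a toy frame. The column names follow the schema assumed above, while the frame `toy` and its values are made up for illustration. With `drop_first=True`, the first level in sorted order is dropped and becomes the baseline that the regression intercept absorbs, which avoids perfectly collinear dummy columns.

```python
import pandas as pd

# Hypothetical toy data; values are invented purely for illustration.
toy = pd.DataFrame({
    "Promo_Type": ["Coupon", "Discount", "MultiBuy", "Coupon"],
    "Discount_Rate": [0.10, 0.25, 0.15, 0.05],
})

# "Coupon" sorts first, so it is the dropped (baseline) level.
toy_enc = pd.get_dummies(toy, columns=["Promo_Type"], drop_first=True, dtype=float)

print(toy_enc.columns.tolist())
print(toy_enc)
```

A "Coupon" row has zeros in both remaining dummy columns, so each dummy coefficient is read as the cost difference relative to a coupon promotion.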
Stepwise Regression Function
The stepwise_selection function alternates forward inclusion (adding predictors with p < 0.01) and backward elimination (dropping predictors with p > 0.05) until no further changes, yielding a concise set of statistically significant features.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # threshold_in must stay below threshold_out; otherwise the same
    # variable could be added and dropped indefinitely
    included = list(initial_list) if initial_list else []
    while True:
        changed = False
        # Forward step: test each excluded predictor
        excluded = [c for c in X.columns if c not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()  # NaN when nothing is left to test
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")
        # Backward step: test each included predictor
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.drop("const", errors="ignore")  # exclude intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_var = pvals.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
- Using the selected features, we fit an Ordinary Least Squares regression via statsmodels. The .summary() output provides coefficient estimates (cost impact per unit change), p-values (significance), R², and diagnostic statistics (AIC, F‑statistic), enabling interpretation of each driver’s effect on promotion cost. Because the same training data was used to choose the features, these p-values tend to be optimistic and are best read as a guide rather than a formal test.
- Predictions on unseen test data yield R² (share of variance explained) and RMSE (typical error magnitude, in the same units as cost), quantifying how well the model generalises.
# Block 4: Perform stepwise feature selection
selected_features = stepwise_selection(X_train, y_train)
# Fit the final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
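# Predict on test set
# has_constant="add" forces the intercept column even if a selected dummy
# happens to be constant in the test split, keeping the columns aligned
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")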
y_pred = model.predict(X_test_sel)
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
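As a worked check of the two metrics, the following sketch computes R² and RMSE by hand on four made-up cost values (`y_true` and `y_hat` are hypothetical, not model output). R² is one minus the ratio of residual to total sum of squares, and RMSE is the square root of the mean squared residual.

```python
import numpy as np

# Hypothetical observed and predicted costs, chosen so the arithmetic is easy.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 260.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares: 4 * 10^2 = 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean 175: 12500

r2 = 1 - ss_res / ss_tot                # 1 - 400 / 12500 = 0.968
rmse = np.sqrt(ss_res / len(y_true))    # sqrt(400 / 4) = 10.0

print(f"R^2  = {r2:.3f}")
print(f"RMSE = {rmse:.1f}")
```

Every prediction here is off by exactly 10, so the RMSE is 10 in the units of cost, while R² stays high because that error is small relative to the spread of the observed values.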
Residual Diagnostics
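A scatter plot of residuals vs. predicted costs helps reveal curvature or a changing spread of errors (heteroscedasticity). Either pattern would violate the OLS assumptions of linearity and constant error variance, so a structureless band around zero is what we hope to see.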
# Block 5: Plot residuals to check assumptions
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Promotion Cost (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
Summary
Applying stepwise regression to retail promotion data helps isolate candidate cost drivers, such as discount rate, expected uplift, campaign duration, and specific promo or store categories, while pruning redundant variables. The resulting linear model trades some flexibility for interpretability: each coefficient has a direct reading, and the test‑set R² and RMSE show how far that simplicity costs in accuracy. Provided the residual diagnostics look reasonable, this gives retail marketers a transparent tool to forecast promotion costs and plan campaigns.