Retail Promotion Sales Prediction using Quantile Regression in ML

FREE Online Courses: Click, Learn, Succeed, Start Now!

Retailers run promotions to boost sales, but the uplift varies widely—some campaigns barely move the needle (10th percentile), while others go “viral” (90th percentile). Planning inventory and marketing budgets requires anticipating this variability, not just average uplift.

In this project, we’ll predict the 10th, 50th, and 90th percentiles of daily store‐level sales during promotions using the Rossmann Store Sales dataset. By fitting separate quantile regression models with features such as promotional flags, competition distance, store type, and calendar effects, we’ll uncover how drivers influence low-, median-, and high-sales days—helping planners set safety-stock levels, forecast typical demand, and prepare for peak load.

Libraries Required

import pandas as pd                        # Data loading & manipulation  
import numpy as np                         # Numerical operations  
import statsmodels.formula.api as smf      # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss  # Quantile loss metric

Dataset

Rossmann Store Sales

Step-by-Step Code Implementation

Load & Inspect Data

We merge daily sales (train.csv) with store metadata (store.csv), then examine shape, dtypes, and sales distribution—verifying ~1.03 M rows and heavy right skew in Sales.

# Load Rossmann daily sales and store info
sales = pd.read_csv("train.csv")
stores = pd.read_csv("store.csv")

# Merge on store ID
df = sales.merge(stores, on="Store", how="left")

# Quick inspections
print(df.shape)    
print(df.dtypes)   
print(df.Sales.describe())  # distribution of daily sales

Preprocessing & Feature Engineering

We parse the date to extract DayOfWeek, Month, and Year.
Missing CompetitionDistance values are imputed with their maximum to represent “no nearby competitor.”
We one‑hot encode store‐type, assortment‐level, and holiday flags.
We filter to days when the store was open (Open==1) to avoid zero‐sales closures.
We build a feature list—including the Promo flag—and assemble the modelling DataFrame.

# Convert date to datetime and extract calendar features
df['Date'] = pd.to_datetime(df['Date'])
df['DayOfWeek'] = df['Date'].dt.weekday + 1
df['Month']     = df['Date'].dt.month
df['Year']      = df['Date'].dt.year

# Impute missing competition distance with a large sentinel
df['CompetitionDistance'].fillna(df['CompetitionDistance'].max(), inplace=True)

# One-hot encode categorical features
df = pd.get_dummies(df,
    columns=['StoreType','Assortment','StateHoliday'],
    drop_first=True)

# Filter to open days with promotions
df = df[(df.Open==1)]

# Define predictors and target
features = [
    'Promo','CompetitionDistance','DayOfWeek','Month','Year'
] + [c for c in df.columns
     if c.startswith('StoreType_')
     or c.startswith('Assortment_')
     or c.startswith('StateHoliday_')]

data = df[features + ['Sales']].dropna()

Train/Test Split

An 80/20 random split holds out ~206 k records for evaluation, ensuring quantile models generalise to unseen days and stores.

# 80/20 split
train, test = train_test_split(data, test_size=0.2, random_state=42)

Fit Quantile Regression Models

For each quantile (10th, 50th, 90th):

We specify a formula relating Sales to all predictors.
We fit a QuantReg model at that percentile to the training data.
We print the coefficient table (.summary().tables[1]), revealing how each feature’s marginal impact differs across low‑, median‑, and high‑sales days (e.g., Promo may lift the 90th percentile by more dollars than the median).

quantiles = [0.10, 0.50, 0.90]
models    = {}
formula   = "Sales ~ " + " + ".join(features)

for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])

Evaluation with Pinball Loss

We predict quantile‐specific sales on the test set and compute pinball loss for each model—a proper scoring rule for quantile forecasts that penalises under‑ and over‑prediction asymmetrically. Lower pinball loss indicates better‐calibrated quantile estimates.

for q, res in models.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['Sales'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")

Summary

By applying quantile regression to promotional sales data, we obtain distribution‑aware forecasts that inform:

Safety Stock: The 10th‑percentile forecast sets a conservative baseline for replenishment.
Typical Demand: the median forecast (50th percentile) guides day‑to‑day inventory and staffing.
Peak Load: The 90th‑percentile forecast prepares for promotional spikes, preventing stockouts or understaffing.

These insights empower retailers to align inventory, workforce, and budget decisions with the full spectrum of promotional outcomes—optimising service levels and ROI under demand uncertainty.

Your 15 seconds will encourage us to work even harder
Please share your happy experience on Google | Facebook