Retail Sales Prediction using Quantile Regression in ML
FREE Online Courses: Your Passport to Excellence - Start Now
Most retail forecasting models target the average expected sales, but retailers must also prepare for variability—knowing what sales look like in weak (10th percentile) or strong (90th percentile) periods. In this project, we’ll predict multiple quantiles (e.g. 10th, 50th, 90th percentiles) of daily store sales using features such as store type, promotion flags, competition distance, and calendar variables. By fitting separate quantile regression models for each percentile, we reveal how drivers impact sales across the distribution—enabling supply‑chain planners to stock safely in downturns and capitalise on peak demand.
Libraries Required
import pandas as pd import numpy as np import statsmodels.formula.api as smf # Quantile regression from sklearn.model_selection import train_test_split from sklearn.metrics import mean_pinball_loss
Dataset
Step-by-Step Code Implementation
Load & Inspect Data
We read train.csv (daily sales) and store.csv (store metadata) and merge on Store. Initial .info() and .describe() confirm types, missingness, and value ranges.
# Load Rossmann Store Sales dataset (train.csv) :contentReference[oaicite:0]{index=0}
df_sales = pd.read_csv("train.csv")
df_store = pd.read_csv("store.csv")
# Merge store attributes
df = df_sales.merge(df_store, on="Store", how="left")
# Quick inspection
print(df.head())
print(df.info())
print(df.describe())
Preprocessing & Feature Engineering
- Convert Date to datetime and extract Year, Month, and DayOfWeek.
- One‑hot encode StoreType, Assortment, and StateHoliday.
- Impute missing CompetitionDistance with its maximum to denote “far” competitors.
- Filter to days when stores were open and sales > 0.
- Define our feature set containing competition, promotion flag, calendar, and dummies.
# Convert date and extract calendar features
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.weekday
# One‑hot encode categorical features
df = pd.get_dummies(df,
columns=['StoreType','Assortment','StateHoliday'],
drop_first=True)
# Fill missing competition distance with a large value
df['CompetitionDistance'].fillna(df['CompetitionDistance'].max(), inplace=True)
# Define predictors and target (filter only open days with sales > 0)
df = df[(df['Open']==1) & (df['Sales']>0)]
features = [
'CompetitionDistance','Promo','DayOfWeek','Month','Year'
] + [c for c in df.columns if c.startswith('StoreType_')
or c.startswith('Assortment_')
or c.startswith('StateHoliday_')]
data = df[features + ['Sales']]
Train/Test Split
We randomly hold out 20% of data for evaluation, ensuring our quantile models generalize.
train, test = train_test_split(data, test_size=0.2, random_state=42)
Fit Quantile Regression Models
For each quantile (10th, 50th, 90th):
- We build a formula linking Sales to all predictors.
- We fit a QuantReg model at that percentile.
- The printed coefficient table shows how each feature’s effect varies across the sales distribution (e.g., Promo might boost the 90th‐percentile sales more than the median).
quantiles = [0.1, 0.5, 0.9]
results = {}
formula = "Sales ~ " + " + ".join(features)
for q in quantiles:
model = smf.quantreg(formula, train)
res = model.fit(q=q)
results[q] = res
print(f"\n--- {int(q*100)}th Percentile ---")
print(res.summary().tables[1])
Evaluation with Pinball Loss
We predict on the test set and compute pinball loss—the proper scoring rule for quantile forecasts—quantifying the average weighted error and enabling fair comparison across quantiles.
for q, res in results.items():
preds = res.predict(test[features])
loss = mean_pinball_loss(test['Sales'], preds, alpha=q)
print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
Summary
Quantile regression uncovers the heterogeneous impacts of store and calendar features across the sales distribution. For instance, promotions may yield small median sales lifts but substantial increases in the upper tail of sales. By modeling the 10th, 50th, and 90th percentiles, retailers gain a distribution‑aware forecast: provisioning inventory for slow days, forecasting typical demand, and scaling up for peak periods. These insights empower supply‑chain and merchandising teams to optimize stocking, pricing, and promotional strategies under uncertainty.