Retail Sales Growth Prediction with Polynomial Regression in ML


Retail planners need to predict next week’s sales growth (%) for each store before promotions go live and supply orders are placed. Historical data reveal that week-over-week sales growth depends nonlinearly on prior sales levels, promotional flags, competitor distance, holiday indicators, and store characteristics. A straight-line model fails to capture saturation effects (e.g. diminishing returns on promotions), while an unconstrained polynomial overfits noise. By using Polynomial Regression—i.e., linear regression on expanded polynomial and interaction features—alongside Ridge regularisation, we can model these smooth nonlinearities and deliver accurate, interpretable growth forecasts for inventory planning and marketing optimisation.
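To make the saturation argument concrete, here is a minimal sketch on synthetic data (an assumption for illustration, not the Rossmann data). It fits the same scale, expand, Ridge chain at degree 1 and degree 3 to a response that flattens out, and compares in-sample R². The helper name fit_r2 and all numbers are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (illustrative assumption): the response rises quickly
# and then flattens, i.e. it saturates.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 4.0, size=300).reshape(-1, 1)
y = 1.0 - np.exp(-x.ravel()) + rng.normal(0.0, 0.02, size=300)

def fit_r2(degree):
    """Scale, expand to polynomial terms, fit Ridge, return in-sample R^2."""
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        Ridge(alpha=1e-3),
    )
    model.fit(x, y)
    return model.score(x, y)

r2_linear = fit_r2(1)
r2_cubic = fit_r2(3)
print(f"degree 1 R^2: {r2_linear:.3f}")
print(f"degree 3 R^2: {r2_cubic:.3f}")
```

The straight line cannot bend with the curve, so its R² should come out noticeably lower. In-sample scores always favour the more flexible model, which is why the tutorial selects the degree by cross-validation rather than by training fit.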

Dataset

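Rossmann Store Sales (Kaggle). The code below uses two files: train.csv, with daily sales per store, and store.csv, with store-level metadata such as CompetitionDistance and Promo2.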

Step-by-Step Code Implementation

1. Libraries Required

import pandas as pd                                 # data manipulation  
import numpy as np                                  # numerical computations  

import matplotlib.pyplot as plt                     # plotting  
import seaborn as sns                               # enhanced visualisation  

from sklearn.model_selection import train_test_split, GridSearchCV  
from sklearn.preprocessing import StandardScaler, PolynomialFeatures  
from sklearn.linear_model import Ridge  
from sklearn.pipeline import Pipeline  
from sklearn.metrics import mean_squared_error, r2_score

2. Load Data & Libraries

import pandas as pd
import numpy as np

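# Load raw data. StateHoliday mixes numeric 0 with letter codes in the raw file,
# so read it as text to keep the column uniformly typed for one-hot encoding.
sales = pd.read_csv("data/train.csv", parse_dates=["Date"], dtype={"StateHoliday": str})
stores = pd.read_csv("data/store.csv")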

# Merge store metadata
df = sales.merge(stores, on="Store", how="left")

# Sort by Store and Date
df.sort_values(["Store","Date"], inplace=True)

3. Feature Engineering

PolynomialFeatures: expands scaled numerical inputs (previous sales, distance, calendar) and one-hot encoded dummies (promo, holidays) into squared and interaction terms—e.g. Sales_prev_week², Sales_prev_week×Promo.
Growth target: week-over-week percentage change in sales per store, capturing momentum and saturation.

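# Compute week-over-week growth for each store.
# shift(7) assumes one row per store per day, so 7 rows back is the same weekday last week.
df["Sales_prev_week"] = (df
    .groupby("Store")["Sales"]
    .shift(7))
df["Growth"] = (df["Sales"] - df["Sales_prev_week"]) / df["Sales_prev_week"]
# A zero baseline (e.g. a closed day) yields inf or NaN; treat both as missing and drop them
df["Growth"] = df["Growth"].replace([np.inf, -np.inf], np.nan)
df = df.dropna(subset=["Growth"])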

# Calendar features
df["WeekOfYear"] = df["Date"].dt.isocalendar().week.astype(int)  # plain int instead of nullable UInt32
df["Year"]       = df["Date"].dt.year

# Select predictors
X = df[[
    "Sales_prev_week","Promo","Promo2","CompetitionDistance",
    "StateHoliday","SchoolHoliday","WeekOfYear","Year"
]]
y = df["Growth"]
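One detail of the growth target is worth checking on a toy frame: when the sales value seven rows back is zero (for example, a closed day), the ratio is infinite, and dropna alone does not remove infinities. The sketch below uses two hypothetical stores with made-up sales to show the grouped shift and one possible guard.

```python
import numpy as np
import pandas as pd

# Toy daily data for two hypothetical stores, eight days each.
toy = pd.DataFrame({
    "Store": [1] * 8 + [2] * 8,
    "Date": list(pd.date_range("2015-01-01", periods=8)) * 2,
    "Sales": [100, 90, 95, 80, 85, 120, 0, 110,   # store 1
              0, 40, 45, 50, 55, 60, 0, 50],      # store 2 (closed on day 1)
}).sort_values(["Store", "Date"])

# Same weekday one week earlier, computed within each store.
toy["Sales_prev_week"] = toy.groupby("Store")["Sales"].shift(7)
toy["Growth"] = (toy["Sales"] - toy["Sales_prev_week"]) / toy["Sales_prev_week"]

# A zero baseline gives +/-inf (or NaN for 0/0); treat both as missing.
toy["Growth"] = toy["Growth"].replace([np.inf, -np.inf], np.nan)
valid = toy.dropna(subset=["Growth"])
print(valid[["Store", "Date", "Sales", "Sales_prev_week", "Growth"]])
```

Only store 1's eighth day survives, with a growth of 0.10 (110 against 100 a week earlier). Store 2's eighth day is dropped because its baseline was zero.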

4. Build Polynomial Regression Pipeline

StandardScaler: normalises numeric inputs so the Ridge penalty treats them equally.

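from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Separate numerical and categorical columns
num_cols = ["Sales_prev_week","CompetitionDistance","WeekOfYear","Year"]
cat_cols = ["Promo","Promo2","StateHoliday","SchoolHoliday"]

# CompetitionDistance is missing for a few stores; impute the median before
# scaling, because Ridge does not accept NaN values.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler())
])

preprocessor = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", OneHotEncoder(drop="first"), cat_cols)
])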

pipe = Pipeline([
    ("prep", preprocessor),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge(random_state=42))
])
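Because the polynomial step multiplies the column count, it can help to estimate the width of the design matrix before fitting. With n inputs and degree d, the number of terms excluding the constant is C(n + d, d) - 1. The sketch below checks that formula against PolynomialFeatures. The value n_inputs = 10 is an illustrative assumption, since the real count depends on how many one-hot columns the encoder produces.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_inputs, degree):
    """Number of polynomial terms up to `degree`, excluding the constant term."""
    return comb(n_inputs + degree, degree) - 1

# n_inputs=10 is an illustrative assumption, not a value read from the data.
n_inputs = 10
counts = {}
for degree in (1, 2, 3):
    poly_demo = PolynomialFeatures(degree=degree, include_bias=False)
    poly_demo.fit(np.zeros((1, n_inputs)))
    counts[degree] = poly_demo.n_output_features_
    print(degree, n_poly_features(n_inputs, degree), counts[degree])
```

Under that assumption the matrix grows from 10 columns at degree 1 to 65 at degree 2 and 285 at degree 3, which is one reason the regularisation strength matters more as the degree increases.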

5. Train/Test Split & Hyperparameter Search

GridSearchCV: tunes polynomial degree (1–3) and regularisation strength α (10⁻³ to 10³) using 5-fold CV to minimise RMSE on growth-rate predictions.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    "poly__degree": [1, 2, 3],
    "ridge__alpha": np.logspace(-3, 3, 7)
}

gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print("Best parameters:", gs.best_params_)
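The train_test_split call above shuffles rows, so the test set contains dates that fall between training dates. For a forecasting task, a chronological hold-out is a common alternative. The sketch below shows one way to do it. The helper chronological_split and its test_fraction parameter are hypothetical, not part of scikit-learn, and the toy frame is made up.

```python
import pandas as pd

def chronological_split(frame, date_col="Date", test_fraction=0.2):
    """Split rows so every test date is later than every training date."""
    dates = frame[date_col].sort_values().unique()
    n_test = max(1, round(len(dates) * test_fraction))
    cutoff = dates[len(dates) - n_test]  # first date that belongs to the test set
    train = frame[frame[date_col] < cutoff]
    test = frame[frame[date_col] >= cutoff]
    return train, test

# Toy frame: two hypothetical stores observed on ten consecutive days.
toy = pd.DataFrame({
    "Store": [1, 2] * 10,
    "Date": pd.date_range("2015-01-01", periods=10).repeat(2),
    "Growth": [0.01 * i for i in range(20)],
})
train_part, test_part = chronological_split(toy)
print(train_part["Date"].max(), "<", test_part["Date"].min())
```

Applied to the tutorial's data, the split would be taken on df before selecting X and y, so that the grid search only ever sees the earlier period.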

6. Evaluate Model

y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # the squared=False argument was removed in recent scikit-learn
r2   = r2_score(y_test, y_pred)

print(f"Test RMSE (growth rate): {rmse:.4f}")
print(f"Test R²               : {r2:.3f}")
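To read these two numbers, it helps to see them computed from their definitions and compared with a trivial baseline that always predicts the mean growth. The sketch below uses made-up growth rates, and the helper rmse_and_r2 is hypothetical.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    rmse = np.sqrt(np.mean(residual ** 2))
    ss_res = np.sum(residual ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return rmse, 1.0 - ss_res / ss_tot

# Made-up growth rates, for illustration only.
y_true_demo = np.array([0.10, -0.05, 0.20, 0.00])
y_model_demo = np.array([0.08, -0.02, 0.15, 0.03])
y_mean_demo = np.full_like(y_true_demo, y_true_demo.mean())

model_rmse, model_r2 = rmse_and_r2(y_true_demo, y_model_demo)
base_rmse, base_r2 = rmse_and_r2(y_true_demo, y_mean_demo)
print(f"model   : RMSE={model_rmse:.4f}, R^2={model_r2:.3f}")
print(f"baseline: RMSE={base_rmse:.4f}, R^2={base_r2:.3f}")
```

The mean predictor scores R² = 0 by construction, so a test R² close to zero means the fitted model is doing little better than predicting average growth.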

7. Inspect Key Polynomial Coefficients

Coefficient inspection: reveals which nonlinear or interaction features—such as squared prior sales or cross-terms with promo flags—most strongly drive predicted growth, offering actionable levers (e.g. optimal promo timing relative to past sales).

# Retrieve polynomial feature names
poly = gs.best_estimator_.named_steps["poly"]
# Get names from the preprocessing step
prep = gs.best_estimator_.named_steps["prep"]
num_features = prep.transformers_[0][2]
cat_features = prep.named_transformers_["cat"].get_feature_names_out(cat_cols)
all_inputs = np.hstack([num_features, cat_features])

feat_names = poly.get_feature_names_out(input_features=all_inputs)
coefs = gs.best_estimator_.named_steps["ridge"].coef_

# Plot top 10 by absolute value
import pandas as pd
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind="barh")
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Sales Growth")
plt.xlabel("Coefficient magnitude")
plt.tight_layout()
plt.show()
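The plot above ranks features by absolute value, which hides whether a term raises or lowers predicted growth. A small variation keeps the sign while still ordering by magnitude. The helper top_signed_coefficients and the coefficient values below are hypothetical.

```python
import pandas as pd

def top_signed_coefficients(coefs, names, k=10):
    """Return the k largest coefficients by magnitude, keeping their signs."""
    series = pd.Series(coefs, index=names)
    order = series.abs().sort_values(ascending=False).index
    return series.loc[order].head(k)

# Hypothetical coefficients, for illustration only.
demo = top_signed_coefficients(
    coefs=[0.02, -0.30, 0.11, -0.05],
    names=["Promo", "Sales_prev_week", "Sales_prev_week^2", "Sales_prev_week Promo"],
    k=3,
)
print(demo)
```

With the fitted model, the same helper could be called as top_signed_coefficients(coefs, feat_names) and plotted with .plot(kind="barh"). Because the inputs are standardised before expansion, coefficients are on a comparable scale, but correlated polynomial terms share credit, so individual values should be interpreted cautiously.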

Summary

By integrating polynomial feature engineering with Ridge regularisation into a robust pipeline, we deliver:

Nonlinear forecasts of retail sales growth, with accuracy measured by RMSE and R² on held-out data.
Controlled model complexity via α tuning, avoiding spurious high-order effects.
Interpretable insights: top polynomial features guide decisions on promotion timing, inventory allocation, and competitive positioning to maximise next-week sales growth.
