Retail Sales Growth Prediction with Polynomial Regression in ML
Retail planners need to predict next week’s sales growth (%) for each store before promotions go live and supply orders are placed. Historical data reveal that week-over-week sales growth depends nonlinearly on prior sales levels, promotional flags, competitor distance, holiday indicators, and store characteristics. A straight-line model fails to capture saturation effects (e.g. diminishing returns on promotions), while an unconstrained polynomial overfits noise. By using Polynomial Regression—i.e., linear regression on expanded polynomial and interaction features—alongside Ridge regularisation, we can model these smooth nonlinearities and deliver accurate, interpretable growth forecasts for inventory planning and marketing optimisation.
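For reference, the degree-2 case of the model described above can be written out as follows. The notation is chosen here for illustration and does not come from the original text: x_j are the p preprocessed inputs, the β terms are the fitted coefficients, and α is the Ridge strength. The penalty follows the standard Ridge objective, in which the intercept β_0 is not penalised.

```latex
\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \sum_{j=1}^{p} \sum_{k=j}^{p} \beta_{jk}\, x_j x_k

\min_{\beta}\; \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2
  + \alpha \Bigl( \sum_{j=1}^{p} \beta_j^2 + \sum_{j \le k} \beta_{jk}^2 \Bigr)
```

The model is nonlinear in the inputs but still linear in the coefficients, which is why an ordinary linear solver can fit it once the features are expanded.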
Dataset
Step-by-Step Code Implementation
1. Libraries Required
import pandas as pd  # data manipulation
import numpy as np  # numerical computations
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # enhanced visualisation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
2. Load Data & Libraries
import pandas as pd
import numpy as np
# Load raw data; StateHoliday is read as text so all of its categories share one type
sales = pd.read_csv("data/train.csv", parse_dates=["Date"], dtype={"StateHoliday": str})
stores = pd.read_csv("data/store.csv")
# Merge store metadata
df = sales.merge(stores, on="Store", how="left")
# Sort by Store and Date
df.sort_values(["Store","Date"], inplace=True)
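As a sanity check on the merge step, the sketch below uses two small made-up tables (the values are hypothetical and only mimic the column names used above). It assumes store metadata has one row per store and uses the `validate` argument of `merge` to enforce that.

```python
import pandas as pd

# Hypothetical toy tables standing in for train.csv and store.csv
sales_toy = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-05"]),
    "Sales": [5000, 5200, 3100],
})
stores_toy = pd.DataFrame({
    "Store": [1, 2],
    "CompetitionDistance": [1270.0, 570.0],
})

# validate="many_to_one" raises MergeError if a store appears twice in stores_toy
merged = sales_toy.merge(stores_toy, on="Store", how="left", validate="many_to_one")

# A left join on a unique key keeps exactly one output row per sales row
assert len(merged) == len(sales_toy)
print(merged)
```

If the row count grows after a left merge, the metadata table has duplicate keys and every downstream growth figure would be computed on duplicated sales rows.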
3. Feature Engineering
- PolynomialFeatures: expands scaled numerical inputs (previous sales, distance, calendar) and one-hot encoded dummies (promo, holidays) into squared and interaction terms—e.g. Sales_prev_week², Sales_prev_week×Promo.
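- Growth target: week-over-week change in sales per store, expressed as a fraction (0.05 means 5%) relative to the same weekday one week earlier, capturing momentum and saturation. The 7-row shift in the code assumes one row per store per day.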
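# Compute week-over-week growth for each store
# (shift(7) assumes one row per store per day, with no missing dates)
df["Sales_prev_week"] = df.groupby("Store")["Sales"].shift(7)
df["Growth"] = (df["Sales"] - df["Sales_prev_week"]) / df["Sales_prev_week"]
# A zero in Sales_prev_week (e.g. a closed day) yields inf or NaN, which Ridge cannot fit
df["Growth"] = df["Growth"].replace([np.inf, -np.inf], np.nan)
# Also drop rows with no CompetitionDistance, since the model rejects missing inputs
df = df.dropna(subset=["Growth", "CompetitionDistance"])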
# Calendar features
df["WeekOfYear"] = df["Date"].dt.isocalendar().week.astype(int)  # plain int instead of nullable UInt32
df["Year"] = df["Date"].dt.year
# Select predictors
X = df[[
"Sales_prev_week","Promo","Promo2","CompetitionDistance",
"StateHoliday","SchoolHoliday","WeekOfYear","Year"
]]
y = df["Growth"]
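The growth calculation can be checked by hand on a tiny example. The sketch below uses made-up numbers and a lag of one row instead of seven so the arithmetic is easy to follow; `toy`, `lag`, and `clean` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Hypothetical sales for two stores, three periods each
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Sales": [100.0, 110.0, 99.0, 0.0, 50.0, 75.0],
})

lag = 1  # the article uses 7 (one week of daily rows); 1 keeps the example short
toy["Sales_prev"] = toy.groupby("Store")["Sales"].shift(lag)
toy["Growth"] = (toy["Sales"] - toy["Sales_prev"]) / toy["Sales_prev"]

# Store 2 starts at zero sales, so its second row divides by zero and becomes inf
toy["Growth"] = toy["Growth"].replace([np.inf, -np.inf], np.nan)
clean = toy.dropna(subset=["Growth"])
print(clean)
```

Grouping by store before shifting matters: without it, the first row of store 2 would be compared against the last row of store 1.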
4. Build Polynomial Regression Pipeline
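StandardScaler: standardises numeric inputs to zero mean and unit variance so the Ridge penalty treats them on a comparable scale. Scaling happens before the polynomial expansion, so the squared and interaction terms are built from the standardised values.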
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.compose import ColumnTransformer
# Separate numerical and categorical columns
num_cols = ["Sales_prev_week","CompetitionDistance","WeekOfYear","Year"]
cat_cols = ["Promo","Promo2","StateHoliday","SchoolHoliday"]
preprocessor = ColumnTransformer([
("num", StandardScaler(), num_cols),
("cat", OneHotEncoder(drop="first"), cat_cols)
])
pipe = Pipeline([
("prep", preprocessor),
("poly", PolynomialFeatures(include_bias=False)),
("ridge", Ridge(random_state=42))
])
5. Train/Test Split & Hyperparameter Search
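GridSearchCV: tunes polynomial degree (1–3) and regularisation strength α (seven log-spaced values from 10⁻³ to 10³) using 5-fold CV to minimise RMSE on growth-rate predictions. Note that train_test_split shuffles rows by default, so training and test sets mix weeks from the whole period; the resulting scores describe fit within the observed period and may be optimistic for true next-week forecasting.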
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
"poly__degree": [1, 2, 3],
"ridge__alpha": np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring="neg_root_mean_squared_error",
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
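The search reports scores as negative RMSE, because scikit-learn always maximises the scoring function. The sketch below shows how to read the result back as an RMSE; it runs on synthetic data with a known quadratic shape, and all names ending in `_syn` are illustrative rather than part of the article's pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y is quadratic in x plus a little noise
rng = np.random.default_rng(0)
x_syn = rng.uniform(-3, 3, size=(200, 1))
y_syn = x_syn[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

pipe_syn = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])
grid_syn = GridSearchCV(
    pipe_syn,
    {"poly__degree": [1, 2, 3], "ridge__alpha": np.logspace(-3, 3, 7)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid_syn.fit(x_syn, y_syn)

# best_score_ is the negated RMSE, so flip the sign to report it
best_rmse = -grid_syn.best_score_
print(grid_syn.best_params_, round(best_rmse, 3))
```

On this data a straight line cannot follow the curve, so the search should settle on degree 2 or 3; the full table of per-combination scores is available in `grid_syn.cv_results_`.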
6. Evaluate Model
y_pred = gs.predict(X_test)
# Take the square root explicitly: recent scikit-learn releases no longer accept squared=False
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE (growth rate): {rmse:.4f}")
print(f"Test R² : {r2:.3f}")
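To show how the two metrics behave, here is a small hand-checkable sketch with made-up growth values. It also scores a baseline that always predicts the mean, which is the reference point that R² is measured against.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted growth rates (fractions, not percentages)
y_true_toy = np.array([0.0, 0.1, 0.2, 0.3])
y_pred_toy = np.array([0.1, 0.1, 0.1, 0.3])

rmse_toy = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))
r2_toy = r2_score(y_true_toy, y_pred_toy)

# Baseline: always predicting the mean growth gives R² of zero by construction
baseline = np.full_like(y_true_toy, y_true_toy.mean())
r2_baseline = r2_score(y_true_toy, baseline)

print(f"RMSE={rmse_toy:.4f}  R2={r2_toy:.2f}")
```

Here the squared errors are 0.01, 0, 0.01 and 0, so the RMSE is sqrt(0.005), roughly 0.0707 in growth-rate units, and R² is 1 - 0.02/0.05 = 0.6. RMSE is on the same scale as the target, so it should be judged against the typical size of weekly growth.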
7. Inspect Key Polynomial Coefficients
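Coefficient inspection: shows which nonlinear or interaction features, such as squared prior sales or cross-terms with promo flags, carry the largest weights in the fitted model. Because polynomial terms are correlated with each other and Ridge shrinks them jointly, treat the ranking as a guide to what the model relies on (e.g. promo timing relative to past sales) and confirm any proposed lever separately before acting on it.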
# Retrieve polynomial feature names
poly = gs.best_estimator_.named_steps["poly"]
# Get names from the preprocessing step
prep = gs.best_estimator_.named_steps["prep"]
num_features = prep.transformers_[0][2]
cat_features = prep.named_transformers_["cat"].get_feature_names_out(cat_cols)
all_inputs = np.hstack([num_features, cat_features])
feat_names = poly.get_feature_names_out(input_features=all_inputs)
coefs = gs.best_estimator_.named_steps["ridge"].coef_
# Plot top 10 by absolute value (signs are dropped, so direction of effect is not shown)
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind="barh")
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Sales Growth")
plt.xlabel("Coefficient magnitude")
plt.tight_layout()
plt.show()
Summary
By integrating polynomial feature engineering with Ridge regularisation into a robust pipeline, we deliver:
- Nonlinear forecasts of retail sales growth, with accuracy quantified by held-out RMSE and R².
- Controlled model complexity via α tuning, avoiding spurious high-order effects.
- Interpretable insights: top polynomial features guide decisions on promotion timing, inventory allocation, and competitive positioning to maximise next-week sales growth.