Retail Sales Volume Prediction using Linear Regression in ML
Retail chains track an avalanche of data—daily sales, promotions, holidays, weather, price changes—yet most stores still react after the numbers come in.
Our goal is to predict next-day sales volume (units sold) for every product-store combination using a transparent linear-regression baseline. A reliable next-day forecast lets inventory planners correct stock levels, avoid lost sales, and tighten cashflow long before the nightly report lands.
Libraries Required
- pandas # tidy data handling
- numpy # fast maths
- matplotlib.pyplot # sanity‑check visuals
- seaborn # quick correlation plots (optional)
- scikit‑learn # model, split, metrics
- joblib # save the trained model
Dataset Link
Store Sales – Time Series Forecasting
Step by Step Code Implementation
Why linear regression?
Within normal operating ranges, price cuts, promotions, and calendar effects often have an almost linear first‑order impact on unit sales. Starting simple gives an interpretable yardstick before exploring richer algorithms.
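Written out, the baseline built in the steps below has roughly the following form. The symbols are illustrative assumptions rather than notation from the dataset: s indexes the store, i the item, t the day, and the last two terms are the per-store and per-item offsets that one-hot encoding produces.

```latex
\hat{y}_{s,i,t} = \beta_0
  + \beta_{\text{promo}}\,\text{promo}_{s,i,t}
  + \beta_{\text{dow}}\,\text{dayofweek}_t
  + \beta_{\text{month}}\,\text{month}_t
  + \beta_{\text{week}}\,\text{weekofyear}_t
  + \alpha_s + \gamma_i
```

Each coefficient is the expected change in unit sales for a one-unit change in its feature with the others held fixed, which is what makes the model easy to read.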
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # optional
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
Load & glimpse the data
df = pd.read_csv("store_sales/train.csv")
print(df.head())
print(df.info())
Basic cleaning
df = df.dropna() # pruning rows with missing cells
Feature engineering
Shoppers behave differently on weekends, end‑of‑month, or during holiday run‑ups. Deriving weekday, month, and ISO week captures that cyclic character in seconds.
1. Calendar features
df['date'] = pd.to_datetime(df['date'])
df['dayofweek'] = df['date'].dt.dayofweek  # 0 = Mon ... 6 = Sun
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
# isocalendar() returns a nullable UInt32 column; cast to a plain int
df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
2. Promotion flag is already present in the dataset
# (rename for readability)
df = df.rename(columns={'onpromotion': 'promo_flag'})
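The weekend and end-of-month effects mentioned above can also be made explicit as binary flags. The following is a minimal sketch on a toy frame; the column names is_weekend and is_month_end are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data; only a 'date' column is needed.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-06-30", "2017-07-01", "2017-07-03"])
})

# Saturday (5) and Sunday (6) count as the weekend.
toy["is_weekend"] = (toy["date"].dt.dayofweek >= 5).astype(int)

# 1 on the last calendar day of a month, 0 otherwise.
toy["is_month_end"] = toy["date"].dt.is_month_end.astype(int)

print(toy)
```

If such flags prove useful, they would be added to the numeric feature list alongside dayofweek and month.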
One‑hot IDs
Treating store_nbr and item_nbr as plain numbers would imply an ordering and spacing that does not exist. One-hot encoding turns them into binary flags, so the model learns a separate offset for each store and item, measured relative to the first category that drop_first=True leaves out.
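# If your train.csv names these columns differently (for example family
# and sales), adjust cat_features and the target below accordingly.
num_features = ['promo_flag', 'dayofweek', 'month', 'weekofyear']
cat_features = ['store_nbr', 'item_nbr']  # treat as categories

# one-hot encode categorical ids; columns= is required because pandas
# only auto-encodes object/string/category columns, not integer ids
df_enc = pd.get_dummies(df[cat_features], columns=cat_features,
                        prefix=cat_features, drop_first=True)
X = pd.concat([df[num_features], df_enc], axis=1)
y = df['unit_sales']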
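One detail worth checking on a small example: pd.get_dummies only encodes object, string, or category columns by default, so integer IDs need to be named through columns= (or cast to category first). A minimal sketch on made-up IDs:

```python
import pandas as pd

# Toy integer IDs; the values are invented for illustration.
toy = pd.DataFrame({
    "store_nbr": [1, 2, 2, 3],
    "item_nbr": [101, 101, 205, 205],
})

# Naming the columns forces encoding even though their dtype is int.
# drop_first=True removes the first level of each ID to avoid redundancy.
encoded = pd.get_dummies(
    toy, columns=["store_nbr", "item_nbr"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
```

The dropped levels (store 1 and item 101 here) become the baseline that the remaining flags are compared against.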
Train‑test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
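A random split mixes past and future days, which can make a forecasting model look better than it would in production. One alternative, sketched here on toy data as an assumption rather than as part of the original walkthrough, is to hold out the most recent dates instead. The 80/20 cutoff mirrors test_size=0.2.

```python
import pandas as pd

# Toy frame; in the tutorial this would be df with its 'date' column.
toy = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=10, freq="D"),
    "unit_sales": range(10),
})

# Pick the cutoff from the sorted unique dates so that every row of a
# given day lands on the same side of the split.
dates = toy["date"].drop_duplicates().sort_values().reset_index(drop=True)
cutoff = dates.iloc[int(len(dates) * 0.8)]

train = toy[toy["date"] < cutoff]   # older 80% of days
test = toy[toy["date"] >= cutoff]   # most recent 20% of days

print(len(train), len(test))
```

The same boolean masks can be applied to X and y to obtain X_train, X_test, y_train, and y_test.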
Model training
linreg = LinearRegression(n_jobs=-1)
linreg.fit(X_train, y_train)
Evaluation
R² expresses how much variance our features explain; MAE expresses typical error in the same units buyers care about.
y_pred = linreg.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} units")
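To make the two metrics concrete, they can be computed by hand on a few invented numbers and compared with scikit-learn's results. This is only an illustration; the values are not from the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Invented actual and predicted unit sales for four product-days.
y_true = np.array([10.0, 12.0, 8.0, 15.0])
y_hat = np.array([11.0, 11.0, 9.0, 13.0])

# MAE: average absolute miss, in units sold.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

An R² near 1 means the features explain most of the variation; an R² at or below 0 means the model does no better than always predicting the mean.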
Coefficient insight
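The largest positive and negative coefficients indicate which stores, items, or promotions are associated with higher or lower unit sales, a useful starting cue for merchandising teams. Because the first category of each ID is dropped during encoding, store and item coefficients are offsets relative to that baseline rather than absolute sales levels.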
coef_df = (
    pd.DataFrame({
        'feature': X_train.columns,
        'coef': linreg.coef_,
    })
    .sort_values('coef', ascending=False)
)
print(coef_df.head(10)) # top positive drivers
print(coef_df.tail(10)) # strongest negative drivers
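When reading one-hot coefficients, it helps to remember that each is a difference from the dropped baseline category. A small sketch with two invented stores shows this: the intercept equals the mean of the baseline store and the coefficient equals the gap between the two store means.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data: store 1 averages 11 units, store 2 averages 21 units.
toy = pd.DataFrame({
    "store_nbr": [1, 1, 2, 2],
    "unit_sales": [10.0, 12.0, 20.0, 22.0],
})

# drop_first=True removes store 1, making it the baseline.
X_toy = pd.get_dummies(
    toy[["store_nbr"]], columns=["store_nbr"], drop_first=True, dtype=int
)
model = LinearRegression().fit(X_toy, toy["unit_sales"])

print("baseline (store 1):", model.intercept_)
print("store 2 offset:", model.coef_[0])
```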
Persist the model
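joblib serialises the fitted estimator, including its coefficients and, when the model was fit on a DataFrame (scikit-learn 1.0 or later), the training column names in feature_names_in_. Tomorrow's batch job can call joblib.load and score fresh data without retraining, provided the new data is one-hot encoded and aligned to the same columns in the same order.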
joblib.dump(linreg, "retail_sales_linreg.pkl")
Summary
This walkthrough shows how to distill raw point‑of‑sale logs into an actionable next‑day volume forecast with nothing fancier than linear regression. Even this lightweight model surfaces key levers—promotions, weekend patterns, seasonal cycles—while delivering a numeric margin of error that planners can fold into safety‑stock rules. Keep the cleaning and feature‑generation pipeline, swap in more expressive models (regularised regressors, gradient‑boosted trees, even deep nets) when you need tighter forecasts, and the insights gained here will remain your benchmark for judging real uplift.