Retail Sales Volume Prediction using Linear Regression in ML

Retail chains track an avalanche of data—daily sales, promotions, holidays, weather, price changes—yet most stores still react after the numbers come in.

Our goal is to predict next-day sales volume (units sold) for every product-store combination using a transparent linear-regression baseline. A reliable next-day forecast lets inventory planners correct stock levels, avoid lost sales, and tighten cash flow before the nightly report lands.

Libraries Required

  • pandas # tidy data handling
  • numpy # fast maths
  • matplotlib.pyplot # sanity‑check visuals
  • seaborn # quick correlation plots (optional)
  • scikit‑learn # model, split, metrics
  • joblib # save the trained model

Dataset Link

Store Sales – Time Series Forecasting

Step by Step Code Implementation

Why linear regression?

Within normal operating ranges, price cuts, promotions, and calendar effects often have an almost linear first‑order impact on unit sales. Starting simple gives an interpretable yardstick before exploring richer algorithms.

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns                     # optional
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

Load & glimpse the data

df = pd.read_csv("store_sales/train.csv")
print(df.head())
df.info()                                 # prints its own summary and returns None

Basic cleaning

df = df.dropna()                          # pruning rows with missing cells
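
Dropping every row with a missing cell can quietly discard a large share of the data, so it is worth checking how much would be lost first. A minimal sketch on a small made-up frame (the values and the helper name `missing_report` are illustrative assumptions, not part of the dataset or of pandas):

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and shares, worst column first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "share": counts / len(frame),
    })
    return report.sort_values("missing", ascending=False)

# Illustrative toy data (hypothetical values)
toy = pd.DataFrame({
    "store_nbr": [1, 1, 2, 2],
    "onpromotion": [0, np.nan, 1, np.nan],
    "target": [3.0, 5.0, np.nan, 2.0],
})

report = missing_report(toy)
rows_kept = len(toy.dropna())
print(report)
print(f"rows kept after dropna: {rows_kept} of {len(toy)}")
```

If one column accounts for most of the gaps, filling or dropping that single column may preserve far more rows than a blanket dropna().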

Feature engineering

Shoppers behave differently on weekends, at month end, or during holiday run-ups. Deriving weekday, month, and ISO week captures much of that calendar pattern with a few lines of code.

1. Calendar features

df['date']        = pd.to_datetime(df['date'])
df['dayofweek']   = df['date'].dt.dayofweek          # 0‑Mon … 6‑Sun
df['month']       = df['date'].dt.month
df['year']        = df['date'].dt.year
df['weekofyear']  = df['date'].dt.isocalendar().week.astype(int)   # cast from nullable UInt32
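
The weekend and end-of-month effects mentioned above can also be exposed as explicit 0/1 flags, which a linear model can use more directly than a raw weekday number. A small sketch, assuming a `date` column; `add_calendar_flags`, `is_weekend`, and `is_month_end` are hypothetical names introduced here for illustration:

```python
import pandas as pd

def add_calendar_flags(frame: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Return a copy with 0/1 weekend and month-end indicator columns."""
    out = frame.copy()
    dates = pd.to_datetime(out[date_col])
    out["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)    # 5 = Sat, 6 = Sun
    out["is_month_end"] = dates.dt.is_month_end.astype(int)      # last calendar day
    return out

# Friday 30 June, Saturday 1 July, Monday 3 July 2017
toy = pd.DataFrame({"date": ["2017-06-30", "2017-07-01", "2017-07-03"]})
flags = add_calendar_flags(toy)
print(flags)
```

Whether such flags help is an empirical question; comparing MAE with and without them is a quick way to find out.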

2. Promotion flag is already present in the dataset

# rename for readability
df = df.rename(columns={'onpromotion': 'promo_flag'})

One‑hot IDs

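Treating an identifier such as store_nbr as a plain number would imply an ordering and spacing that do not exist. One-hot encoding turns the store and product identifiers into neutral binary flags, letting the model learn a separate offset for each one (relative to the reference level removed by drop_first).

num_features = ['promo_flag', 'dayofweek', 'month', 'weekofyear']
# Store Sales train.csv identifies products by `family` and names the target `sales`;
# the earlier Favorita Grocery Sales data uses `item_nbr` / `unit_sales` instead
cat_features = ['store_nbr', 'family']                # treat as categories

# one-hot encode categorical ids; columns= is needed because get_dummies
# leaves integer columns such as store_nbr untouched by default
df_enc = pd.get_dummies(df[cat_features], columns=cat_features,
                        prefix=cat_features, drop_first=True)

X = pd.concat([df[num_features], df_enc], axis=1)
y = df['sales']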
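
One detail worth checking when encoding numeric IDs: by default pd.get_dummies only converts object, string, and category columns, so an integer ID passes through unchanged unless it is named in columns=. A short sketch on toy data (the three store numbers are made up):

```python
import pandas as pd

toy = pd.DataFrame({"store_nbr": [1, 2, 3, 1]})

# Default behaviour: the integer column is left as it is
passthrough = pd.get_dummies(toy)

# Naming the column forces one-hot encoding; drop_first removes the
# first level (store 1), which becomes the reference category
encoded = pd.get_dummies(toy, columns=["store_nbr"], prefix=["store_nbr"],
                         drop_first=True)

print(list(passthrough.columns))
print(list(encoded.columns))
```

An alternative with the same effect is to cast the ID columns to the category dtype before calling get_dummies.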

Train‑test split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
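
A random split mixes future days into the training set, which can flatter a forecasting model. One common alternative is to hold out the most recent dates instead. The sketch below assumes the frame still has a `date` column; `time_split` is a hypothetical helper, not a scikit-learn function:

```python
import pandas as pd

def time_split(frame: pd.DataFrame, date_col: str = "date", test_frac: float = 0.2):
    """Split so that every test row is dated after every training row."""
    dates = pd.to_datetime(frame[date_col])
    unique_days = dates.drop_duplicates().sort_values()
    n_test_days = max(1, int(round(len(unique_days) * test_frac)))
    cutoff = unique_days.iloc[-n_test_days]       # first day of the test window
    train = frame[dates < cutoff]
    test = frame[dates >= cutoff]
    return train, test

# Toy frame: 10 consecutive days, 2 rows per day
toy = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=10, freq="D").repeat(2),
    "y": range(20),
})
train_df, test_df = time_split(toy)
print(len(train_df), len(test_df))
```

Splitting on whole days rather than on row counts keeps all rows from one date on the same side of the cutoff.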

Model training

linreg = LinearRegression(n_jobs=-1)
linreg.fit(X_train, y_train)

Evaluation

R² expresses how much variance our features explain; MAE expresses typical error in the same units buyers care about.

y_pred = linreg.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} units")
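
An MAE figure is easier to judge next to a naive reference. The sketch below uses made-up numbers (not results from the dataset) to compare a pretend model against a baseline that always predicts the training mean:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical values for illustration only
y_train_toy = np.array([2.0, 4.0, 6.0, 8.0])
y_true = np.array([3.0, 5.0, 7.0])
y_model = np.array([2.5, 5.5, 6.0])          # pretend model output

# Naive baseline: always predict the training mean
y_naive = np.full_like(y_true, y_train_toy.mean())

mae_model = mean_absolute_error(y_true, y_model)
mae_naive = mean_absolute_error(y_true, y_naive)
r2_model = r2_score(y_true, y_model)

print(f"model MAE : {mae_model:.3f}")
print(f"naive MAE : {mae_naive:.3f}")
print(f"model R²  : {r2_model:.3f}")
```

If the model's MAE is not clearly below the naive one, the features are adding little beyond the average sales level.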

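Coefficient insight

The largest positive and negative coefficients show which stores, items, or promotions the model associates with higher or lower demand, a useful cue for merchandising teams. Because of drop_first, each one-hot coefficient is an offset relative to the dropped reference store or item, and the values describe associations in the training data rather than causal effects.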

coef_df = (pd.DataFrame({
              'feature': X_train.columns,
              'coef'   : linreg.coef_
           })
           .sort_values('coef', ascending=False))

print(coef_df.head(10))      # top positive drivers
print(coef_df.tail(10))      # strongest negative drivers
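
To see how such coefficients read, here is a toy example in which the target is built as 10 + 5 × promo_flag + 3 × store_nbr_2, so the fitted values are known in advance. The data are invented for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy design: store 1 is the reference level (no column of its own)
X_toy = pd.DataFrame({
    "promo_flag":  [0, 1, 0, 1],
    "store_nbr_2": [0, 0, 1, 1],
})
y_toy = pd.Series([10.0, 15.0, 13.0, 18.0])   # 10 + 5*promo + 3*store_2

model = LinearRegression().fit(X_toy, y_toy)
coefs = pd.Series(model.coef_, index=X_toy.columns).sort_values(ascending=False)

print(coefs)
print(f"intercept: {model.intercept_:.2f}")
```

The intercept is the expected sales for the reference store without a promotion; each coefficient is the shift from that starting point.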

Persist the model

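joblib.dump pickles the fitted estimator, including its coefficients and, when the model was fitted on a DataFrame, the training column names (feature_names_in_ in recent scikit-learn versions). It does not store the encoding steps, so tomorrow's batch job must rebuild the same one-hot columns in the same order before calling joblib.load and predict. No retraining is needed.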

joblib.dump(linreg, "retail_sales_linreg.pkl")

Summary

This walkthrough shows how to distill raw point‑of‑sale logs into an actionable next‑day volume forecast with nothing fancier than linear regression. Even this lightweight model surfaces key levers—promotions, weekend patterns, seasonal cycles—while delivering a numeric margin of error that planners can fold into safety‑stock rules. Keep the cleaning and feature‑generation pipeline, swap in more expressive models (regularised regressors, gradient‑boosted trees, even deep nets) when you need tighter forecasts, and the insights gained here will remain your benchmark for judging real uplift.

ProjectGurukul Team
