Retail Inventory Cost Prediction using Stepwise Regression in ML

FREE Online Courses: Dive into Knowledge for Free. Learn More!

Effective inventory management is crucial for retailers to minimise holding costs and avoid stock‑outs. Here, we aim to predict the inventory holding cost per SKU based on item attributes (unit cost, category, shelf life), sales velocity (average weekly units sold), and store location factors (store size, regional economic index). By employing stepwise regression, we will pinpoint the most influential drivers of inventory cost and build a lean linear model that balances interpretability with predictive accuracy—enabling supply‑chain managers to forecast holding expenses and optimise stocking strategies.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Data splitting  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

Retail Inventory Data

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load a combined retail‑inventory dataset that includes unit costs, categories, inventory levels, and inventory value per SKU. Initial calls to .info() and .describe() confirm the structure and ranges.

# Block 1: Load dataset  
# Retail Inventory Data – Mendeley (includes Inventory Value and Units) :contentReference[oaicite:0]{index=0}  
df = pd.read_csv("https://data.mendeley.com/public-files/datasets/9pn955p5vj/1/files/retail_inventory.csv")

print(df.head())  
print(df.info())  
print(df.describe())

Data Preprocessing

Compute a weekly holding cost per unit assuming a 20% annual holding rate.
Multiply by inventory units to derive total holding cost per SKU.
One‑hot encode the product category.
Define predictors (X), including cost, shelf life, sales rate, store size, region index, and category dummies; the response (y) is the total holding cost.
Split into training and test sets for unbiased evaluation.

# Block 2: Clean & engineer features  
# Assume columns: 'SKU', 'Store_ID', 'Unit_Cost', 'Category',  
# 'Shelf_Life_Days', 'Avg_Weekly_Sold', 'Store_Size_sqft', 'Region_Index',  
# 'Inventory_Units', 'Inventory_Value'

# Compute holding cost per unit (e.g., 20% annual holding rate prorated to period)
df['Holding_Rate'] = 0.20 / 52  # 20% per year → weekly  
df['Holding_Cost_Per_Unit'] = df['Unit_Cost'] * df['Holding_Rate']

# Compute total holding cost per SKU
df['Holding_Cost'] = df['Inventory_Units'] * df['Holding_Cost_Per_Unit']

# One‑hot encode category
df_enc = pd.get_dummies(df, columns=['Category'], drop_first=True)

# Define predictors and target
X = df_enc[[
    'Unit_Cost','Shelf_Life_Days','Avg_Weekly_Sold',
    'Store_Size_sqft','Region_Index'
] + [col for col in df_enc.columns if col.startswith('Category_')]]
y = df_enc['Holding_Cost']

# Split data (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Stepwise Regression Function

We implement a hybrid forward-backwards algorithm:

Forward inclusion: add excluded predictors with p < 0.01.
Backward elimination: remove included predictors with p > 0.05. Iteration continues until no further variables qualify for addition or removal.

# Block 3: Forward–backward stepwise selection  
def stepwise_selection(X, y,  
                       initial_list=[],  
                       threshold_in=0.01,  
                       threshold_out=0.05,  
                       verbose=True):  
    included = list(initial_list)  
    while True:  
        changed = False  

        # Forward step  
        excluded = list(set(X.columns) - set(included))  
        new_pvals = pd.Series(index=excluded, dtype=float)  
        for col in excluded:  
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()  
            new_pvals[col] = model.pvalues[col]  
        best_pval = new_pvals.min()  
        if best_pval < threshold_in:  
            best_var = new_pvals.idxmin()  
            included.append(best_var)  
            changed = True  
            if verbose:  
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")  

        # Backward step  
        model = sm.OLS(y, sm.add_constant(X[included])).fit()  
        pvals = model.pvalues.iloc[1:]  # exclude intercept  
        worst_pval = pvals.max()  
        if worst_pval > threshold_out:  
            worst_var = pvals.idxmax()  
            included.remove(worst_var)  
            changed = True  
            if verbose:  
                print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")  

        if not changed:  
            break  

    return included

Model Building & Evaluation

Using the selected features, we fit an Ordinary Least Squares regression via statsmodels. The summary() output provides coefficient estimates (cost impact per unit change), p-values (significance), R², and diagnostic statistics (AIC and F-statistic).
Predictions on the held‑out test set yield R² (explained variance) and RMSE (prediction error), quantifying generalisation performance.

# Block 4: Feature selection  
selected = stepwise_selection(X_train, y_train)  

# Fit final model  
X_train_sel = sm.add_constant(X_train[selected])  
model = sm.OLS(y_train, X_train_sel).fit()  
print(model.summary())  

# Predict on test set  
X_test_sel = sm.add_constant(X_test[selected])  
y_pred = model.predict(X_test_sel)  

# Performance metrics  
print("Test R²:", r2_score(y_test, y_pred))  
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

Residual Diagnostics

A residual vs. predicted plot checks for heteroscedasticity or systematic patterns, validating linear model assumptions.

# Block 5: Residual plot  
residuals = y_test - y_pred  
plt.scatter(y_pred, residuals)  
plt.axhline(0, linestyle="--")  
plt.xlabel("Predicted Holding Cost")  
plt.ylabel("Residuals")  
plt.title("Residuals vs. Predicted Holding Cost")  
plt.show()

Summary

By applying stepwise regression to retail‑inventory data, we isolate the key cost drivers—such as unit cost, inventory turnover (as measured by average weekly sales), shelf life, and category effects—while pruning redundant variables. The resulting linear model balances transparency (few, statistically significant predictors) with strong predictive accuracy (high test‑set R², low RMSE), equipping supply‑chain managers with a reliable tool to forecast inventory holding costs and optimise stocking decisions.

Did you like our efforts? If Yes, please give ProjectGurukul 5 Stars on Google | Facebook