Retail Inventory Cost Prediction using Stepwise Regression in ML
FREE Online Courses: Enroll Now, Thank us Later!
Effective inventory management is crucial for retailers to minimise holding costs and avoid stock‑outs. Here, we aim to predict the inventory holding cost per SKU based on item attributes (unit cost, category, shelf life), sales velocity (average weekly units sold), and store location factors (store size, regional economic index). By employing stepwise regression, we will pinpoint the most influential drivers of inventory cost and build a lean linear model that balances interpretability with predictive accuracy—enabling supply‑chain managers to forecast holding expenses and optimise stocking strategies.
Libraries Required
import pandas as pd # Data loading & manipulation import numpy as np # Numerical operations import statsmodels.api as sm # OLS regression from sklearn.model_selection import train_test_split # Data splitting from sklearn.metrics import r2_score, mean_squared_error # Evaluation metrics import matplotlib.pyplot as plt # Visualization
Dataset
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load a combined retail‑inventory dataset that includes unit costs, categories, inventory levels, and inventory value per SKU. Initial calls to .info() and .describe() confirm the structure and ranges.
# Block 1: Load dataset
# Retail Inventory Data – Mendeley (includes Inventory Value and Units) :contentReference[oaicite:0]{index=0}
df = pd.read_csv("https://data.mendeley.com/public-files/datasets/9pn955p5vj/1/files/retail_inventory.csv")
print(df.head())
print(df.info())
print(df.describe())
Data Preprocessing
- Compute a weekly holding cost per unit assuming a 20% annual holding rate.
- Multiply by inventory units to derive total holding cost per SKU.
- One‑hot encode the product category.
- Define predictors (X), including cost, shelf life, sales rate, store size, region index, and category dummies; the response (y) is the total holding cost.
- Split into training and test sets for unbiased evaluation.
# Block 2: Clean & engineer features
# Assume columns: 'SKU', 'Store_ID', 'Unit_Cost', 'Category',
# 'Shelf_Life_Days', 'Avg_Weekly_Sold', 'Store_Size_sqft', 'Region_Index',
# 'Inventory_Units', 'Inventory_Value'
# Compute holding cost per unit (e.g., 20% annual holding rate prorated to period)
df['Holding_Rate'] = 0.20 / 52 # 20% per year → weekly
df['Holding_Cost_Per_Unit'] = df['Unit_Cost'] * df['Holding_Rate']
# Compute total holding cost per SKU
df['Holding_Cost'] = df['Inventory_Units'] * df['Holding_Cost_Per_Unit']
# One‑hot encode category
df_enc = pd.get_dummies(df, columns=['Category'], drop_first=True)
# Define predictors and target
X = df_enc[[
'Unit_Cost','Shelf_Life_Days','Avg_Weekly_Sold',
'Store_Size_sqft','Region_Index'
] + [col for col in df_enc.columns if col.startswith('Category_')]]
y = df_enc['Holding_Cost']
# Split data (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
Stepwise Regression Function
We implement a hybrid forward-backwards algorithm:
- Forward inclusion: add excluded predictors with p < 0.01.
- Backward elimination: remove included predictors with p > 0.05. Iteration continues until no further variables qualify for addition or removal.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
initial_list=[],
threshold_in=0.01,
threshold_out=0.05,
verbose=True):
included = list(initial_list)
while True:
changed = False
# Forward step
excluded = list(set(X.columns) - set(included))
new_pvals = pd.Series(index=excluded, dtype=float)
for col in excluded:
model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
new_pvals[col] = model.pvalues[col]
best_pval = new_pvals.min()
if best_pval < threshold_in:
best_var = new_pvals.idxmin()
included.append(best_var)
changed = True
if verbose:
print(f"Add {best_var:25} p-value {best_pval:.4f}")
# Backward step
model = sm.OLS(y, sm.add_constant(X[included])).fit()
pvals = model.pvalues.iloc[1:] # exclude intercept
worst_pval = pvals.max()
if worst_pval > threshold_out:
worst_var = pvals.idxmax()
included.remove(worst_var)
changed = True
if verbose:
print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")
if not changed:
break
return included
Model Building & Evaluation
- Using the selected features, we fit an Ordinary Least Squares regression via statsmodels. The summary() output provides coefficient estimates (cost impact per unit change), p-values (significance), R², and diagnostic statistics (AIC and F-statistic).
- Predictions on the held‑out test set yield R² (explained variance) and RMSE (prediction error), quantifying generalisation performance.
# Block 4: Feature selection
selected = stepwise_selection(X_train, y_train)
# Fit final model
X_train_sel = sm.add_constant(X_train[selected])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set
X_test_sel = sm.add_constant(X_test[selected])
y_pred = model.predict(X_test_sel)
# Performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
Residual Diagnostics
A residual vs. predicted plot checks for heteroscedasticity or systematic patterns, validating linear model assumptions.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Holding Cost")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Holding Cost")
plt.show()
Summary
By applying stepwise regression to retail‑inventory data, we isolate the key cost drivers—such as unit cost, inventory turnover (as measured by average weekly sales), shelf life, and category effects—while pruning redundant variables. The resulting linear model balances transparency (few, statistically significant predictors) with strong predictive accuracy (high test‑set R², low RMSE), equipping supply‑chain managers with a reliable tool to forecast inventory holding costs and optimise stocking decisions.