Retail Profit Prediction using Stepwise Regression in ML
FREE Online Courses: Dive into Knowledge for Free. Learn More!
Retailers need to identify the factors that help them maximise profit, such as product category, discount levels, sales channel, and regional characteristics. In this project, we will examine transaction-level profits in a global superstore dataset using a linear regression model and a stepwise selection process to identify the most influential predictors. The resulting model will help retail managers find key levers (e.g., discount strategies or product mixes) to optimize profit margins.
Libraries Required
import pandas as pd # Data manipulation import numpy as np # Numerical operations import statsmodels.api as sm # Statistical modeling from sklearn.model_selection import train_test_split # Data splitting from sklearn.metrics import r2_score, mean_squared_error # Evaluation import matplotlib.pyplot as plt # Visualization
Dataset
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load the Global Superstore CSV, which includes transaction‑level details such as Sales, Discount, and Profit, and examine its structure and summary statistics to understand the ranges and distributions. (Kaggle)
# Block 1: Load dataset url = "https://www.kaggle.com/datasets/fatihilhan/global-superstore-dataset/download" df = pd.read_csv(url) # Inspect data print(df.head()) print(df.info()) print(df.describe())
The Global Superstore dataset contains ~50,000 rows with fields such as Order Date, Region, Category, Sub‑Category, Sales, Discount, and Profit.
Data Preprocessing
Categorical features (Region, Category, Sub‑Category) are converted to dummy variables. We drop any incomplete records to ensure a clean modelling pipeline. The target variable, Profit, is separated from the predictors, and the dataset is split 80/20 into training and test sets.
# Block 2: One‑hot encode categorical variables
df_enc = pd.get_dummies(
df,
columns=["Region", "Category", "Sub-Category"],
drop_first=True
)
# Drop rows with missing values (if any)
df_enc = df_enc.dropna()
# Define features and target
X = df_enc.drop("Profit", axis=1)
y = df_enc["Profit"]
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
Stepwise Regression Function
We implement a hybrid forward-backwards procedure: in each iteration, we add the excluded predictor with the lowest p‑value below 0.01 and remove the included predictor with the highest p‑value above 0.05. Iteration stops when no further additions or removals occur.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
initial_list=[],
threshold_in=0.01,
threshold_out=0.05,
verbose=True):
included = list(initial_list)
while True:
changed = False
# Forward step: add best candidate
excluded = list(set(X.columns) - set(included))
new_pvals = pd.Series(index=excluded, dtype=float)
for col in excluded:
model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
new_pvals[col] = model.pvalues[col]
best_pval = new_pvals.min()
if best_pval < threshold_in:
best_feature = new_pvals.idxmin()
included.append(best_feature)
changed = True
if verbose:
print(f"Add {best_feature:30} p-value {best_pval:.6f}")
# Backward step: remove worst
model = sm.OLS(y, sm.add_constant(X[included])).fit()
pvalues = model.pvalues.iloc[1:] # exclude intercept
worst_pval = pvalues.max()
if worst_pval > threshold_out:
worst_feature = pvalues.idxmax()
included.remove(worst_feature)
changed = True
if verbose:
print(f"Drop {worst_feature:30} p-value {worst_pval:.6f}")
if not changed:
break
return included
Model Building & Evaluation
- Model Fitting: Using the selected subset of predictors, we fit an Ordinary Least Squares regression via statsmodels. The .summary() output provides coefficient estimates, standard errors, t‑statistics, p‑values, and goodness‑of‑fit metrics (R², F‑statistic).
- Evaluation: We generate predictions on the held‑out test set and compute R² (explained variance) and RMSE (root‑mean‑square error) to assess model generalisation.
# Block 4: Feature selection
selected_vars = stepwise_selection(X_train, y_train)
# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_vars])
model = sm.OLS(y_train, X_train_sel).fit()
# Print regression summary
print(model.summary())
# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_vars])
y_pred = model.predict(X_test_sel)
# Compute metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
Residual Diagnostics
Plotting residuals against predicted profits verifies homoscedasticity (constant variance) and checks for systematic patterns or outliers, validating linear regression assumptions.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Profit")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Profit")
plt.show()
Summary
Applying stepwise regression to the Global Superstore data yields a streamlined linear model that highlights the most influential drivers of transaction‑level profit—such as Discount, Sales, and specific product categories—while discarding redundant variables. The final model achieves strong explanatory power (high R²) and low prediction error (RMSE) on unseen data. Retail managers can leverage these insights to refine discount strategies, optimise product assortments, and allocate regional resources more effectively to maximise profitability.