Retail Profit Prediction using Stepwise Regression in ML

FREE Online Courses: Dive into Knowledge for Free. Learn More!

Retailers need to identify the factors that help them maximise profit, such as product category, discount levels, sales channel, and regional characteristics. In this project, we will examine transaction-level profits in a global superstore dataset using a linear regression model and a stepwise selection process to identify the most influential predictors. The resulting model will help retail managers find key levers (e.g., discount strategies or product mixes) to optimize profit margins.

Libraries Required

import pandas as pd               # Data manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Statistical modeling  
from sklearn.model_selection import train_test_split   # Data splitting  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation  
import matplotlib.pyplot as plt   # Visualization  

Dataset

Global Superstore Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load the Global Superstore CSV, which includes transaction‑level details such as Sales, Discount, and Profit, and examine its structure and summary statistics to understand the ranges and distributions. (Kaggle)

# Block 1: Load dataset
url = "https://www.kaggle.com/datasets/fatihilhan/global-superstore-dataset/download"
df = pd.read_csv(url)

# Inspect data
print(df.head())
print(df.info())
print(df.describe())

The Global Superstore dataset contains ~50,000 rows with fields such as Order Date, Region, Category, Sub‑Category, Sales, Discount, and Profit.

Data Preprocessing

Categorical features (Region, Category, Sub‑Category) are converted to dummy variables. We drop any incomplete records to ensure a clean modelling pipeline. The target variable, Profit, is separated from the predictors, and the dataset is split 80/20 into training and test sets.

# Block 2: One‑hot encode categorical variables
df_enc = pd.get_dummies(
    df,
    columns=["Region", "Category", "Sub-Category"],
    drop_first=True
)

# Drop rows with missing values (if any)
df_enc = df_enc.dropna()

# Define features and target
X = df_enc.drop("Profit", axis=1)
y = df_enc["Profit"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Stepwise Regression Function

We implement a hybrid forward-backwards procedure: in each iteration, we add the excluded predictor with the lowest p‑value below 0.01 and remove the included predictor with the highest p‑value above 0.05. Iteration stops when no further additions or removals occur.

# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=[],
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    included = list(initial_list)
    while True:
        changed = False

        # Forward step: add best candidate
        excluded = list(set(X.columns) - set(included))
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_feature = new_pvals.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print(f"Add  {best_feature:30} p-value {best_pval:.6f}")

        # Backward step: remove worst
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvalues = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvalues.max()
        if worst_pval > threshold_out:
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            changed = True
            if verbose:
                print(f"Drop {worst_feature:30} p-value {worst_pval:.6f}")

        if not changed:
            break

    return included

Model Building & Evaluation

  • Model Fitting: Using the selected subset of predictors, we fit an Ordinary Least Squares regression via statsmodels. The .summary() output provides coefficient estimates, standard errors, t‑statistics, p‑values, and goodness‑of‑fit metrics (R², F‑statistic).
  • Evaluation: We generate predictions on the held‑out test set and compute R² (explained variance) and RMSE (root‑mean‑square error) to assess model generalisation.
# Block 4: Feature selection
selected_vars = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_vars])
model = sm.OLS(y_train, X_train_sel).fit()

# Print regression summary
print(model.summary())

# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_vars])
y_pred = model.predict(X_test_sel)

# Compute metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

Residual Diagnostics

Plotting residuals against predicted profits verifies homoscedasticity (constant variance) and checks for systematic patterns or outliers, validating linear regression assumptions.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Profit")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Profit")
plt.show()

Summary

Applying stepwise regression to the Global Superstore data yields a streamlined linear model that highlights the most influential drivers of transaction‑level profit—such as Discount, Sales, and specific product categories—while discarding redundant variables. The final model achieves strong explanatory power (high R²) and low prediction error (RMSE) on unseen data. Retail managers can leverage these insights to refine discount strategies, optimise product assortments, and allocate regional resources more effectively to maximise profitability.

Did you know we work 24x7 to provide you best tutorials
Please encourage us - write a review on Google | Facebook

ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for enginering students.

Leave a Reply

Your email address will not be published. Required fields are marked *