Retail Logistics Cost Prediction using Stepwise Regression in ML

Accurately predicting per‑shipment logistics costs is important for retail chains to set pricing, control margins, and optimize routing.

In this retail logistics cost prediction project, we’ll predict the delivery cost (Cost_USD) for individual retail shipments based on features such as distance, weight, volume, vehicle type, service level (standard vs. expedited), and day of the week.

We’ll employ stepwise linear regression to select the most significant predictors and build an interpretable model, helping operations teams budget and price deliveries more precisely.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Ordinary Least Squares regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

Kaggle competition: Cost Prediction for Logistic Company

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load shipment‑level data—distance, weight, volume, vehicle type, service level, shipment date, and actual cost—and inspect its structure to understand distributions and missingness.

# Block 1: Load dataset
# Kaggle competition: Cost Prediction for Logistic Company
df = pd.read_csv("train.csv")

# Inspect the data
print(df.head())
df.info()  # info() prints its report directly and returns None
print(df.describe())

Data Preprocessing

We extract the weekday from the shipment date to capture temporal effects. Categorical fields (Vehicle_Type, Service_Level) are one‑hot encoded.
We drop the original date column and any incomplete records, then split into predictors (X) and response (y) for an 80/20 train/test split.

# Block 2: Feature engineering & cleaning
# Assume columns: 'Distance_km', 'Weight_kg', 'Volume_m3', 'Vehicle_Type', 'Service_Level', 'Shipment_Date', 'Cost_USD'

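# Convert date to day-of-week (0 = Monday, ..., 6 = Sunday)
df['Shipment_Date'] = pd.to_datetime(df['Shipment_Date'])
df['DayOfWeek'] = df['Shipment_Date'].dt.weekday

# One-hot encode categorical variables
# dtype=float keeps the dummies numeric; pandas 2.x defaults to bool,
# which can leave statsmodels with an object-dtype design matrix
df_enc = pd.get_dummies(df, columns=['Vehicle_Type', 'Service_Level'],
                        drop_first=True, dtype=float)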

# Drop unused columns & missing rows
df_enc = df_enc.drop(columns=['Shipment_Date']).dropna()

# Define predictors and target
X = df_enc.drop('Cost_USD', axis=1)
y = df_enc['Cost_USD']

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
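To make the encoding step concrete, here is a minimal sketch on a made-up three-row sample. The rows and values are hypothetical; only the column names follow the assumed schema above. It shows how dt.weekday codes the days and which dummy column survives drop_first=True.

```python
import pandas as pd

# Hypothetical three-row sample using the assumed column names
toy = pd.DataFrame({
    "Shipment_Date": ["2024-01-01", "2024-01-06", "2024-01-07"],
    "Vehicle_Type": ["Van", "Truck", "Van"],
    "Cost_USD": [12.5, 40.0, 15.0],
})

# dt.weekday codes Monday as 0 and Sunday as 6
toy["DayOfWeek"] = pd.to_datetime(toy["Shipment_Date"]).dt.weekday

# drop_first=True drops the first level in sorted order ("Truck"),
# which then serves as the baseline category
toy_enc = pd.get_dummies(toy, columns=["Vehicle_Type"],
                         drop_first=True, dtype=float)

print(toy["DayOfWeek"].tolist())
print(toy_enc.columns.tolist())
```

Because DayOfWeek stays a single integer column, the regression can only fit a straight-line trend from Monday to Sunday. Adding 'DayOfWeek' to the columns list of get_dummies would give each day its own coefficient; that is a modeling choice, not something the pipeline above does.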

Stepwise Regression Function

The stepwise_selection function implements a hybrid forward‑backward algorithm:

Forward inclusion: Adds the excluded feature with the smallest p‑value below 0.01.
Backward elimination: Removes the included feature with the largest p‑value above 0.05.

The process iterates until no further variables meet the entry or removal criteria, yielding a concise feature set in which every retained predictor has a p‑value of at most 0.05.

# Block 3: Forward–backward stepwise selection
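def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    """Select columns of X by forward-backward stepwise OLS on p-values."""
    # None instead of a mutable default list
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False

        # Forward step: try each excluded column, keeping the original column order
        excluded = [col for col in X.columns if col not in included]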
        pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best_pval = pvals.min()
        if best_pval < threshold_in:
            best_var = pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")

        # Backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals_in = model.pvalues.iloc[1:]  # skip the intercept term
        worst_pval = pvals_in.max()
        if worst_pval > threshold_out:
            worst_var = pvals_in.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

Using the selected predictors, we fit an Ordinary Least Squares regression with statsmodels.
The .summary() output provides coefficient estimates (cost impact per unit change), standard errors, p‑values (significance), R², and diagnostic statistics—enabling clear interpretation of each driver’s effect.
On the held‑out test set, we compute R² (explained variance) and RMSE (root‑mean‑square error) to quantify predictive accuracy on unseen data.

# Block 4: Feature selection
selected = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

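# Predict & evaluate
# has_constant='add' forces the intercept column even if a test column happens to be constant
X_test_sel = sm.add_constant(X_test[selected], has_constant='add')
y_pred = model.predict(X_test_sel)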

print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
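For readers who want to see what the two metrics measure, the sketch below computes them by hand on four made-up cost values (not taken from the dataset). R² is one minus the ratio of residual to total sum of squares, and RMSE is the square root of the mean squared residual.

```python
import numpy as np

# Made-up actual and predicted costs, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2 = 1 - ss_res / ss_tot                        # share of variance explained
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))  # typical error, in USD
print(r2, rmse)
```

RMSE is expressed in the same unit as the target, so it can be read directly as a typical per‑shipment error in dollars.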

Residual Diagnostics

A plot of residuals versus predicted costs checks for non‑random patterns or heteroscedasticity.
A patternless band around zero supports the core linear regression assumptions, while curvature or a funnel shape points to potential model improvements.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Delivery Cost")
plt.show()

Summary

By applying stepwise regression to retail logistics data, we isolate the most influential cost drivers—such as distance, weight, service level, and day‑of‑week effects—while pruning redundant features.

The resulting linear model favors interpretability: it keeps a small set of statistically significant predictors, and its predictive performance can be judged from the test‑set R² and RMSE reported above.

This gives logistics planners and pricing teams a transparent, data‑driven tool to forecast delivery costs and optimize network operations.
