Logistics Delivery Cost Prediction using Stepwise Regression in ML

Logistics providers must estimate delivery costs accurately for every shipment in order to quote customers reliably, allocate resources, and maintain healthy margins. These costs depend on shipment attributes (distance, weight, volume), vehicle type, service level (standard vs. expedited), and temporal factors (day of week, peak season).

In this project, we predict the delivery cost of individual freight shipments by fitting a linear regression model with stepwise feature selection. The procedure identifies the most impactful variables and yields an interpretable model that helps operations teams budget and price deliveries effectively.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

Cost Prediction for a Logistics Company

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load the logistics cost dataset containing shipment-level features and inspect its structure to understand variable types and distributions.

# Block 1: Load dataset
# Kaggle Competition: Cost Prediction for Logistic Company  
df = pd.read_csv("train.csv")  # downloaded via Kaggle API or local path

# Inspect basic structure
print(df.head())
print(df.info())
print(df.describe())
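
Because incomplete records are dropped later, it can help to count missing values per column before going further. The following is a minimal sketch on a hypothetical toy frame; the column names and values are assumptions standing in for the real train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv (names and values are assumptions)
toy = pd.DataFrame({
    "Distance_km": [120.0, 45.5, np.nan, 300.0],
    "Weight_kg": [500.0, 120.0, 80.0, np.nan],
    "Vehicle_Type": ["Truck", "Van", "Van", None],
    "Cost_USD": [210.0, 75.0, 60.0, 480.0],
})

# Count and share of missing values per column
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean().round(2)
print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))

# Number of rows that would survive a plain dropna()
rows_complete = len(toy.dropna())
print(f"{rows_complete} of {len(toy)} rows are complete")
```

If a large share of rows would be lost, imputing values may be preferable to dropping them.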

Data Preprocessing

Parse Shipment_Date as a datetime and extract DayOfWeek to capture temporal patterns.
One‑hot encode Vehicle_Type and Service_Level.
Drop the original date and any incomplete records.
Split into predictors (X) and response (y), then into training and test sets.

# Block 2: Clean & encode
# Assume columns: 'Distance_km', 'Weight_kg', 'Volume_m3',
# 'Vehicle_Type', 'Service_Level', 'Shipment_Date', target 'Cost_USD'

# Extract day-of-week to capture temporal effects
df['Shipment_Date'] = pd.to_datetime(df['Shipment_Date'])
df['DayOfWeek'] = df['Shipment_Date'].dt.weekday

# One‑hot encode categorical variables
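# dtype=float keeps the dummies numeric; recent pandas versions return
# boolean columns by default, which statsmodels OLS may reject
df_enc = pd.get_dummies(df,
                        columns=['Vehicle_Type','Service_Level'],
                        drop_first=True,
                        dtype=float)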

# Drop unused columns and any rows with missing values
df_enc = df_enc.drop(columns=['Shipment_Date']).dropna()

# Define predictors and target
X = df_enc.drop('Cost_USD', axis=1)
y = df_enc['Cost_USD']

# Train–test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
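
The block above keeps DayOfWeek as an integer from 0 to 6, so a linear model treats it as one numeric slope across the week. As a possible alternative, the weekday can be one-hot encoded so that each day gets its own coefficient. This is a sketch on toy data, not part of the original pipeline; the dates and costs are made up for illustration.

```python
import pandas as pd

# Toy data; column names follow the ones assumed above, values are invented
toy = pd.DataFrame({
    "Shipment_Date": ["2024-01-01", "2024-01-02", "2024-01-06", "2024-01-07"],
    "Cost_USD": [100.0, 110.0, 150.0, 155.0],
})
toy["Shipment_Date"] = pd.to_datetime(toy["Shipment_Date"])
toy["DayOfWeek"] = toy["Shipment_Date"].dt.weekday  # Monday=0 ... Sunday=6

# One dummy column per weekday present in the data; the first level is dropped
toy_enc = pd.get_dummies(toy, columns=["DayOfWeek"], drop_first=True, dtype=float)
print(toy_enc.drop(columns=["Shipment_Date"]))
```

The trade-off is more candidate predictors for the stepwise search, in exchange for not forcing a straight-line weekday effect.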

Stepwise Regression Function

The stepwise_selection function performs a hybrid forward‑backward algorithm:

Forward inclusion: adds the excluded predictor with the lowest p‑value below 0.01.
Backward elimination: removes the included predictor with the highest p‑value above 0.05.

The loop repeats until no variable meets the entry or removal criteria, yielding a succinct set of significant cost drivers.

# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Start from the given features, or from an empty model if none are given
    included = list(initial_list or [])
    while True:
        changed = False

        # Forward step: consider adding each excluded predictor
        excluded = list(set(X.columns) - set(included))
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.4f}")

        # Backward step: consider removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

Using statsmodels, we fit an Ordinary Least Squares regression on selected features and review coefficient estimates, p‑values, R², and other diagnostics to interpret each variable’s impact.
We then predict costs on the held‑out test set and compute R² (share of variance explained) and RMSE (typical prediction error, in cost units), quantifying out‑of‑sample performance.

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

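# Fit final OLS model (has_constant='add' always adds the intercept column)
X_train_sel = sm.add_constant(X_train[selected_features], has_constant='add')
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set, building the design matrix the same way as for training
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')
y_pred = model.predict(X_test_sel)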

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
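
For reference, both metrics can be computed by hand: R² is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. A small worked sketch with made-up numbers:

```python
import numpy as np

# Made-up actual and predicted costs, for illustration only
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(ss_res / len(y_true))

print(f"R2 = {r2:.3f}, RMSE = {rmse:.1f}")  # R2 = 0.968, RMSE = 10.0
```

RMSE is expressed in the same units as the target (USD here), which makes it easier to relate to pricing decisions than R² alone.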

Residual Diagnostics

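Plotting residuals against predicted values reveals curvature or heteroscedasticity (error spread that changes with the predicted cost), either of which would violate core assumptions of linear regression and weaken the reliability of the p-values used for selection.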

# Block 5: Residual plot to check linear assumptions
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Delivery Cost")
plt.show()

Summary

Applying stepwise regression to freight cost data isolates the predictors that pass the significance thresholds, which may include distance, shipment weight, vehicle type, service level, and day‑of‑week effects, while pruning variables that add little.

The resulting linear model pairs interpretability (clear coefficient estimates and significance tests) with a measurable check on predictive accuracy (test-set R² and RMSE), giving logistics planners a transparent, data‑driven tool to forecast delivery costs and inform pricing and routing decisions.
