Urban Delivery Cost Prediction using Stepwise Regression in ML

City logistics operators need an accurate prediction of per‐delivery costs to set dynamic pricing, allocate driver resources, and optimize routing.

In this urban delivery cost prediction ML project, we will predict the delivery cost (USD) for individual urban shipments based on features such as trip distance, parcel weight, number of stops, traffic congestion level, vehicle type (bike vs. van), and time of day.

By applying stepwise regression, we’ll pinpoint the strongest cost drivers and build a compact linear model that balances interpretability with predictive accuracy—enabling logistics planners to budget and price urban deliveries more effectively.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Ordinary Least Squares regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

Cost Prediction for a Logistics Company

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We import shipment‑level data—featuring distance, weight, stops, traffic level, vehicle type, pickup hour, and actual cost—and inspect its schema and summary statistics to understand distributions and identify missing values.

# Block 1: Load dataset
# Using the “Cost Prediction for Logistic Company” dataset as a proxy   
df = pd.read_csv("train.csv")

# Inspect structure and summary statistics
print(df.head())
print(df.info())
print(df.describe())
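
To make the missing-value check explicit, a per-column count of nulls can be printed alongside the summaries above. The sketch below is a minimal illustration on a tiny hand-made frame: the column names follow the schema this tutorial assumes, and every value is invented for demonstration.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names mirror the assumed schema
toy = pd.DataFrame({
    "distance_km": [2.5, 7.1, np.nan, 4.0],
    "weight_kg": [1.0, 3.5, 2.2, np.nan],
    "cost_usd": [6.0, 12.5, 9.0, 8.2],
})

# Count missing entries per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Share of rows that a plain dropna() would discard
rows_lost = 1 - len(toy.dropna()) / len(toy)
print(f"Rows lost to dropna(): {rows_lost:.0%}")
```

Running the same two checks on df before preprocessing shows how much data the later dropna() call will remove.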

Data Preprocessing

Categorical fields (traffic_level, vehicle_type) are one‑hot encoded. We drop incomplete records to ensure model integrity, then split into predictors (X) and response (y) for an 80/20 train/test partition.

# Block 2: Feature engineering & cleaning
# Assume columns: 'distance_km', 'weight_kg', 'num_stops',
# 'traffic_level' (Low/Medium/High), 'vehicle_type' (Bike/Van), 'pickup_hour', target 'cost_usd'

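# One-hot encode categorical variables
# dtype=float keeps the dummy columns numeric; recent pandas versions return
# booleans by default, which statsmodels OLS rejects in a mixed-type frame
df_enc = pd.get_dummies(df, 
                        columns=['traffic_level', 'vehicle_type'], 
                        drop_first=True,
                        dtype=float)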

# Drop any rows with missing values (dropna does not catch other invalid entries)
df_enc = df_enc.dropna()

# Define predictors and target
X = df_enc.drop('cost_usd', axis=1)
y = df_enc['cost_usd']

# Split into training and test sets (80%/20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
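
As a small illustration of what drop_first=True does in the encoding step, the sketch below encodes a toy frame with invented rows. Each categorical field keeps one fewer indicator column than it has levels; the alphabetically first level becomes the baseline that the regression intercept absorbs, which avoids perfect collinearity with the constant term. The dtype=float argument is used here so the indicators are plain numeric columns.

```python
import pandas as pd

# Invented rows using the category labels assumed in this tutorial
toy = pd.DataFrame({
    "traffic_level": ["Low", "Medium", "High", "Low"],
    "vehicle_type": ["Bike", "Van", "Van", "Bike"],
})

encoded = pd.get_dummies(
    toy,
    columns=["traffic_level", "vehicle_type"],
    drop_first=True,   # drop the first (alphabetical) level of each field
    dtype=float,       # numeric 0.0/1.0 indicators
)
print(encoded.columns.tolist())
print(encoded)
```

Here "High" traffic and "Bike" are the dropped baselines, so the coefficients on the remaining indicators are read as cost differences relative to a bike delivery in high traffic.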

Stepwise Regression Function

The stepwise_selection function iteratively performs:

Forward inclusion: adds the excluded predictor with the lowest p-value below 0.01.
Backward elimination: removes the included predictor with the highest p-value above 0.05.

Iteration repeats until no further changes occur, yielding a concise set of statistically significant features.

# Block 3: Forward–backward stepwise selection
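def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Start from an empty model unless a starting set is supplied
    # (None avoids sharing a mutable default list between calls)
    included = list(initial_list) if initial_list is not None else []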
    while True:
        changed = False

        # Forward step: test adding each excluded predictor
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")

        # Backward step: test removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals_incl = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals_incl.max()
        if worst_pval > threshold_out:
            worst_var = pvals_incl.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

Using statsmodels, we fit an Ordinary Least Squares regression on the selected variables.
The printed .summary() provides coefficient estimates (cost impact per unit change), p-values (significance), R², and diagnostic statistics (AIC, F-statistic), facilitating interpretation of each driver’s effect.
Predictions on the held‑out test set yield R² (variance explained) and RMSE (root‑mean‑square error), quantifying out‑of‑sample performance.
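
To make "cost impact per unit change" concrete, here is a small worked example in plain Python. The coefficient values are hypothetical, invented purely for illustration; they are not results from the dataset. A linear model's prediction is the intercept plus each coefficient times its feature value, so raising one feature by one unit shifts the prediction by exactly that feature's coefficient.

```python
# Hypothetical coefficients, invented for illustration only
coefs = {"const": 2.50, "distance_km": 0.80, "weight_kg": 0.10, "num_stops": 1.20}

def predict_cost(shipment):
    """Intercept plus the sum of coefficient * feature value."""
    return coefs["const"] + sum(coefs[name] * value for name, value in shipment.items())

base = predict_cost({"distance_km": 5, "weight_kg": 3, "num_stops": 2})
longer = predict_cost({"distance_km": 6, "weight_kg": 3, "num_stops": 2})

print(f"Base cost: {base:.2f}")                   # 2.50 + 4.00 + 0.30 + 2.40
print(f"One more km adds: {longer - base:.2f}")   # equals the distance_km coefficient
```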

# Block 4: Perform stepwise feature selection
selected_features = stepwise_selection(X_train, y_train)

# Fit the final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

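# Predict on test set (has_constant="add" forces the intercept column even if
# a selected feature happens to be constant within the test split)
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")
y_pred = model.predict(X_test_sel)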

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
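
The two test metrics can also be reproduced by hand, which helps when reading them. The sketch below uses four invented cost values: R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean, and RMSE is the square root of the mean squared residual, expressed in the same units as the target (USD here).

```python
import numpy as np

# Invented actual and predicted costs for illustration
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 12.0, 13.0, 17.0])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares = 3
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 20

r2 = 1 - ss_res / ss_tot                         # 1 - 3/20 = 0.85
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))   # sqrt(3/4), about 0.866

print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```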

Residual Diagnostics

A residuals-vs-predicted plot checks for heteroscedasticity or systematic patterns, helping to validate core linear regression assumptions and to highlight any outliers or model deficiencies.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Delivery Cost")
plt.show()
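
As a rough numeric companion to the plot, one can check whether the size of the residuals grows with the predicted value. The sketch below is a crude heuristic on synthetic data, not a formal test and not part of the original workflow: spread_trend is a hypothetical helper that correlates the absolute residual with the prediction. A value near zero is consistent with constant spread, while a clearly positive value suggests the fan shape typical of heteroscedasticity.

```python
import numpy as np

rng = np.random.default_rng(1)
pred = rng.uniform(5, 50, 500)                     # stand-in predicted costs

resid_const = rng.normal(scale=2.0, size=500)      # constant spread
resid_fan = rng.normal(size=500) * 0.1 * pred      # spread grows with prediction

def spread_trend(predicted, residuals):
    """Pearson correlation between |residual| and predicted value."""
    return np.corrcoef(predicted, np.abs(residuals))[0, 1]

corr_const = spread_trend(pred, resid_const)
corr_fan = spread_trend(pred, resid_fan)

print(f"Constant spread: {corr_const:+.2f}")
print(f"Fan-shaped spread: {corr_fan:+.2f}")
```

With the tutorial's variables, the equivalent call would be spread_trend(y_pred, residuals).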

Summary

Applying stepwise regression to urban delivery data isolates the key cost drivers from candidates such as distance, weight, traffic level, number of stops, and vehicle type, while pruning redundant variables.

The final linear model aims to balance interpretability (few, significant predictors) with predictive accuracy (as measured by test-set R² and RMSE), giving logistics planners a transparent, data-driven tool to forecast delivery costs and optimize urban distribution strategies.
