Urban Delivery Cost Prediction using Stepwise Regression in ML
City logistics operators need an accurate prediction of per‐delivery costs to set dynamic pricing, allocate driver resources, and optimize routing.
In this urban delivery cost prediction ML project, we will predict the delivery cost (USD) for individual urban shipments based on features such as trip distance, parcel weight, number of stops, traffic congestion level, vehicle type (bike vs. van), and time of day.
By applying stepwise regression, we’ll pinpoint the strongest cost drivers and build a compact linear model that balances interpretability with predictive accuracy—enabling logistics planners to budget and price urban deliveries more effectively.
Libraries Required
import pandas as pd  # Data loading & manipulation
import numpy as np  # Numerical operations
import statsmodels.api as sm  # Ordinary Least Squares regression
from sklearn.model_selection import train_test_split  # Train/test split
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics
import matplotlib.pyplot as plt  # Visualization
Dataset
Cost Prediction for a Logistics Company
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We import shipment‑level data—featuring distance, weight, stops, traffic level, vehicle type, pickup hour, and actual cost—and inspect its schema and summary statistics to understand distributions and identify missing values.
# Block 1: Load dataset
# Using the “Cost Prediction for Logistic Company” dataset as a proxy
df = pd.read_csv("train.csv")
# Inspect structure and summary statistics
print(df.head())
print(df.info())
print(df.describe())
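The inspection calls above show dtypes and summary statistics, but a per-column count of missing values is often easier to read. A minimal sketch, assuming a toy DataFrame that borrows the column names this tutorial assumes later (the values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy shipment data (hypothetical values) with a few gaps
toy = pd.DataFrame({
    "distance_km": [2.5, np.nan, 7.1, 4.0],
    "weight_kg": [1.0, 3.2, np.nan, np.nan],
    "cost_usd": [6.0, 9.5, 14.0, 8.2],
})

# Count missing entries per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Share of rows that a blanket dropna() would discard
rows_lost = 1 - len(toy.dropna()) / len(toy)
print(f"Rows lost to dropna(): {rows_lost:.0%}")  # Rows lost to dropna(): 75%
```

If the share of lost rows is large on the real data, imputing or dropping individual columns may be preferable to discarding whole records.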
Data Preprocessing
Categorical fields (traffic_level, vehicle_type) are one‑hot encoded. We drop incomplete records to ensure model integrity, then split into predictors (X) and response (y) for an 80/20 train/test partition.
# Block 2: Feature engineering & cleaning
# Assume columns: 'distance_km', 'weight_kg', 'num_stops',
# 'traffic_level' (Low/Medium/High), 'vehicle_type' (Bike/Van), 'pickup_hour', target 'cost_usd'
# One‑hot encode categorical variables
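# dtype=float keeps the dummy columns numeric so statsmodels accepts them
df_enc = pd.get_dummies(df,
                        columns=['traffic_level', 'vehicle_type'],
                        drop_first=True,
                        dtype=float)
# Drop any rows with missing values (dropna does not catch other invalid entries)
df_enc = df_enc.dropna()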
# Define predictors and target
X = df_enc.drop('cost_usd', axis=1)
y = df_enc['cost_usd']
# Split into training and test sets (80%/20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
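To see what one-hot encoding with drop_first=True does to the categorical fields, here is a small sketch on a toy DataFrame. The column names follow the assumptions stated in Block 2, and the values are hypothetical:

```python
import pandas as pd

# Toy data using the assumed column names (values are illustrative only)
toy = pd.DataFrame({
    "traffic_level": ["Low", "High", "Medium", "High"],
    "vehicle_type": ["Bike", "Van", "Van", "Bike"],
    "cost_usd": [5.0, 12.0, 9.0, 8.0],
})

# drop_first=True removes the alphabetically first category of each field,
# which becomes the baseline the regression coefficients are measured against
encoded = pd.get_dummies(toy, columns=["traffic_level", "vehicle_type"],
                         drop_first=True, dtype=float)
print(encoded.columns.tolist())
# ['cost_usd', 'traffic_level_Low', 'traffic_level_Medium', 'vehicle_type_Van']
```

Here "High" traffic and "Bike" are the dropped baselines, so a coefficient on vehicle_type_Van would be read as the cost difference of a van relative to a bike, all else equal.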
Stepwise Regression Function
The stepwise_selection function iteratively performs:
- Forward inclusion: adds the excluded predictor with the lowest p-value, provided it is below 0.01.
- Backward elimination: removes the included predictor with the highest p-value, provided it is above 0.05.
Iteration repeats until no further changes occur, yielding a concise set of statistically significant features.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting features; None avoids a shared mutable default argument
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # Forward step: test adding each excluded predictor
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()  # NaN when nothing is left to add
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add {best_var:25} p-value {best_pval:.4f}")
        # Backward step: test removing each included predictor
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals_incl = model.pvalues.iloc[1:]  # exclude intercept
            worst_pval = pvals_incl.max()
            if worst_pval > threshold_out:
                worst_var = pvals_incl.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
- Using statsmodels, we fit an Ordinary Least Squares regression on the selected variables.
- The printed .summary() provides coefficient estimates (cost impact per unit change), p-values (significance), R², and diagnostic statistics (AIC, F‑statistic), facilitating interpretation of each driver’s effect.
- Predictions on the held‑out test set yield R² (variance explained) and RMSE (root‑mean‑square error), quantifying out‑of‑sample performance.
# Block 4: Perform stepwise feature selection
selected_features = stepwise_selection(X_train, y_train)
# Fit the final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_features])
y_pred = model.predict(X_test_sel)
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
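For readers who want to see what the two metrics measure, the following sketch computes R² and RMSE by hand with NumPy on four made-up cost values. The numbers are hypothetical and chosen so the arithmetic is easy to follow:

```python
import numpy as np

# Hypothetical actual and predicted costs (USD)
toy_true = np.array([10.0, 12.0, 14.0, 16.0])
toy_pred = np.array([11.0, 12.0, 13.0, 17.0])

toy_resid = toy_true - toy_pred                   # [-1, 0, 1, -1]
sse = np.sum(toy_resid ** 2)                      # residual sum of squares = 3
sst = np.sum((toy_true - toy_true.mean()) ** 2)   # total sum of squares = 20

r2 = 1 - sse / sst                                # share of variance explained
rmse = np.sqrt(np.mean(toy_resid ** 2))           # typical error, in USD

print(f"R² = {r2:.3f}, RMSE = {rmse:.3f}")  # R² = 0.850, RMSE = 0.866
```

RMSE is expressed in the same unit as the target, so it can be read directly as a typical prediction error in dollars.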
Residual Diagnostics
- A residuals-vs-predicted plot checks for heteroscedasticity or systematic patterns.
- It also helps validate core linear regression assumptions and highlights any outliers or model deficiencies.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Delivery Cost")
plt.show()
Summary
Applying stepwise regression to urban delivery data isolates the key cost drivers—such as distance, weight, traffic level, number of stops, and vehicle type—while pruning redundant variables.
The final linear model is meant to balance interpretability (few, significant predictors) with predictive accuracy (judged by test-set R² and RMSE), giving logistics planners a transparent, data-driven tool to forecast delivery costs and optimize urban distribution strategies.