Logistics Delivery Cost Prediction using Stepwise Regression in ML
Logistics providers must estimate delivery costs accurately for every shipment in order to quote customers reliably, allocate resources, and maintain healthy margins. These costs depend on shipment attributes (distance, weight, volume), vehicle type, service level (standard vs. expedited), and temporal factors (day of week, peak season).
In this project, we predict the delivery cost of individual freight shipments by fitting a linear regression model with stepwise feature selection. The procedure identifies the most impactful variables and yields an interpretable model that helps operations teams budget and price deliveries effectively.
Libraries Required
import pandas as pd                                          # Data loading & manipulation
import numpy as np                                           # Numerical operations
import statsmodels.api as sm                                 # OLS regression
from sklearn.model_selection import train_test_split         # Train/test split
from sklearn.metrics import r2_score, mean_squared_error     # Evaluation metrics
import matplotlib.pyplot as plt                              # Visualization
Dataset
Cost Prediction for a Logistics Company
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load the logistics cost dataset containing shipment-level features and inspect its structure to understand variable types and distributions.
# Block 1: Load dataset
# Kaggle Competition: Cost Prediction for Logistic Company
df = pd.read_csv("train.csv") # downloaded via Kaggle API or local path
# Inspect basic structure
print(df.head())
df.info()  # prints its own report; wrapping it in print() would add a stray "None"
print(df.describe())
Data Preprocessing
- Parse Shipment_Date as a datetime and extract DayOfWeek to capture temporal patterns.
- One‑hot encode Vehicle_Type and Service_Level.
- Drop the original date and any incomplete records.
- Split into predictors (X) and response (y), then into training and test sets.
# Block 2: Clean & encode
# Assume columns: 'Distance_km', 'Weight_kg', 'Volume_m3',
# 'Vehicle_Type', 'Service_Level', 'Shipment_Date', target 'Cost_USD'
# Extract day-of-week to capture temporal effects
df['Shipment_Date'] = pd.to_datetime(df['Shipment_Date'])
df['DayOfWeek'] = df['Shipment_Date'].dt.weekday
# One‑hot encode categorical variables
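# dtype=float keeps the dummy columns numeric, so statsmodels does not receive an object-typed matrix
df_enc = pd.get_dummies(df,
                        columns=['Vehicle_Type', 'Service_Level'],
                        drop_first=True,
                        dtype=float)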
# Drop unused columns and any rows with missing values
df_enc = df_enc.drop(columns=['Shipment_Date']).dropna()
# Define predictors and target
X = df_enc.drop('Cost_USD', axis=1)
y = df_enc['Cost_USD']
# Train–test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
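To make the encoding steps concrete, here is a minimal sketch on a three-row toy frame. The values are invented for illustration; only the column names follow the schema assumed above, and dtype=float is an added choice so that the dummy columns come out numeric.

```python
import pandas as pd

# Toy shipments (invented values, not taken from the Kaggle data)
toy = pd.DataFrame({
    "Distance_km": [120.0, 45.0, 300.0],
    "Vehicle_Type": ["Truck", "Van", "Bike"],
    "Shipment_Date": ["2024-01-01", "2024-01-06", "2024-01-07"],
})

# Monday = 0 ... Sunday = 6
toy["Shipment_Date"] = pd.to_datetime(toy["Shipment_Date"])
toy["DayOfWeek"] = toy["Shipment_Date"].dt.weekday

# drop_first=True drops the alphabetically first level ("Bike"),
# which becomes the baseline the remaining levels are compared against
toy_enc = pd.get_dummies(toy, columns=["Vehicle_Type"],
                         drop_first=True, dtype=float)
toy_enc = toy_enc.drop(columns=["Shipment_Date"])

print(toy_enc)
```

Dropping one level per categorical variable avoids perfect collinearity with the intercept, which matters for the OLS fits used later.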
Stepwise Regression Function
The stepwise_selection function performs a hybrid forward‑backward algorithm:
- Forward inclusion: adds the excluded predictor with the lowest p‑value below 0.01.
- Backward elimination: removes the included predictor with the highest p‑value above 0.05.
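The loop repeats until no variable meets the entry or removal criterion, yielding a compact set of significant cost drivers. Keeping the entry threshold stricter than the removal threshold prevents a variable from being added and dropped in an endless cycle.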
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # threshold_in must stay below threshold_out, or the loop may never terminate
    included = list(initial_list or [])
    while True:
        changed = False
        # Forward step: consider adding each excluded predictor
        excluded = list(set(X.columns) - set(included))
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add {best_var:30} p-value {best_pval:.4f}")
        # Backward step: consider removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
- Using statsmodels, we fit an Ordinary Least Squares regression on the selected features and review the coefficient estimates, p-values, R², and other diagnostics to interpret each variable's impact.
- We predict costs on the held-out test set and compute R² (the proportion of variance explained) and RMSE (the typical prediction error, in USD), quantifying out-of-sample performance.
# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)
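# Fit final OLS model
# has_constant='add' forces an intercept column even if a selected feature is constant within a split
X_train_sel = sm.add_constant(X_train[selected_features], has_constant='add')
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')
y_pred = model.predict(X_test_sel)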
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
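For readers who want to see what the two metrics measure, here is a small sketch that computes them by hand with NumPy. The four observed and predicted values are invented for the arithmetic and do not come from the dataset.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # invented observed costs
y_hat = np.array([2.0, 5.0, 8.0, 9.0])   # invented predictions

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares: 1 + 0 + 1 + 0 = 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares: 9 + 1 + 1 + 9 = 20

r2 = 1 - ss_res / ss_tot               # 1 - 2/20 = 0.9
rmse = np.sqrt(ss_res / len(y_true))   # sqrt(2/4), about 0.707

print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

R² compares the model's squared error with that of always predicting the mean, while RMSE is expressed in the same units as the target, which makes it easier to relate to a quoted price.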
Residual Diagnostics
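Plotting residuals against predicted values reveals systematic patterns or heteroscedasticity (non-constant error variance). A structureless band around zero is consistent with the linearity and constant-variance assumptions of linear regression, whereas curvature or a funnel shape signals that they are violated.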
# Block 5: Residual plot to check linear assumptions
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Delivery Cost")
plt.show()
Summary
Applying stepwise regression to freight cost data isolates the statistically significant predictors, which may include distance, shipment weight, vehicle type, service level, and day-of-week effects, while pruning redundant variables.
The resulting linear model pairs interpretability (clear coefficient estimates and significance tests) with predictive accuracy that is measured on held-out data through R² and RMSE, giving logistics planners a transparent, data-driven tool to forecast delivery costs and inform pricing and routing decisions.