Transport Logistics Cost Prediction using Stepwise Regression in ML
Logistics providers aim to predict the cost of each delivery to optimise routing, budgeting, and pricing strategies. In this project, we will predict trip costs for a logistics network using shipment attributes (distance, weight, volume), vehicle characteristics, and temporal factors (day of week, month). By applying stepwise regression, we’ll identify the strongest cost drivers and build an interpretable linear model that balances simplicity with predictive accuracy—helping operations teams make data‑driven decisions on rate setting and route planning.
Libraries Required
import pandas as pd                                         # Data manipulation
import numpy as np                                          # Numerical operations
import statsmodels.api as sm                                # Statistical modeling
from sklearn.model_selection import train_test_split        # Train/test split
from sklearn.metrics import r2_score, mean_squared_error    # Model evaluation
import matplotlib.pyplot as plt                             # Plotting residuals
Dataset
Cost Prediction for Logistic Company
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We import the competition’s training CSV and examine its structure, checking columns such as Distance, Weight, Vehicle_Type, Origin, Destination, and the target Cost.
# Block 1: Load dataset
# Competition page: Cost Prediction for Logistic Company
df = pd.read_csv("train.csv") # assume train.csv from competition download
print(df.head())
print(df.info())
print(df.describe())
Data Preprocessing
Categorical features (Vehicle_Type, Origin, Destination) are one‑hot encoded; missing values (if any) are dropped. We separate the predictors (X) from the response (y), then split the data into training and test sets (80/20).
# Block 2: Encode categoricals and clean
# Example categorical columns: Vehicle_Type, Origin, Destination
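df_enc = pd.get_dummies(df,
                        columns=["Vehicle_Type", "Origin", "Destination"],
                        drop_first=True,
                        dtype=float)  # numeric dummies; boolean dummy columns can make statsmodels reject the design matrix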
# Handle missing values (if any)
df_enc = df_enc.dropna()
# Define features and target
X = df_enc.drop("Cost", axis=1)
y = df_enc["Cost"]
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
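To make the encoding step concrete, here is a minimal sketch on a made-up four-row table. The rows and the `toy` / `toy_enc` names are hypothetical and not taken from the competition data. It shows that `drop_first=True` removes one level per categorical column, which becomes the baseline absorbed by the intercept and avoids perfect collinearity among the dummies.

```python
import pandas as pd

# Hypothetical toy shipments; values are invented purely for illustration.
toy = pd.DataFrame({
    "Distance": [120.0, 45.5, 300.0, 80.0],
    "Vehicle_Type": ["Truck", "Van", "Bike", "Van"],
})

# One-hot encode the categorical column. With drop_first=True the
# alphabetically first level ("Bike") is dropped and acts as the baseline.
toy_enc = pd.get_dummies(toy, columns=["Vehicle_Type"], drop_first=True, dtype=float)

print(toy_enc.columns.tolist())
# ['Distance', 'Vehicle_Type_Truck', 'Vehicle_Type_Van']
```

A "Bike" row is therefore represented by zeros in both dummy columns, and each dummy coefficient in the fitted model reads as a cost difference relative to that baseline.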
Stepwise Regression Function
The stepwise_selection function alternates between forward selection (adding the predictor with the lowest p‑value < 0.01) and backward elimination (removing the predictor with the highest p‑value > 0.05), iterating until no further changes occur. This yields a concise subset of cost drivers.
# Block 3: Forward‑backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    """Return the columns of X kept by p-value-based stepwise selection."""
    # Avoid a mutable default argument; start from an empty model by default
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # Forward step: fit one model per excluded feature and record its p-value
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_feature = new_pvals.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print(f"Add {best_feature:30} p-value {best_pval:.6f}")
        # Backward step: refit and drop the least significant included feature
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.iloc[1:]  # omit intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_feature = pvals.idxmax()
                included.remove(worst_feature)
                changed = True
                if verbose:
                    print(f"Drop {worst_feature:30} p-value {worst_pval:.6f}")
        if not changed:
            break
    return included
Model Building & Evaluation
- Model Fitting: With selected features, we fit an Ordinary Least Squares regression using statsmodels. The .summary() output shows coefficient estimates, p‑values, R², and diagnostic statistics.
- Evaluation: We predict on unseen test data and compute R² (variance explained) and RMSE (root‑mean‑square error) to quantify performance.
# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)
# Fit final OLS model
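# has_constant="add" always inserts the intercept column, so train and test
# design matrices keep the same columns even if a feature is constant in one split
X_train_sel = sm.add_constant(X_train[selected_features], has_constant="add")
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict & evaluate
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")
y_pred = model.predict(X_test_sel)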
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
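As a sanity check on what the two metrics measure, they can be computed by hand with NumPy. The sketch below uses four made-up cost values. The helper name `r2_and_rmse` and the numbers are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np

def r2_and_rmse(y_true, y_hat):
    """Compute R² = 1 - SS_res / SS_tot and RMSE = sqrt(mean squared residual)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / y_true.size)
    return r2, rmse

# Hypothetical costs: every prediction is off by exactly 10 units.
y_true_demo = [100.0, 150.0, 200.0, 250.0]
y_hat_demo = [110.0, 140.0, 210.0, 240.0]

r2_demo, rmse_demo = r2_and_rmse(y_true_demo, y_hat_demo)
print(f"R2 = {r2_demo:.3f}, RMSE = {rmse_demo:.1f}")
# R2 = 0.968, RMSE = 10.0
```

RMSE is expressed in the same units as the cost, so it reads directly as a typical error per trip, while R² is unitless and compares the model against always predicting the mean cost.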
Residual Diagnostics
A scatter plot of residuals vs predictions checks for nonrandom patterns or heteroscedasticity, thereby validating the linear model assumptions.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
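A rough numeric companion to the plot is to compare the residual variance for low and high predicted costs. The sketch below is a crude check on simulated residuals, not a formal test. The helper `variance_ratio` and all data in it are assumptions made for illustration. A ratio far from 1 suggests the spread of errors changes with the predicted cost.

```python
import numpy as np

def variance_ratio(predicted, residuals):
    """Residual variance in the upper half of predictions divided by the lower half."""
    predicted = np.asarray(predicted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    order = np.argsort(predicted)        # sort observations by predicted value
    half = len(order) // 2
    low = residuals[order[:half]]        # residuals for the smaller predictions
    high = residuals[order[half:]]       # residuals for the larger predictions
    return high.var(ddof=1) / low.var(ddof=1)

# Simulated predictions and two kinds of residuals (hypothetical data).
rng = np.random.default_rng(0)
pred_demo = rng.uniform(1.0, 10.0, 1000)
flat_resid = rng.normal(0.0, 1.0, 1000)             # constant spread
fan_resid = rng.normal(0.0, 1.0, 1000) * pred_demo  # spread grows with prediction

print("constant spread ratio:", variance_ratio(pred_demo, flat_resid))
print("fanning spread ratio: ", variance_ratio(pred_demo, fan_resid))
```

With the real model, `variance_ratio(y_pred, residuals)` applies the same idea to the test-set residuals. For a formal test, statsmodels provides `het_breuschpagan` in `statsmodels.stats.diagnostic`.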
Summary
Stepwise regression on the logistics cost dataset narrows the candidate predictors (distance, weight, and individual vehicle or route categories, among others) to those that pass the p-value thresholds, and discards the rest. How well the resulting linear model performs is shown by the test-set R² and RMSE printed above: a high R² together with an RMSE that is small relative to typical trip costs indicates a model accurate enough to act on. Logistics managers can then use the fitted coefficients to refine pricing strategies, optimise fleet allocation, and improve profitability through data-driven cost management.