Transport Logistics Cost Prediction using Stepwise Regression in ML

Logistics providers aim to predict the cost of each delivery to optimise routing, budgeting, and pricing strategies. In this project, we will predict trip costs for a logistics network using shipment attributes (distance, weight, volume), vehicle characteristics, and temporal factors (day of week, month). By applying stepwise regression, we’ll identify the strongest cost drivers and build an interpretable linear model that balances simplicity with predictive accuracy—helping operations teams make data‑driven decisions on rate setting and route planning.

Libraries Required

import pandas as pd               # Data manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Statistical modeling  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Model evaluation  
import matplotlib.pyplot as plt   # Plotting residuals

Dataset

Cost Prediction for Logistic Company

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We import the competition’s training CSV and examine its structure, checking columns such as Distance, Weight, Vehicle_Type, Origin, Destination, and the target Cost.

# Block 1: Load dataset
# Competition page: Cost Prediction for Logistic Company
df = pd.read_csv("train.csv")     # assume train.csv from competition download
print(df.head())
print(df.info())
print(df.describe())
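Before encoding, it can help to confirm that the columns used in the later blocks are actually present and to see where values are missing. The following is a minimal sketch on a toy frame standing in for train.csv; the column names are the ones mentioned in this tutorial and have not been verified against the competition file, and `check_columns` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

# Column names assumed by the later steps of this tutorial
# (an assumption, not verified against the actual competition file)
EXPECTED_COLUMNS = ["Distance", "Weight", "Vehicle_Type",
                    "Origin", "Destination", "Cost"]

def check_columns(df, expected=EXPECTED_COLUMNS):
    """Return (expected columns that are absent, missing-value count per column)."""
    missing_cols = [col for col in expected if col not in df.columns]
    na_counts = df.isna().sum()
    return missing_cols, na_counts

# Toy frame standing in for train.csv
toy = pd.DataFrame({
    "Distance": [120.0, 45.5, None],
    "Weight": [800, 150, 420],
    "Vehicle_Type": ["Truck", "Van", "Truck"],
    "Cost": [310.0, 95.0, 180.0],
})

missing_cols, na_counts = check_columns(toy)
print("Absent columns:", missing_cols)
print("Columns with missing values:")
print(na_counts[na_counts > 0])
```

If the real file uses different column names, adjust the list here and in the encoding step that follows.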

Data Preprocessing

Rows with missing values (if any) are dropped first, and the categorical features (Vehicle_Type, Origin, Destination) are then one‑hot encoded. We separate the predictors (X) from the response (y), then split the data into training and test sets (80/20).

# Block 2: Clean and encode categoricals
# Drop rows with missing values before encoding: get_dummies turns a missing
# category into an all-zero row, which a later dropna() can no longer detect
df_clean = df.dropna()

# Example categorical columns: Vehicle_Type, Origin, Destination
# dtype=float keeps the dummy columns numeric so statsmodels can fit on them
df_enc = pd.get_dummies(df_clean,
                        columns=["Vehicle_Type", "Origin", "Destination"],
                        drop_first=True,
                        dtype=float)

# Define features and target
X = df_enc.drop("Cost", axis=1)
y = df_enc["Cost"]

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
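The introduction mentions temporal factors (day of week, month), which would have to be derived from a date column before they can enter the model. The sketch below shows one way this could be done on toy data; the column name `Date` is hypothetical, and the real dataset may store dates differently or not at all.

```python
import pandas as pd

# Toy data; "Date" is a hypothetical column name used for illustration
toy = pd.DataFrame({
    "Date": ["2023-01-02", "2023-01-07", "2023-02-15"],
    "Distance": [120.0, 45.5, 300.0],
    "Cost": [310.0, 95.0, 640.0],
})

# Parse the date strings, then extract the calendar parts
dates = pd.to_datetime(toy["Date"])
toy["Day_of_Week"] = dates.dt.dayofweek   # Monday=0 ... Sunday=6
toy["Month"] = dates.dt.month             # January=1 ... December=12

# Encode both as categories rather than as numbers on a linear scale,
# and drop the raw date string, which OLS cannot use directly
toy_enc = pd.get_dummies(
    toy.drop(columns=["Date"]),
    columns=["Day_of_Week", "Month"],
    drop_first=True,
    dtype=float,
)
print(toy_enc.columns.tolist())
```

Encoding the day and month as dummies lets the model assign each its own cost offset instead of forcing a straight-line trend from Monday to Sunday or January to December.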

Stepwise Regression Function

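The stepwise_selection function alternates between a forward step (add the excluded predictor with the lowest p‑value, provided it is below 0.01) and a backward step (remove the included predictor with the highest p‑value, provided it is above 0.05), iterating until neither step changes the model. Keeping the entry threshold stricter than the removal threshold prevents a feature from being added and dropped in an endless cycle. The result is a concise subset of cost drivers.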

# Block 3: Forward‑backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting features; None avoids a mutable default argument
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # Forward step: fit one model per candidate and record its p-value
        # (iterating in column order keeps the result reproducible)
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_feature = new_pvals.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print(f"Add  {best_feature:30} p-value {best_pval:.6f}")

        # Backward step: refit on the current set and drop the weakest predictor
        # (skipped while no feature has been selected yet)
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.iloc[1:]  # omit intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_feature = pvals.idxmax()
                included.remove(worst_feature)
                changed = True
                if verbose:
                    print(f"Drop {worst_feature:30} p-value {worst_pval:.6f}")
        if not changed:
            break
    return included

Model Building & Evaluation

Model Fitting: With selected features, we fit an Ordinary Least Squares regression using statsmodels. The .summary() output shows coefficient estimates, p‑values, R², and diagnostic statistics.
Evaluation: We predict on unseen test data and compute R² (variance explained) and RMSE (root‑mean‑square error) to quantify performance.

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict & evaluate
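# has_constant="add" forces the intercept column even if a selected feature
# happens to be constant in the test split (the default would skip it)
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")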
y_pred = model.predict(X_test_sel)
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
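To make the two metrics concrete, here is a small worked example that computes R² and RMSE by hand with NumPy. The four cost values are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted trip costs
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

residuals = y_true - y_hat                        # [-10, 10, -10, 10]
ss_res = np.sum(residuals ** 2)                   # residual sum of squares: 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares: 12500

r2 = 1.0 - ss_res / ss_tot                        # share of variance explained
rmse = np.sqrt(np.mean(residuals ** 2))           # typical error, in cost units

print(f"R²:   {r2:.3f}")    # 0.968
print(f"RMSE: {rmse:.1f}")  # 10.0
```

R² is unitless, while RMSE is expressed in the same units as Cost, which makes it the easier figure to compare against a typical trip price.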

Residual Diagnostics

A scatter plot of residuals vs predictions checks for nonrandom patterns or heteroscedasticity, thereby validating the linear model assumptions.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()

Summary

Stepwise regression on the logistics cost dataset isolates key predictors (such as distance, weight, and specific vehicle or route categories) while discarding less informative factors. The test R² and RMSE printed above show how well the resulting linear model explains and predicts cost on held‑out data; because the same training data drives both feature selection and fitting, the p‑values in the final summary tend to be optimistic. Logistics managers can use these insights to refine pricing strategies, optimise fleet allocation, and improve profitability through data‑driven cost management.
