Delivery Time Curve Prediction with Polynomial Regression in ML
Logistics managers and last‑mile operations teams need to forecast the delivery time (minutes) of packages—based on early indicators available at dispatch—so they can proactively adjust routing, staffing, and customer notifications. Historical delivery records show that time depends nonlinearly on factors such as pickup‑to‑dropoff distance, the number of stops on the route, traffic congestion level, time of day, and driver experience.
A simple linear model misses curvature (e.g., diminishing returns on speed at longer distances) and fails to capture interactions (e.g., rush-hour distance penalties). At the other extreme, an unregularised high-degree polynomial overfits noise. By applying Polynomial Regression on engineered features with Ridge (ℓ2) regularisation, we can learn smooth, interpretable delivery-time curves that generalise well to new routes and conditions.
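For reference, the trade-off described above can be written as the objective that Ridge regression minimises over polynomial features. The notation is chosen here for illustration: \phi_d(x) denotes the degree-d polynomial expansion of the preprocessed feature vector, \alpha the regularisation strength, and \beta_0 the unpenalised intercept.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
\sum_{i=1}^{n} \bigl( y_i - \beta_0 - \phi_d(x_i)^{\top}\beta \bigr)^2
+ \alpha \,\lVert \beta \rVert_2^2
```

Setting \alpha = 0 recovers ordinary least squares on the expanded features; larger \alpha shrinks all coefficients toward zero, which damps the high-order terms that tend to fit noise.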
Libraries Required
| Purpose | Library |
|---|---|
| Data handling | pandas, numpy |
| Visualization | matplotlib, seaborn |
| ML pipeline | scikit-learn → ColumnTransformer, StandardScaler, OneHotEncoder, PolynomialFeatures, Pipeline |
| Regression model | Ridge |
| Model selection | train_test_split, GridSearchCV |
| Evaluation | mean_squared_error, r2_score |
Dataset
Step-by-Step Code Implementation
Import Libraries & Load Data
import pandas as pd
import numpy as np
# Load training data (after downloading and unzipping)
df = pd.read_csv("data/train.csv", parse_dates=["pickup_time","dropoff_time"])
# Preview key columns
df.head()[[
"pickup_time","dropoff_time","distance_km",
"num_stops","traffic_level","driver_experience_yrs"
]]
Feature Engineering & Target Creation
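This step derives the target (delivery time in minutes) and the hour-of-day feature, then selects the raw inputs. Squared and interaction terms (e.g., distance_km², distance_km × traffic_level_2) are generated later by PolynomialFeatures inside the pipeline to model curvature and cross-effects.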
# Compute delivery time in minutes
df["delivery_time_min"] = (df["dropoff_time"] - df["pickup_time"]) \
.dt.total_seconds() / 60
# Extract time‑of‑day as a categorical feature (hour)
df["hour_of_day"] = df["pickup_time"].dt.hour
# Select and clean features
# - distance_km: straight‑line distance
# - num_stops: number of scheduled stops before dropoff
# - traffic_level: categorical indicator (1=low,2=medium,3=high)
# - driver_experience_yrs: years of experience
# - hour_of_day: captures rush‑hour effects
features = ["distance_km","num_stops","traffic_level",
"driver_experience_yrs","hour_of_day"]
df = df.dropna(subset=features + ["delivery_time_min"])
X = df[features]
y = df["delivery_time_min"]
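To make the target computation concrete, here is a minimal sketch on two made-up records (the timestamps are illustrative and not taken from the dataset). It applies the same timedelta-to-minutes conversion and hour extraction as the step above.

```python
import pandas as pd

# Two made-up delivery records (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "pickup_time": pd.to_datetime(["2024-01-05 08:15:00", "2024-01-05 17:40:00"]),
    "dropoff_time": pd.to_datetime(["2024-01-05 08:52:30", "2024-01-05 18:25:00"]),
})

# Timedelta -> seconds -> minutes
toy["delivery_time_min"] = (
    toy["dropoff_time"] - toy["pickup_time"]
).dt.total_seconds() / 60

# Hour of pickup (0-23), later one-hot encoded as a categorical feature
toy["hour_of_day"] = toy["pickup_time"].dt.hour

print(toy[["delivery_time_min", "hour_of_day"]])
```

The two rows come out as 37.5 and 45.0 minutes, with pickup hours 8 and 17.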
Build a Polynomial Regression Pipeline
- StandardScaler on numeric features ensures the Ridge penalty treats them uniformly.
- OneHotEncoder on traffic_level and hour_of_day captures categorical effects without ordinality assumptions.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
# Separate numeric vs categorical
num_cols = ["distance_km","num_stops","driver_experience_yrs"]
cat_cols = ["traffic_level","hour_of_day"]
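preprocessor = ColumnTransformer([
("num", StandardScaler(), num_cols),
# handle_unknown="ignore" keeps a CV fold from failing if it meets an hour or
# traffic level absent from its training split (needs scikit-learn >= 1.0 with drop="first")
("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), cat_cols)
])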
pipe = Pipeline([
("prep", preprocessor),
("poly", PolynomialFeatures(include_bias=False)),
("ridge", Ridge(max_iter=20000, random_state=42))
])
Train/Test Split & Hyperparameter Search
- Explores polynomial degrees 1–3 and α from 0.001 to 100.
- 5‑fold cross‑validation identifies the combination that minimises RMSE.
- Applies ℓ2 regularisation (controlled by alpha) to shrink noisy high-order coefficients, which limits overfitting.
from sklearn.model_selection import train_test_split, GridSearchCV
# Temporal split isn’t critical here; random split suffices for cross‑sectional data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
"poly__degree": [1, 2, 3],
"ridge__alpha": np.logspace(-3, 2, 6) # 0.001 → 100
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring="neg_root_mean_squared_error",
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best degree :", gs.best_params_["poly__degree"])
print("Best alpha :", gs.best_params_["ridge__alpha"])
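Beyond the single best combination, it can help to see how cross-validated RMSE varies across the whole grid. The following is a minimal sketch on synthetic data (a made-up quadratic distance-to-time relationship standing in for the real dataset); the names demo_pipe, demo_grid, and demo_gs are hypothetical and independent of the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: delivery time grows quadratically with distance
rng = np.random.default_rng(0)
distance = rng.uniform(0, 20, size=300)
minutes = 5 + 2 * distance + 0.3 * distance**2 + rng.normal(0, 2, size=300)
X_syn = distance.reshape(-1, 1)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])
demo_grid = {
    "poly__degree": [1, 2, 3],
    "ridge__alpha": np.logspace(-3, 2, 6),
}
demo_gs = GridSearchCV(demo_pipe, demo_grid, cv=5,
                       scoring="neg_root_mean_squared_error")
demo_gs.fit(X_syn, minutes)

# Scores are negated RMSE (higher is better), so flip the sign to read minutes
results = pd.DataFrame(demo_gs.cv_results_)[
    ["param_poly__degree", "param_ridge__alpha", "mean_test_score"]
].assign(cv_rmse=lambda d: -d["mean_test_score"])

print(results.sort_values("cv_rmse").head())
print("Best:", demo_gs.best_params_)
```

Because the synthetic signal is quadratic, degree 1 should score clearly worse than degrees 2 and 3. On real data the gap between neighbouring grid cells is often small, in which case the simpler model is usually the safer choice.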
Evaluate Model
from sklearn.metrics import mean_squared_error, r2_score
y_pred = gs.predict(X_test)
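# Take the square root explicitly: the squared=False argument is deprecated
# in recent scikit-learn releases, while this form works across versions
rmse = np.sqrt(mean_squared_error(y_test, y_pred))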
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE : {rmse:.2f} minutes")
print(f"Test R² : {r2:.3f}")
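To show what the two metrics measure, here is a short sketch that computes RMSE and R² by hand on four made-up delivery times and compares them with a baseline that always predicts the mean. All numbers are illustrative.

```python
import numpy as np

# Made-up actual vs predicted delivery times in minutes
y_true = np.array([30.0, 45.0, 60.0, 25.0])
y_hat = np.array([28.0, 50.0, 55.0, 27.0])

residuals = y_true - y_hat
rmse_manual = np.sqrt(np.mean(residuals**2))    # typical error size, in minutes

ss_res = np.sum(residuals**2)                   # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean())**2)    # variation around the mean
r2_manual = 1 - ss_res / ss_tot

# Baseline: always predict the mean delivery time
baseline_rmse = np.sqrt(np.mean((y_true - y_true.mean())**2))

print(f"RMSE: {rmse_manual:.2f} min, R²: {r2_manual:.3f}, "
      f"baseline RMSE: {baseline_rmse:.2f} min")
```

Here the model's RMSE is about 3.81 minutes against a baseline of about 13.69 minutes, giving R² of about 0.923. Reporting a baseline alongside the test RMSE makes it easier to judge whether the polynomial model is worth its complexity.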
Inspect Key Polynomial Coefficients
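Inspecting the largest-magnitude coefficients shows which nonlinear and interaction effects (such as a higher penalty for long-distance travel during high-traffic hours) most influence delivery-time predictions. Because the numeric inputs are standardised, magnitudes are roughly comparable across features, but they indicate influence within the model, not statistical significance.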
# Retrieve feature names after preprocessing & expansion
prep = gs.best_estimator_.named_steps["prep"]
num_features = num_cols
cat_features = prep.named_transformers_["cat"] \
.get_feature_names_out(cat_cols).tolist()
input_feats = num_features + cat_features
poly = gs.best_estimator_.named_steps["poly"]
feat_names = poly.get_feature_names_out(input_features=input_feats)
coefs = gs.best_estimator_.named_steps["ridge"].coef_
# Rank features by absolute coefficient size (pandas is already imported as pd)
imp = pd.Series(coefs, index=feat_names) \
.abs().sort_values(ascending=False).head(10)
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
imp.plot(kind="barh")
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Delivery Time")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
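The plot above ranks features by absolute value, which hides whether a term lengthens or shortens delivery time. A minimal sketch of a sign-preserving variant follows; top_signed_coefficients is a hypothetical helper, and the coefficients passed to it are made up for illustration.

```python
import numpy as np
import pandas as pd

def top_signed_coefficients(coefs, names, k=10):
    """Return the k largest-magnitude coefficients with their signs preserved."""
    s = pd.Series(np.asarray(coefs, dtype=float), index=list(names))
    order = s.abs().sort_values(ascending=False).index[:k]
    return s.loc[order]

# Made-up coefficients for illustration only
demo = top_signed_coefficients(
    coefs=[4.2, -0.3, 1.1, -2.7],
    names=["distance_km", "num_stops", "distance_km^2",
           "distance_km driver_experience_yrs"],
    k=3,
)
print(demo)
```

With the fitted model, the same helper could be called as top_signed_coefficients(coefs, feat_names), and the result plotted with kind="barh" so that negative bars mark effects that reduce predicted delivery time.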
Summary
This Polynomial Regression pipeline with Ridge regularisation provides:
- Accurate, smooth modelling of delivery‑time dynamics, capturing nonlinear distance and traffic interactions.
- Controlled complexity via grid-searched polynomial degree and α, which limits overfitting to noise.
- Interpretable insights through the top polynomial features, guiding logistics teams on the most critical route and time-of-day effects to manage for on-time deliveries.