Delivery Time Prediction using Linear Regression in ML
Quick and reliable delivery-time estimates keep customers happy and help platforms schedule riders efficiently. Using the open Food Delivery Time dataset, we build a linear regression baseline that predicts the total minutes from order placement to drop-off for each order. A transparent linear fit exposes the first-order drivers (distance, prep delay, rider ratings, weather, traffic) and gives every fancier model a hard benchmark to beat.
Libraries Required
- pandas # data wrangling
- numpy # numeric helpers
- matplotlib.pyplot # sanity‑check visuals
- scikit‑learn # preprocessing, model, metrics
- joblib # save the trained pipeline
Dataset Link
Step-by-Step Code Implementation
Why linear regression? Within normal operating ranges, delivery duration rises roughly linearly with distance and prep delay, while traffic or weather adds near‑constant penalties. A straight‑line fit exposes each driver’s minute‑per‑unit contribution and is trivial to explain to ops teams.
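To make the "minutes per unit" reading concrete, here is a minimal sketch on synthetic data. The numbers (a 10-minute base, 2.5 minutes per km, 1 minute per prep minute) are invented for illustration and do not come from the dataset; the variable names are hypothetical too.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic orders: the "true" relationship is chosen by us, not measured.
rng = np.random.default_rng(0)
n = 500
distance_km = rng.uniform(1, 20, n)
prep_minutes = rng.uniform(5, 15, n)
noise = rng.normal(0, 1.0, n)
time_taken = 10 + 2.5 * distance_km + 1.0 * prep_minutes + noise

# Fit an unscaled linear model so each coefficient is in minutes per raw unit.
X_demo = np.column_stack([distance_km, prep_minutes])
demo_model = LinearRegression().fit(X_demo, time_taken)

print("intercept (min):", round(demo_model.intercept_, 2))
print("min per km, min per prep minute:", demo_model.coef_.round(2))
```

The fitted coefficients should land close to the values used to generate the data, which is the property that makes a linear baseline easy to explain.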
1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
(Download the CSV first—and adjust the path.)
df = pd.read_csv("food_delivery_time.csv")
print(df.head())
3. Minimal cleaning
# strip leading/trailing spaces in categorical columns
cat_cols_raw = ['Weather_conditions', 'Road_traffic_density',
                'Festival', 'City']
for col in cat_cols_raw:
    df[col] = df[col].str.strip()
# drop rows with missing critical fields
df = df.dropna(subset=['Order_Date', 'Time_Orderd',
                       'Time_Order_picked', 'Restaurant_latitude',
                       'Restaurant_longitude',
                       'Delivery_location_latitude',
                       'Delivery_location_longitude',
                       'Time_taken (min)'])
5. Feature engineering
- Preparation lag (order → pickup) is often the largest source of uncertainty. Computing it in minutes separates kitchen speed from travel speed so that the model can weigh them independently.
- Haversine distance converts raw lat/long pairs into a single, physically meaningful kilometre feature that is highly predictive yet cheap to compute.
- Calendar cue (order_dayofweek) captures the mid-week lull and weekend spikes without requiring additional data feeds.
# ---- 3.4.1 Order → pickup delay (minutes) ----
# both columns hold HH:MM:SS strings, which to_timedelta parses directly
df['prep_minutes'] = (
    pd.to_timedelta(df['Time_Order_picked']) -
    pd.to_timedelta(df['Time_Orderd'])
).dt.total_seconds() / 60.0
# ---- 3.4.2 Haversine distance (km) ----
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dl = np.radians(lon2 - lon1)
    a = np.sin(dphi/2)**2 + np.cos(phi1)*np.cos(phi2)*np.sin(dl/2)**2
    return 2*R*np.arcsin(np.sqrt(a))

df['distance_km'] = haversine(df['Restaurant_latitude'],
                              df['Restaurant_longitude'],
                              df['Delivery_location_latitude'],
                              df['Delivery_location_longitude'])
# ---- 3.4.3 Encode order day‑of‑week ----
df['order_dayofweek'] = pd.to_datetime(df['Order_Date']).dt.dayofweek
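Two quick sanity checks on the engineered features may be worth running before modelling. The sketch below repeats the haversine function so it runs on its own, checks it against a known value, and shows one possible guard for orders placed just before midnight and picked up just after, where a plain time-of-day subtraction goes negative. The toy timestamps are hypothetical, and the wrap assumes preparation never takes 24 hours or more.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dl = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dl / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

# Check 1: one degree of longitude on the equator is R * pi / 180 km.
one_degree_km = haversine(0.0, 0.0, 0.0, 1.0)
print(round(one_degree_km, 2))

# Check 2: hypothetical orders; the second one crosses midnight.
toy = pd.DataFrame({
    "Time_Orderd": ["19:30:00", "23:50:00"],
    "Time_Order_picked": ["19:45:00", "00:05:00"],
})
raw_gap = (
    pd.to_timedelta(toy["Time_Order_picked"]) - pd.to_timedelta(toy["Time_Orderd"])
).dt.total_seconds() / 60.0

# Wrap negative gaps into the next day (assumption: prep is under 24 hours).
wrapped_gap = raw_gap % (24 * 60)
print(raw_gap.tolist(), wrapped_gap.tolist())
```

Whether the real data contains such midnight crossings is not verified here; checking `(df['prep_minutes'] < 0).sum()` would tell.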
6. Define predictors & label
num_cols = ['Delivery_person_Age', 'Delivery_person_Ratings',
            'Vehicle_condition', 'multiple_deliveries',
            'prep_minutes', 'distance_km', 'order_dayofweek']
cat_cols = ['Weather_conditions', 'Road_traffic_density',
            'Festival', 'City']
target = 'Time_taken (min)'
X = df[num_cols + cat_cols]
y = df[target]
7. Pre‑processing & model pipeline
ColumnTransformer + Pipeline combines scaling, one-hot encoding, and the regressor into a single serializable object, so exactly the same preprocessing is applied at training time and at prediction time.
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline([
    ('prep', preproc),
    ('model', linreg)
])
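One gap worth flagging: the cleaning step only drops rows with missing time, coordinate, and target fields, so any missing values left in the numeric predictors would reach LinearRegression, which rejects NaN. A possible variant, sketched here on toy data with hypothetical values, adds a median imputer in front of the scaler. It also shows what handle_unknown='ignore' does when a category appears that was never seen in training.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data with one missing distance (values are made up).
toy_train = pd.DataFrame({
    "distance_km": [2.0, 5.0, np.nan, 8.0, 11.0, 4.0],
    "Road_traffic_density": ["Low", "High", "Low", "Jam", "High", "Low"],
})
toy_y = [15.0, 30.0, 20.0, 45.0, 50.0, 18.0]

# Numeric branch: fill gaps with the column median, then standardize.
num_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
toy_preproc = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Road_traffic_density"]),
    ("num", num_branch, ["distance_km"]),
])
toy_pipe = Pipeline([("prep", toy_preproc), ("model", LinearRegression())])
toy_pipe.fit(toy_train, toy_y)

# "Medium" was never seen in training, so it is encoded as all zeros;
# the missing distance is replaced by the training median.
toy_new = pd.DataFrame({
    "distance_km": [np.nan],
    "Road_traffic_density": ["Medium"],
})
toy_pred = toy_pipe.predict(toy_new)
print(toy_pred)
```

Median imputation is only one choice; dropping the affected rows, as the cleaning step does for other fields, is equally defensible for a baseline.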
8. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)
9. Metrics chosen
R² reports the share of variance in delivery time that the model explains; MAE speaks the language of operations (for example, “our typical miss is about 4.3 minutes”).
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.1f} minutes")
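For readers who want to see what the two metrics compute, here is a small sketch on three hypothetical predictions. It works both numbers out by hand and compares the MAE with a baseline that always predicts the mean, which is the reference point R² is measured against.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted delivery times in minutes.
y_true = np.array([20.0, 30.0, 40.0])
y_hat = np.array([22.0, 28.0, 41.0])

# MAE: average absolute miss.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Baseline that ignores the features and always predicts the mean
# (fitted and scored on the same three points purely for illustration).
dummy_X = np.zeros((3, 1))
baseline = DummyRegressor(strategy="mean").fit(dummy_X, y_true)
baseline_mae = mean_absolute_error(y_true, baseline.predict(dummy_X))

print(f"MAE {mae_manual:.3f}, R² {r2_manual:.3f}, baseline MAE {baseline_mae:.3f}")
```

The manual values match `mean_absolute_error` and `r2_score`, and the mean-only baseline gives a floor that any useful model should beat.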
10. Inspect top coefficients
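Coefficient inspection quickly surfaces bottlenecks: for example, a weight of +7 minutes for High traffic or +5 minutes for rainy weather would guide dispatch rules even before complex models are rolled out. Because the numeric features pass through StandardScaler, their coefficients are in minutes per standard deviation rather than per raw unit.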
# recover encoded feature names
ohe_feats = (pipe.named_steps['prep']
             .named_transformers_['cat']
             .get_feature_names_out(cat_cols))
feature_names = list(ohe_feats) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_,
                  index=feature_names).sort_values()
print("\nFast‑delivery factors:")
print(coefs.head(8))
print("\nDelay‑inducing factors:")
print(coefs.tail(8))
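Coefficients of standardized features can be converted back to raw units by dividing by the scaler's standard deviation. The sketch below shows the idea on noise-free synthetic data where the true slope is set to 3 minutes per km; the data and names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: exactly 3 minutes per km on top of a 10-minute base.
demo = pd.DataFrame({"distance_km": [1.0, 2.0, 4.0, 7.0, 10.0]})
demo_y = 10 + 3 * demo["distance_km"]

demo_pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
demo_pipe.fit(demo, demo_y)

# The model sees standardized distance, so its coefficient is minutes per SD.
coef_per_sd = demo_pipe.named_steps["model"].coef_[0]
sd_km = demo_pipe.named_steps["scale"].scale_[0]

# Dividing by the feature's standard deviation gives minutes per km.
coef_per_km = coef_per_sd / sd_km
print(round(coef_per_sd, 3), round(sd_km, 3), round(coef_per_km, 3))
```

In the full pipeline the same scaler attribute should be reachable through `pipe.named_steps['prep'].named_transformers_['num'].scale_`, in the order of `num_cols`.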
11. Persist the trained pipeline
joblib.dump(pipe, "delivery_time_linreg.pkl")
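Loading the saved object back and scoring a new order could look like the following sketch. It trains a tiny stand-in pipeline on made-up numbers so the example runs on its own; the file name, columns, and values are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline trained on made-up orders.
train = pd.DataFrame({
    "distance_km": [1.0, 3.0, 6.0, 9.0],
    "prep_minutes": [5.0, 10.0, 8.0, 12.0],
})
minutes = [14.0, 24.0, 30.0, 42.0]
small_pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
small_pipe.fit(train, minutes)

# Save to a temporary location, then load it back as a fresh object.
path = os.path.join(tempfile.mkdtemp(), "demo_linreg.pkl")
joblib.dump(small_pipe, path)
restored = joblib.load(path)

# A new order must carry the same column names used during training.
new_order = pd.DataFrame({"distance_km": [4.5], "prep_minutes": [9.0]})
eta_before = small_pipe.predict(new_order)[0]
eta_after = restored.predict(new_order)[0]
print(round(eta_after, 1), "minutes")
```

Pickled scikit-learn objects are generally tied to the library version that wrote them, so recording the scikit-learn version alongside the file is a sensible precaution.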
Summary
In ~70 lines of Python, we turned raw order logs into an explainable delivery‑time estimator. The linear model delivers:
- Actionable forecasts for customer ETAs and rider scheduling.
- Clear per-feature effect sizes that show how distance, kitchen prep, traffic, and weather tug on delivery duration.
Use this interpretable baseline as your compass; when you upgrade to gradient‑boosted trees or neural networks, you’ll know exactly how much extra predictive punch the complexity buys.