Urban Logistics Fare Prediction using Quantile Regression in ML
Urban logistics operations—such as taxi and ride‐hail services—must budget for the full distribution of trip fares, not just the mean. Trips in low‐demand areas yield lower fares (10th percentile), while peak‐hour, long‐distance rides push fares to the upper tail (90th percentile). Relying on average fare estimates risks mispricing service and misallocating driver resources.
In this tutorial, we predict the 10th, 50th, and 90th percentiles of trip fare (fare_amount, USD) using the New York City Taxi Fare Prediction dataset. The predictors are trip distance (derived from the pickup and dropoff coordinates), passenger count, and the pickup hour and weekday.
By fitting separate quantile regression models, we’ll uncover how each factor drives low‑, median‑, and high‑fare scenarios—helping urban logistics platforms set dynamic pricing floors, plan around typical fares, and provision for premium trips.
Libraries Required
import pandas as pd
import numpy as np
from math import radians, sin, cos, sqrt, atan2
import statsmodels.formula.api as smf  # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts
Dataset
New York City Taxi Fare Prediction
Step-by-Step Code Implementation
Load & Inspect Data
We load the first 500,000 rows of the taxi trip file, each with pickup/dropoff coordinates, pickup timestamp, passenger count, and fare. We inspect the data types and the fare distribution to spot invalid values before modelling.
# Load the NYC Taxi Fare dataset
# Source: Kaggle – New York City Taxi Fare Prediction
df = pd.read_csv("train.csv", nrows=500000) # sample for speed
# Preview
print(df.head())
print(df.info())
print(df['fare_amount'].describe())
Preprocessing & Feature Engineering
- We parse pickup_datetime to extract hour and weekday (both in UTC, as parsed; weekday 0 is Monday), capturing temporal demand patterns.
- We compute the haversine distance (km) between pickup and dropoff points as a core driver of fare.
- We filter out non‑positive fares and distances, then select four predictors (distance_km, passenger_count, hour, weekday) and rename fare_amount to Fare.
# Parse pickup datetime
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], utc=True)
df['hour'] = df['pickup_datetime'].dt.hour
df['weekday'] = df['pickup_datetime'].dt.weekday
# Compute haversine distance
def haversine(lon1, lat1, lon2, lat2):
    # convert degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1)*cos(lat2)*sin(dlon/2)**2
    return 6371 * 2 * atan2(sqrt(a), sqrt(1-a))  # km

df['distance_km'] = df.apply(lambda row: haversine(
    row.pickup_longitude, row.pickup_latitude,
    row.dropoff_longitude, row.dropoff_latitude), axis=1)
# Filter out invalid fares/distances
df = df[(df['fare_amount'] > 0) & (df['distance_km'] > 0)]
# Define predictors and response
features = ['distance_km','passenger_count','hour','weekday']
df = df[features + ['fare_amount']].dropna()
df = df.rename(columns={'fare_amount':'Fare'})
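The row-wise apply above calls the scalar haversine function once per trip, which can be slow on hundreds of thousands of rows. As a minimal sketch, assuming NumPy is acceptable for this step, the same formula can be evaluated on whole columns at once. The function name haversine_km and the sample coordinates are illustrative and not part of the original code.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same constant as the scalar version

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km for scalars, NumPy arrays, or pandas Series.

    Returns a NumPy array (or a NumPy scalar for scalar inputs).
    """
    lon1, lat1, lon2, lat2 = (np.radians(np.asarray(v, dtype=float))
                              for v in (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    return EARTH_RADIUS_KM * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Illustrative coordinates (approximate, not taken from the dataset):
# a midtown-to-airport style trip, and a trip with identical endpoints.
pickup_lon = np.array([-73.9855, -73.9855])
pickup_lat = np.array([40.7580, 40.7580])
dropoff_lon = np.array([-73.7781, -73.9855])
dropoff_lat = np.array([40.6413, 40.7580])

distances = haversine_km(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat)
print(distances)
```

With the DataFrame from this tutorial, the call would take the four coordinate columns directly, for example haversine_km(df['pickup_longitude'], df['pickup_latitude'], df['dropoff_longitude'], df['dropoff_latitude']), and the result can be assigned to df['distance_km'].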
Train/Test Split
We randomly hold out 20% of trips for out‑of‑sample evaluation, ensuring our quantile models generalize to new trips.
# 80/20 split
train, test = train_test_split(df, test_size=0.2, random_state=42)
Fit Quantile Regression Models
For each target percentile (10th, 50th, 90th):
- We specify a formula string (e.g. “Fare ~ distance_km + passenger_count + hour + weekday”).
- We fit a QuantReg model on the training set at that quantile.
- We print the coefficient table, revealing how each predictor’s effect on fare shifts across the fare distribution—for instance, per‑km rate may be lower in the lower tail than in the upper tail.
quantiles = [0.10, 0.50, 0.90]
models = {}
formula = "Fare ~ " + " + ".join(features)
for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])
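For readers who want the objective behind each fit, linear quantile regression at level τ is usually stated as minimising the check (pinball) loss over the training rows. The notation below is an assumption for illustration: y_i is the fare, x_i the predictor vector including the intercept, and β the coefficient vector.

```latex
\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\bigl(y_i - x_i^\top \beta\bigr),
\qquad
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr)
```

At τ = 0.9, a fare above the fitted line (u > 0) costs 0.9 per dollar of error and a fare below it costs 0.1 per dollar, which pushes the fitted line toward the upper tail. At τ = 0.5 both sides cost the same, so the fit targets the conditional median.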
Evaluation with Pinball Loss
- We generate quantile‑specific fare predictions on the test set.
- We compute pinball loss for each quantile. It is a proper scoring rule for quantile forecasts that penalises under‑ and over‑predictions asymmetrically according to the target percentile. Lower pinball loss indicates a better forecast of that quantile; compare losses between models at the same quantile rather than across quantiles.
for q, res in models.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Fare'], preds, alpha=q)
    print(f"{int(q*100)}th percentile pinball loss: {loss:.3f}")
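To make the asymmetry concrete, here is a small pure-Python sketch of the pinball loss, together with an empirical coverage check (the share of actual fares at or below the predicted quantile, which should be close to the target level for a well-calibrated model). The function names and the toy fares are hypothetical and are not part of the original tutorial; the loss is written to follow the same definition as mean_pinball_loss.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss for a target quantile alpha in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff > 0) costs alpha per unit of error;
        # over-prediction (diff < 0) costs (1 - alpha) per unit of error.
        total += alpha * diff if diff >= 0 else (alpha - 1) * diff
    return total / len(y_true)

def empirical_coverage(y_true, y_pred):
    """Share of observations at or below the predicted quantile."""
    return sum(y <= p for y, p in zip(y_true, y_pred)) / len(y_true)

# Toy fares (USD) and a constant 90th-percentile forecast, purely illustrative
fares = [5.0, 7.5, 9.0, 12.0, 30.0]
q90_forecast = [15.0] * len(fares)

loss = pinball_loss(fares, q90_forecast, alpha=0.9)
coverage = empirical_coverage(fares, q90_forecast)
print(f"pinball loss: {loss:.3f}, coverage: {coverage:.2f}")
```

In this toy case the single under-predicted fare (30.0 against a forecast of 15.0) contributes most of the loss, because at alpha = 0.9 each dollar of under-prediction is weighted nine times as heavily as a dollar of over-prediction. On the real test set, the same coverage check could be applied to test['Fare'] and each model's predictions.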
Summary
By applying quantile regression to urban taxi data, we obtain distribution‑aware fare forecasts:
- The 10th‑percentile model guides budget‑service pricing, anticipating low‑fare, short trips.
- The median (50th‑percentile) model predicts typical trip fares, informing standard pricing strategies.
- The 90th‑percentile model forecasts premium‑fare scenarios, such as peak‑hour or long‑distance trips, so platforms can provision for the upper tail of fares.
These quantile‑specific insights empower urban logistics platforms to set adaptive rates, manage driver supply, and optimize revenue under demand variability—ultimately improving both customer satisfaction and operational efficiency.