Urban Logistics Fare Prediction using Quantile Regression in ML

Urban logistics operations, such as taxi and ride-hail services, must budget for the full distribution of trip fares, not just the mean. Short trips in low-demand areas sit near the lower tail of the fare distribution (around the 10th percentile), while peak-hour, long-distance rides push fares into the upper tail (around the 90th percentile). Relying on average fare estimates alone risks mispricing the service and misallocating drivers.

In this urban logistics fare prediction project, we'll predict the 10th, 50th, and 90th percentiles of trip fare (fare_amount, in USD) using the New York City Taxi Fare Prediction dataset. The models use trip distance (derived from the pickup and dropoff coordinates), passenger count, and pickup hour and weekday as features.

By fitting a separate quantile regression model for each percentile, we'll uncover how each factor drives low-, median-, and high-fare scenarios. This helps urban logistics platforms set dynamic pricing floors, plan around typical fares, and provision for premium trips.
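Quantile regression targets a percentile by minimising the pinball (quantile) loss rather than squared error. The sketch below is only an illustration on synthetic, made-up fares (not the taxi data): it grid-searches the constant prediction with the lowest pinball loss and compares it with the empirical quantile. The helper name pinball_loss and all numbers are assumptions chosen for the demo.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss at level q, with 0 < q < 1."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-predictions (diff > 0) cost q per unit; over-predictions cost (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Synthetic, right-skewed "fares" purely for illustration (not taxi data).
rng = np.random.default_rng(0)
fares = rng.lognormal(mean=2.3, sigma=0.5, size=5000)

# Grid-search the constant prediction that minimises the loss at each level.
candidates = np.linspace(fares.min(), fares.max(), 2000)
best_constant = {}
for q in (0.10, 0.50, 0.90):
    losses = [pinball_loss(fares, c, q) for c in candidates]
    best_constant[q] = float(candidates[int(np.argmin(losses))])
    # The two printed numbers should be close to each other.
    print(q, round(best_constant[q], 2), round(float(np.quantile(fares, q)), 2))
```

The constant that minimises the loss at level q lands near the empirical q-th quantile, which is why swapping squared error for pinball loss turns a regression into a quantile regression.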

Libraries Required

import pandas as pd  
import numpy as np  
from math import radians, sin, cos, sqrt, atan2  
import statsmodels.formula.api as smf       # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts  

Dataset

New York City Taxi Fare Prediction

Step-by-Step Code Implementation

Load & Inspect Data

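We load the first 500,000 rows of the taxi trip data (nrows reads from the top of the file, so this is not a random sample). Each row has pickup/dropoff coordinates, a pickup timestamp, a passenger count, and a fare. We inspect the data types and the fare distribution to check that the data looks valid.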

# Load the NYC Taxi Fare dataset
# Source: Kaggle – New York City Taxi Fare Prediction 
df = pd.read_csv("train.csv", nrows=500000)  # sample for speed

# Preview
print(df.head())
df.info()  # prints its report directly and returns None, so no print() is needed
print(df['fare_amount'].describe())
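describe() reports quartiles only, while this tutorial models the 10th and 90th percentiles. A small helper along the following lines could summarise the tails together with missing and non-positive fares. The function name fare_overview and the tiny demo frame are hypothetical; on the real data you would pass df instead.

```python
import pandas as pd

def fare_overview(frame, column="fare_amount"):
    """Summarise missing values, non-positive fares, and tail percentiles."""
    fares = frame[column]
    summary = {
        "rows": int(len(frame)),
        "missing": int(fares.isna().sum()),
        "non_positive": int((fares <= 0).sum()),
    }
    # Percentiles that match the quantiles modelled later in the tutorial.
    for q in (0.10, 0.50, 0.90):
        summary[f"p{round(q * 100)}"] = float(fares.quantile(q))
    return pd.Series(summary)

# Tiny made-up frame standing in for the real train.csv rows.
demo = pd.DataFrame({"fare_amount": [4.5, 6.1, -2.0, 8.9, None, 52.0, 10.5]})
overview = fare_overview(demo)
print(overview)
```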

Preprocessing & Feature Engineering

  • We parse pickup_datetime, convert it from UTC to New York local time, and extract hour and weekday (0 = Monday) to capture temporal demand patterns.
  • We compute the haversine distance (km) between pickup and dropoff points as a core driver of fare.
  • We filter out non-positive fares and distances, then select four predictors (distance_km, passenger_count, hour, weekday) and rename fare_amount to Fare.

# Parse pickup datetime as UTC, then convert to local time so that
# hour and weekday reflect when riders in New York actually travelled
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], utc=True)
local_time = df['pickup_datetime'].dt.tz_convert('America/New_York')
df['hour'] = local_time.dt.hour
df['weekday'] = local_time.dt.weekday  # 0 = Monday, 6 = Sunday

# Compute haversine distance (vectorized with NumPy, so no row-wise apply is needed)
def haversine(lon1, lat1, lon2, lat2):
    # convert degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2)**2
    a = np.clip(a, 0, 1)  # guard against rounding slightly outside [0, 1]
    return 6371 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))  # km (mean Earth radius 6371 km)

df['distance_km'] = haversine(
    df['pickup_longitude'], df['pickup_latitude'],
    df['dropoff_longitude'], df['dropoff_latitude'])

# Filter out invalid fares/distances
df = df[(df['fare_amount'] > 0) & (df['distance_km'] > 0)]

# Define predictors and response
features = ['distance_km','passenger_count','hour','weekday']
df = df[features + ['fare_amount']].dropna()
df = df.rename(columns={'fare_amount':'Fare'})
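The filter above removes only non-positive fares and distances. Raw trip records may still contain implausible values, such as extreme distances from bad coordinates or odd passenger counts, and these can distort the tail quantiles. The sketch below shows one way to add plausibility bounds. The thresholds and the helper name drop_implausible_trips are assumptions for illustration, not values from the original tutorial, and should be tuned after inspecting the data.

```python
import pandas as pd

# Hypothetical plausibility bounds; adjust them after inspecting your own data.
MAX_DISTANCE_KM = 100.0   # assumed upper limit for a single trip
MAX_PASSENGERS = 6        # assumed vehicle capacity
MAX_FARE_USD = 500.0      # assumed ceiling to drop likely data-entry errors

def drop_implausible_trips(frame):
    """Keep rows whose distance, passenger count, and fare look plausible."""
    mask = (
        (frame["distance_km"] > 0) & (frame["distance_km"] <= MAX_DISTANCE_KM)
        & (frame["passenger_count"] >= 1) & (frame["passenger_count"] <= MAX_PASSENGERS)
        & (frame["Fare"] > 0) & (frame["Fare"] <= MAX_FARE_USD)
    )
    return frame[mask].copy()

# Made-up rows: only the first one passes every check.
demo = pd.DataFrame({
    "distance_km":     [2.4, 8500.0, 1.1, 3.0],
    "passenger_count": [1,   2,      0,   1],
    "Fare":            [9.5, 12.0,   6.0, 900.0],
})
cleaned = drop_implausible_trips(demo)
print(cleaned)
```

In the tutorial's pipeline this would be applied as df = drop_implausible_trips(df) before the train/test split.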

Train/Test Split

We randomly hold out 20% of trips for out-of-sample evaluation, so we can check how well the quantile models generalize to trips they were not fitted on.

# 80/20 split
train, test = train_test_split(df, test_size=0.2, random_state=42)
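A quick sanity check after a random split is to confirm that both parts have similar fare quantiles. The sketch below assumes two DataFrames with a Fare column. The helper name compare_split_quantiles is hypothetical, and the demo uses synthetic fares with a pandas-only split so that it runs on its own.

```python
import numpy as np
import pandas as pd

def compare_split_quantiles(train_frame, test_frame, column="Fare",
                            quantiles=(0.10, 0.50, 0.90)):
    """Return the empirical quantiles of `column` in both splits side by side."""
    qs = list(quantiles)
    return pd.DataFrame({
        "train": train_frame[column].quantile(qs),
        "test": test_frame[column].quantile(qs),
    })

# Synthetic fares and a pandas-only 80/20 split, for illustration only.
rng = np.random.default_rng(42)
demo = pd.DataFrame({"Fare": rng.lognormal(2.3, 0.5, size=10_000)})
demo_test = demo.sample(frac=0.2, random_state=42)
demo_train = demo.drop(demo_test.index)

table = compare_split_quantiles(demo_train, demo_test)
print(table)
```

With the tutorial's objects, compare_split_quantiles(train, test) gives the same table for the real data. Large gaps between the two columns would suggest an unlucky split.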

Fit Quantile Regression Models

For each target percentile (10th, 50th, 90th):

  • We specify a formula string (e.g. “Fare ~ distance_km + passenger_count + hour + weekday”).
  • We fit a QuantReg model on the training set at that quantile.
  • We print the coefficient table, which shows how each predictor's effect on fare shifts across the fare distribution. For instance, the per-km rate may be lower in the lower tail than in the upper tail.

quantiles = [0.10, 0.50, 0.90]
models    = {}
formula   = "Fare ~ " + " + ".join(features)

for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)   # fit the conditional q-th quantile of Fare
    models[q] = res      # keep the fitted result for evaluation
    # round() avoids labels that are off by one from floating-point truncation
    print(f"\n--- {round(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])
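Because each quantile is fitted separately, nothing forces the predicted 10th percentile to stay below the median, or the median below the 90th, for every trip. The sketch below measures how often such "quantile crossing" occurs and applies a simple post-hoc fix by sorting each trip's predictions. This check is not part of the original tutorial, and the helper names and demo numbers are assumptions.

```python
import numpy as np

def _stack(preds_by_q):
    """Stack predictions into a 2-D array with columns ordered by quantile level."""
    qs = sorted(preds_by_q)
    cols = [np.asarray(preds_by_q[q], dtype=float) for q in qs]
    return qs, np.column_stack(cols)

def crossing_rate(preds_by_q):
    """Fraction of rows whose predictions decrease somewhere as q increases."""
    _, stacked = _stack(preds_by_q)
    crossed = np.any(np.diff(stacked, axis=1) < 0, axis=1)
    return float(crossed.mean())

def rearrange(preds_by_q):
    """Sort each row's predictions so they increase with the quantile level."""
    qs, stacked = _stack(preds_by_q)
    fixed = np.sort(stacked, axis=1)
    return {q: fixed[:, i] for i, q in enumerate(qs)}

# Made-up predictions for three trips; the second trip crosses.
demo_preds = {
    0.10: np.array([5.0, 7.0, 9.5]),
    0.50: np.array([8.0, 6.5, 12.0]),
    0.90: np.array([15.0, 14.0, 20.0]),
}
rate_before = crossing_rate(demo_preds)
rate_after = crossing_rate(rearrange(demo_preds))
print(rate_before, rate_after)
```

With the fitted models, the input could be built as {q: res.predict(test[features]) for q, res in models.items()}.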

Evaluation with Pinball Loss

  • We generate quantile‑specific fare predictions on the test set.
  • We compute pinball loss for each quantile. It is a proper scoring metric that penalises under- and over-predictions asymmetrically according to the target percentile. Lower pinball loss indicates a more accurate forecast of that quantile; because the penalty weights depend on the quantile, compare losses only between models fitted at the same quantile.

for q, res in models.items():
    preds = res.predict(test[features])   # predicted q-th quantile of Fare for each test trip
    loss  = mean_pinball_loss(test['Fare'], preds, alpha=q)
    print(f"{round(q*100)}th percentile pinball loss: {loss:.3f}")
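Pinball loss alone does not show whether the predicted quantiles are calibrated. A complementary check is empirical coverage: for a well-calibrated q-th quantile forecast, roughly a fraction q of actual fares should fall at or below the prediction. The sketch below demonstrates the idea on synthetic data with known quantiles; the helper name empirical_coverage and the exponential demo data are assumptions for illustration.

```python
import numpy as np

def empirical_coverage(y_true, y_pred):
    """Share of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))

# Synthetic "fares" from an exponential distribution whose quantiles are known exactly.
rng = np.random.default_rng(1)
scale = 10.0
actual = rng.exponential(scale=scale, size=20_000)

coverage = {}
for q in (0.10, 0.50, 0.90):
    true_quantile = -scale * np.log(1.0 - q)          # exact exponential quantile
    predicted = np.full(actual.shape, true_quantile)  # a perfectly calibrated forecast
    coverage[q] = empirical_coverage(actual, predicted)
    print(q, round(coverage[q], 3))
```

On the taxi models, empirical_coverage(test['Fare'], res.predict(test[features])) should come out close to q for each fitted quantile; a large gap points to miscalibration.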

Summary

By applying quantile regression to urban taxi data, we obtain distribution‑aware fare forecasts:

  • The 10th-percentile model informs pricing floors and budget-service pricing by anticipating low-fare, short trips.
  • The median (50th-percentile) model predicts typical trip fares, informing standard pricing strategies.
  • The 90th-percentile model forecasts premium-fare scenarios, such as peak-hour or long-distance trips, so platforms can provision for them.

These quantile-specific forecasts help urban logistics platforms set adaptive rates, manage driver supply, and plan revenue under demand variability.

ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.
