Transport Delay Prediction using Quantile Regression in ML


Logistics and transit operators need to understand not just the average delay but the range of possible outcomes—preparing for light‐traffic days (10th percentile) or heavy congestion (90th percentile).

In this project, we’ll predict the 10th, 50th, and 90th quantiles of flight arrival delays (in minutes) using routine flight attributes—such as scheduled departure time, carrier, origin airport, and distance—by fitting separate quantile regression models.
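As a quick illustration of what the three target quantiles mean, the sketch below computes empirical percentiles on a small, made-up sample of delays. The numbers are invented for illustration and are not drawn from the dataset used in this project.

```python
import numpy as np

# Made-up arrival delays in minutes (negative = early), for illustration only
delays = np.array([-5, 0, 2, 4, 8, 15, 30, 45, 60, 120])

# Empirical 10th, 50th, and 90th percentiles of the sample
q10, q50, q90 = np.quantile(delays, [0.1, 0.5, 0.9])
print(f"10th: {q10:.1f}  50th: {q50:.1f}  90th: {q90:.1f}")
```

Quantile regression estimates these same quantities, but conditional on the flight attributes rather than for the sample as a whole.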

Libraries Required

import pandas as pd                   # Data loading & manipulation  
import numpy as np                    # Numerical operations  
import statsmodels.formula.api as smf # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  # Train/test split  
from sklearn.metrics import mean_pinball_loss        # Quantile loss metric  

Dataset

Flight Delay Dataset 2018–2024

Step-by-Step Code Implementation

Load & Inspect Data

We load a comprehensive U.S. flight‐delay dataset (2018–2024) that includes scheduled times, carriers, origins, distances, and actual arrival delays. Initial checks (.info(), .describe()) show the column types, how many values are missing, and how the delays are distributed before any modeling.

# Load flight delay data (2018–2024)
# Source: Flight Delay Dataset 2018-2024 (Kaggle)
df = pd.read_csv("flight_delay_dataset_2018_2024.csv")

# Quick inspection
print(df.shape)       # rows, columns
df.info()             # prints types & non-null counts itself (returns None)
print(df[['ArrDelayMinutes']].describe())  # delay distribution

Preprocessing

  • We select six key predictors: month, day of week, scheduled departure time (CRSDepTime), carrier code (UniqueCarrier), origin airport (Origin), and flight distance.
  • We drop rows with a missing value in any selected column (this removes flights with no recorded arrival delay, such as cancellations) and apply no filter on the sign of the delay, so the full recorded distribution is modeled.
  • Categorical variables (DayOfWeek, UniqueCarrier, Origin) are one‐hot encoded, excluding one level each to avoid multicollinearity.
  • We assemble our feature matrix X and response y.

# Select relevant columns; dropna() removes rows missing ANY of them,
# which includes flights with no recorded arrival delay
df = df[['Month','DayOfWeek','CRSDepTime','UniqueCarrier',
         'Origin','Distance','ArrDelayMinutes']].dropna()

# No further filtering on the delay itself: the full recorded range is kept.
# Caveat: in BTS-style data, ArrDelayMinutes usually records early arrivals
# as 0; if your file has a signed ArrDelay column, use it to model early arrivals.

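# One-hot encode categorical features (dtype=int gives numeric 0/1 columns
# rather than booleans, so the formula treats them as plain regressors)
df = pd.get_dummies(df,
    columns=['DayOfWeek','UniqueCarrier','Origin'],
    drop_first=True,
    dtype=int)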

# Define predictors and response
features = [c for c in df.columns
            if c != 'ArrDelayMinutes']
X = df[features]
y = df['ArrDelayMinutes']
data = pd.concat([X, y], axis=1)
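To see what drop_first=True does, here is a minimal sketch on a toy column with three made-up carrier codes. The frame and the dtype=int choice are assumptions of this example, not part of the project data.

```python
import pandas as pd

# Toy carrier column (made-up rows) with three distinct levels
toy = pd.DataFrame({"UniqueCarrier": ["AA", "DL", "UA", "AA"]})

# drop_first=True drops the first level ("AA"), which becomes the baseline
encoded = pd.get_dummies(toy, columns=["UniqueCarrier"],
                         drop_first=True, dtype=int)
print(encoded)
```

Rows for the dropped level are all zeros, so each remaining coefficient is read as a difference from that baseline carrier.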

Train/Test Split

We randomly hold out 20% of flights so the quantile models can be evaluated on data they were not fitted to.

# Reserve 20% for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)

Fit Quantile Regression Models

For each quantile (10th, 50th, 90th):

  • We construct a formula string relating ArrDelayMinutes to the predictors.
  • We fit a QuantReg model at that quantile on the training set.
  • We print the coefficient table, showing how each factor’s effect differs across light, median, and heavy delays (e.g., some carriers may have smaller median delays but larger extremes).

quantiles = [0.1, 0.5, 0.9]
results   = {}
# Q("...") quotes each column name so dummy names that are not valid
# Python identifiers still parse in the formula
formula   = "ArrDelayMinutes ~ " + " + ".join(f'Q("{c}")' for c in features)

for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])  # just the coefficient table
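The idea that a predictor can matter differently at different quantiles is easier to see on synthetic data. The sketch below is an illustration under stated assumptions: the data-generating process is invented so that the spread of the delay grows with distance, and the column names Distance and Delay are placeholders rather than fields from the flight dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
distance = rng.uniform(100, 2500, n)

# Made-up process: noise whose spread increases with distance
delay = 5 + 0.002 * distance + rng.normal(0, 1, n) * (2 + 0.01 * distance)
toy = pd.DataFrame({"Distance": distance, "Delay": delay})

# Fit one quantile regression per quantile and collect the Distance slope
slopes = {}
for q in (0.1, 0.5, 0.9):
    fit = smf.quantreg("Delay ~ Distance", toy).fit(q=q)
    slopes[q] = fit.params["Distance"]
    print(f"q={q}: slope on Distance = {slopes[q]:.4f}")
```

Because the noise widens with distance, the fitted slope should rise from the 10th to the 90th percentile, which an ordinary least-squares fit would average away.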

Evaluation with Pinball Loss

  • We predict quantile‐specific delays on the test set.
  • We compute pinball loss for each quantile, a proper scoring rule that penalizes under‐ and over‐predictions asymmetrically, matching the quantile’s focus. Lower pinball loss indicates better quantile fit.

for q, res in results.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['ArrDelayMinutes'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
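For intuition about the asymmetry, the pinball loss can be written out by hand. The function below is a plain-Python sketch of the standard definition (q times the error when the prediction is too low, 1 - q times the error when it is too high); pinball_loss is a name chosen for this example, and the numbers are made up.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for quantile level q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff >= 0) costs q per unit of error;
        # over-prediction costs (1 - q) per unit of error.
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

y_true = [10, 20, 30]   # made-up observed delays
y_pred = [15, 15, 15]   # one constant prediction for all flights

loss_q90 = pinball_loss(y_true, y_pred, 0.9)
loss_q10 = pinball_loss(y_true, y_pred, 0.1)
print(f"q=0.9 loss: {loss_q90:.3f}")
print(f"q=0.1 loss: {loss_q10:.3f}")
```

The same prediction is penalized more heavily at q=0.9 here because two of the three flights were under-predicted, and under-prediction is what the 90th-percentile loss weights most.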

Summary

Quantile regression captures the heterogeneous effects of flight features across different points in the delay distribution. For instance, a busy origin airport or a late scheduled departure may raise the 90th‐percentile delay far more than the median.

By modeling the 10th, 50th, and 90th percentiles separately, airlines and air‐traffic managers gain distribution‐aware forecasts—planning for best‐case punctuality, typical performance, and worst‐case congestion. These insights support robust scheduling, crew planning, and passenger‐communication strategies under uncertainty.


ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.
