Train Ridership Growth Prediction with Polynomial Regression in ML

Transit agencies need to forecast daily train ridership growth (%)—the per cent change in total boardings from one day to the next—using only features available at dispatch time (prior‑day ridership, day‑of‑week, month, and service‑level flags) to adjust schedules and staffing in real time. Ridership curves exhibit nonlinear patterns: growth tapers on weekends, surges around holidays, and interacts with seasonality. A simple linear model underfits these curvatures, while an unconstrained polynomial overfits noise. By fitting a Polynomial Regression model on engineered features with Ridge (ℓ²) regularisation, we can learn a smooth, interpretable growth model that generalises well across operational conditions.
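
As a rough illustration of how the ℓ² penalty tames an over-flexible fit, the sketch below solves one-feature ridge regression in closed form on a handful of made-up points. The data, the helper name ridge_slope, and the alpha values are assumptions chosen for illustration; none of them come from the MTA dataset.

```python
def ridge_slope(x, y, alpha):
    """Slope w minimising sum((y - w*x)^2) + alpha * w^2 (no intercept term)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

# Made-up, roughly linear points centred on zero (illustrative only)
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

# Compare the fitted slope under no, moderate, and heavy regularisation
slopes = {alpha: ridge_slope(x, y, alpha) for alpha in (0.0, 10.0, 1000.0)}
for alpha, w in slopes.items():
    print(f"alpha={alpha:>7}: slope={w:.3f}")
```

With alpha = 0 the slope is the ordinary least-squares estimate; larger alpha values pull it toward zero, which is the same mechanism that damps noisy high-order polynomial terms.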

Dataset

MTA Daily Ridership Data

Step-by-Step Code Implementation

1. Libraries Required

import pandas as pd                     # data loading & handling  
import numpy as np                      # numerical operations  

import matplotlib.pyplot as plt         # plotting  
import seaborn as sns                   # enhanced visualization  

from sklearn.model_selection import train_test_split, GridSearchCV  
from sklearn.preprocessing import StandardScaler, PolynomialFeatures  
from sklearn.linear_model import Ridge  
from sklearn.pipeline import Pipeline  
from sklearn.metrics import mean_squared_error, r2_score

2. Load Data & Compute Growth

Target calculation: growth_pct = percent change from prior day’s total boardings—a direct measure of ridership momentum.

import pandas as pd

# Load and parse date
df = pd.read_csv("data/mta_daily_ridership.csv", parse_dates=["date"])

# Assume columns: ['date','entries','exits']; total boardings ≈ entries+exits
df['ridership'] = df['entries'] + df['exits']

# Compute day‑over‑day growth (%)
df = df.sort_values("date")
df['ridership_prev'] = df['ridership'].shift(1)
df['growth_pct'] = (df['ridership'] - df['ridership_prev']) / df['ridership_prev'] * 100

# Drop first row with NaN
df = df.dropna(subset=['growth_pct'])
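
To make the target definition concrete, here is a small worked example of the same day-over-day formula in plain Python. The boardings figures and the helper name day_over_day_growth are hypothetical and only illustrate the arithmetic.

```python
def day_over_day_growth(ridership):
    """Percent change from each day to the next; the first day has no prior value."""
    growth = []
    for prev, curr in zip(ridership, ridership[1:]):
        growth.append((curr - prev) / prev * 100)
    return growth

# Hypothetical total boardings for three consecutive days
boardings = [1000, 1100, 990]
growth = day_over_day_growth(boardings)
print([round(g, 1) for g in growth])
```

Three days of boardings yield two growth values, which is why the first row of the DataFrame is dropped after the shift.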

3. Feature Engineering

  • ridership_prev captures inertia;
  • day_of_week and month model weekly/seasonal patterns.

# Extract calendar features
df['day_of_week'] = df['date'].dt.dayofweek       # 0=Mon…6=Sun
df['month']       = df['date'].dt.month

# Select predictor matrix and target
X = df[['ridership_prev','day_of_week','month']]
y = df['growth_pct']
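
The calendar features can also be derived with the standard library alone, which makes the encoding explicit. The sketch below is an assumption-laden illustration: the helper calendar_features and the one-hot variant are not part of the pipeline above, which feeds day_of_week to the model as a single integer.

```python
from datetime import date

def calendar_features(d):
    """Return integer and one-hot calendar encodings for a single date."""
    dow = d.weekday()  # 0=Mon … 6=Sun, the same convention as pandas dt.dayofweek
    one_hot = [1 if dow == k else 0 for k in range(7)]
    return {"day_of_week": dow, "month": d.month, "dow_one_hot": one_hot}

feats = calendar_features(date(2024, 7, 4))
print(feats)
```

An integer day_of_week implies an ordering (Sunday is "larger" than Monday), whereas a one-hot vector lets each day carry its own coefficient; which encoding suits the data better is a modelling choice worth testing.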

4. Build a Polynomial Regression Pipeline

  • StandardScaler standardises every input column (ridership_prev, day_of_week, month) to zero mean and unit variance, so the ℓ² penalty treats them on a comparable scale.
  • PolynomialFeatures expands inputs into powers and interactions (e.g., (ridership_prev)², ridership_prev×month), capturing curvature and seasonal interactions.
  • Ridge regression applies an ℓ² penalty (alpha) to shrink noisy high-order coefficients.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

pipe = Pipeline([
    ('scale', StandardScaler()),          # normalize ridership scale
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge(random_state=42))
])
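
To get a feel for how quickly the expansion grows, the following sketch enumerates the monomials of a polynomial expansion by hand using only the standard library. The helper polynomial_terms is hypothetical and only mirrors the idea behind PolynomialFeatures with include_bias=False; it does not call scikit-learn.

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree):
    """List every monomial of total degree 1..degree, without a bias term."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append(" * ".join(combo))
    return terms

feats = ["ridership_prev", "day_of_week", "month"]
counts = {degree: len(polynomial_terms(feats, degree)) for degree in (1, 2, 3)}
print(counts)
print(polynomial_terms(feats, 2))
```

Three inputs become 9 columns at degree 2 and 19 at degree 3, which is why the penalty on high-order coefficients matters more as the degree rises.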

5. Train/Test Split & Hyperparameter Search

  • degree (1–3) balances underfitting vs. overfitting.
  • alpha (10⁻³–10³) controls regularisation strength.
  • A 5-fold time-aware split (TimeSeriesSplit) optimises RMSE on held-out folds; each fold trains only on days that precede its validation window.

from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=False  # preserve time order
)

param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7)
}

gs = GridSearchCV(
    pipe, param_grid,
    cv=TimeSeriesSplit(n_splits=5),   # expanding-window folds; an integer cv would use plain K-fold
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print("Best degree:", gs.best_params_['poly__degree'])
print("Best alpha :", gs.best_params_['ridge__alpha'])
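
What "time-aware" means for cross-validation can be sketched without any library: every validation window must come after the data used for training. The function below is a simplified, hypothetical expanding-window splitter written for illustration; scikit-learn's TimeSeriesSplit follows a similar idea, but this code is not its implementation.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, val_indices) where training always precedes validation."""
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = n_samples - (n_splits - k + 1) * fold_size
        train_idx = list(range(0, train_end))
        val_idx = list(range(train_end, train_end + fold_size))
        yield train_idx, val_idx

splits = list(expanding_window_splits(n_samples=12, n_splits=5))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Because the training window only ever grows forward in time, no fold is scored on days earlier than the ones it was fitted on, which avoids leaking future ridership into the model selection step.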

6. Evaluate Model

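from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

y_pred = gs.predict(X_test)
# Take the square root explicitly: squared=False was deprecated in scikit-learn 1.4 and removed in later releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))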
r2   = r2_score(y_test, y_pred)

print(f"Test RMSE: {rmse:.2f}% growth")
print(f"Test R²  : {r2:.3f}")
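
For readers who want to see what the two metrics measure, here is a hand-rolled version in plain Python. The function names and the four toy values are assumptions made for illustration; they are not results from the ridership model.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in target units."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Share of the target's variance explained, relative to always predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy growth values (percent), invented for the example
y_true = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 1.5, 3.5, 3.5]
print(f"RMSE: {rmse(y_true, y_hat):.2f}  R2: {r2(y_true, y_hat):.2f}")
```

RMSE is expressed in the same units as the target (percentage points of growth here), while R² is unitless and can turn negative when a model does worse than predicting the mean.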

7. Inspect Key Polynomial Coefficients

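Interpretation: The largest coefficients, such as ridership_prev² or ridership_prev × day_of_week, indicate how prior ridership and its interaction with day of week relate to growth rates. Because day_of_week enters the model as a single integer (0–6) rather than as one-hot columns, per-day terms such as day_of_week_6 do not appear among the expanded feature names.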

# Reconstruct feature names after expansion
poly   = gs.best_estimator_.named_steps['poly']
input_feats = ['ridership_prev','day_of_week','month']
feat_names  = poly.get_feature_names_out(input_features=input_feats)

coefs = gs.best_estimator_.named_steps['ridge'].coef_

import pandas as pd
imp = pd.Series(coefs, index=feat_names).abs() \
         .sort_values(ascending=False).head(10)

import matplotlib.pyplot as plt
plt.figure(figsize=(8,4))
imp.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Ridership Growth")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
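
Ranking by absolute value hides whether a term pushes growth up or down. One possible variation, sketched below, sorts by magnitude but keeps the sign. The coefficient values are invented purely to illustrate the ranking step and do not come from a fitted model.

```python
# Hypothetical coefficients, invented purely to illustrate the ranking step
coefs = {
    "ridership_prev": -1.8,
    "day_of_week": 0.4,
    "ridership_prev^2": 0.9,
    "ridership_prev day_of_week": -0.2,
}

# Sort by magnitude while keeping the original signed values
ranked = sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    direction = "raises" if value > 0 else "lowers"
    print(f"{name:<28} {value:+.2f}  ({direction} predicted growth as the term increases)")
```

Since the inputs are standardised before expansion, the signs refer to scaled features, so they are best read as directions rather than as effects per rider or per day.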

Summary

This Polynomial Regression approach with Ridge regularisation offers a regularised, interpretable model for daily train ridership growth:

  • Captures nonlinear momentum effects from prior ridership (e.g., diminishing returns on busy days).
  • Accounts for weekly/seasonal cycles via day‑of‑week and month terms and their interactions.
  • Controls complexity through grid‑searched degree and α values, with RMSE and R² measured on a chronological hold‑out set so the reported fit reflects performance on unseen days.

ProjectGurukul Team

ProjectGurukul Team specializes in creating project-based learning resources for programming, Java, Python, Android, AI, Webdevelopment and machine learning. Our mission is to help learners build practical skills through engaging, hands-on projects. We also offer free major and minor projects with source code for engineering students
