Train Ridership Growth Prediction with Polynomial Regression in ML
Transit agencies need to forecast daily train ridership growth (%), the per cent change in total boardings from one day to the next, using only features available at dispatch time (prior‑day ridership, day‑of‑week, month, and service‑level flags) to adjust schedules and staffing in real time. Ridership curves exhibit nonlinear patterns: growth tapers on weekends, surges around holidays, and interacts with seasonality. A simple linear model underfits this curvature, while an unregularised high‑degree polynomial overfits noise. By fitting a Polynomial Regression model on engineered features with Ridge (ℓ²) regularisation, we can learn a smooth growth model whose coefficients can be inspected directly. The code in this walkthrough uses prior‑day ridership, day‑of‑week, and month; service‑level flags are not included in the example.
Dataset
Step-by-Step Code Implementation
1. Libraries Required
import pandas as pd                # data loading & handling
import numpy as np                 # numerical operations
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # enhanced visualization
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
2. Load Data & Compute Growth
Target calculation: growth_pct = percent change from prior day’s total boardings—a direct measure of ridership momentum.
import pandas as pd
# Load and parse date
df = pd.read_csv("data/mta_daily_ridership.csv", parse_dates=["date"])
# Assume columns: ['date','entries','exits']; total boardings ≈ entries+exits
df['ridership'] = df['entries'] + df['exits']
# Compute day‑over‑day growth (%)
df = df.sort_values("date")
df['ridership_prev'] = df['ridership'].shift(1)
df['growth_pct'] = (df['ridership'] - df['ridership_prev']) / df['ridership_prev'] * 100
# Drop first row with NaN
df = df.dropna(subset=['growth_pct'])
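As a quick check of the growth formula, the sketch below applies it to a tiny made-up table and compares it with pandas' built-in `pct_change`. The dates and ridership figures are invented for illustration and are not taken from the MTA file.

```python
import pandas as pd

# Hypothetical toy data (deliberately out of order to show why sorting matters)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-03", "2024-01-01", "2024-01-02", "2024-01-04"]),
    "ridership": [1210, 1000, 1100, 1089],
}).sort_values("date").reset_index(drop=True)

# Manual formula, as in the step above
prev = toy["ridership"].shift(1)
manual = (toy["ridership"] - prev) / prev * 100

# Equivalent built-in: fractional change from the previous row, scaled to %
builtin = toy["ridership"].pct_change() * 100

toy["growth_pct"] = builtin
print(toy)  # growth_pct is NaN on the first day, then about 10, 10, -10
```

Both routes leave a NaN on the first day because there is no prior day to compare against, which is why that row is dropped.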
3. Feature Engineering
- ridership_prev captures inertia;
- day_of_week and month model weekly/seasonal patterns.
# Extract calendar features
df['day_of_week'] = df['date'].dt.dayofweek  # 0=Mon…6=Sun
df['month'] = df['date'].dt.month
# Select predictor matrix and target
X = df[['ridership_prev','day_of_week','month']]
y = df['growth_pct']
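The calendar features above enter the model as plain integers, so Monday=0 and Sunday=6 are treated as points on a numeric scale. The sketch below shows the extraction on a made-up week of dates and, as one possible variant that is not part of the pipeline in this article, a one-hot encoding of the weekday with `pd.get_dummies`. The column prefix is an illustrative choice.

```python
import pandas as pd

# Hypothetical week of dates; 2024-01-01 is a Monday
week = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=7, freq="D")})

# Integer calendar features, as used in the article's predictor matrix
week["day_of_week"] = week["date"].dt.dayofweek  # 0=Mon ... 6=Sun
week["month"] = week["date"].dt.month

# Optional variant (an assumption, not the article's method):
# one indicator column per weekday instead of a single 0-6 integer
dow_onehot = pd.get_dummies(week["day_of_week"], prefix="day_of_week", dtype=int)

print(week)
print(dow_onehot.columns.tolist())
```

With one-hot columns, each weekday gets its own coefficient, at the cost of a larger polynomial expansion later on.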
4. Build a Polynomial Regression Pipeline
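- StandardScaler standardises every input column (zero mean, unit variance), so that ridership_prev, with its much larger raw scale, does not dominate the polynomial terms or the Ridge penalty.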
- PolynomialFeatures expands inputs into powers and interactions (e.g., (ridership_prev)², ridership_prev×month), capturing curvature and seasonality synergy.
- Ridge regression applies an ℓ² penalty (alpha) to shrink noisy high-order coefficients.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipe = Pipeline([
    ('scale', StandardScaler()),                        # standardise all input features
    ('poly', PolynomialFeatures(include_bias=False)),   # degree is set by the grid search
    ('ridge', Ridge(random_state=42))                   # ℓ² penalty; alpha is set by the grid search
])
5. Train/Test Split & Hyperparameter Search
- degree (1–3) balances underfitting vs. overfitting.
- alpha (10⁻³–10³) controls regularisation strength.
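- A 5‑fold time‑aware split (TimeSeriesSplit, where each validation fold comes after its training data) selects the degree and alpha with the lowest RMSE on held‑out folds.
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
import numpy as np
# Hold out the most recent 20% of days as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False  # preserve time order
)
param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7)
}
# An integer cv=5 would use plain K-fold, which can validate on days
# that precede the training data; TimeSeriesSplit avoids that.
gs = GridSearchCV(
    pipe, param_grid,
    cv=TimeSeriesSplit(n_splits=5),
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)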
gs.fit(X_train, y_train)
print("Best degree:", gs.best_params_['poly__degree'])
print("Best alpha :", gs.best_params_['ridge__alpha'])
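One way to make the cross-validation split time-aware is scikit-learn's `TimeSeriesSplit`. The sketch below prints its folds for a made-up series of 12 days, so the expanding training window is visible. The series length is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_days = 12  # hypothetical number of chronologically ordered days
tscv_demo = TimeSeriesSplit(n_splits=5)

folds = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in tscv_demo.split(np.arange(n_days))
]

# Each fold trains on an initial stretch of days and validates on the days right after it
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={train_idx} validate={val_idx}")
```

Because every validation day lies after all of its training days, the cross-validated RMSE reflects forecasting forward in time rather than interpolating between known days.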
6. Evaluate Model
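import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
y_pred = gs.predict(X_test)
# squared=False was deprecated in scikit-learn 1.4 and later removed; take the square root explicitly
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f} percentage points of growth")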
print(f"Test R² : {r2:.3f}")
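To show what the two metrics measure, the sketch below computes RMSE and R² by hand on a few made-up growth values and compares them with scikit-learn's results. It also adds a naive "zero growth every day" baseline as a reference point, which is an illustrative addition and not part of the original evaluation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted day-over-day growth, in %
y_true = np.array([2.0, -1.0, 0.5, 3.0])
y_hat = np.array([1.5, -0.5, 0.0, 2.0])

resid = y_true - y_hat
rmse_manual = np.sqrt(np.mean(resid ** 2))        # typical error size, in percentage points
ss_res = np.sum(resid ** 2)                       # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

# Naive baseline: always predict 0% growth
rmse_zero = np.sqrt(np.mean(y_true ** 2))

print(f"RMSE (manual): {rmse_manual:.4f}")
print(f"RMSE (sklearn): {np.sqrt(mean_squared_error(y_true, y_hat)):.4f}")
print(f"R2 (manual): {r2_manual:.4f}  R2 (sklearn): {r2_score(y_true, y_hat):.4f}")
print(f"RMSE of zero-growth baseline: {rmse_zero:.4f}")
```

A model is only useful for scheduling if its test RMSE is clearly below that of such a trivial baseline.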
7. Inspect Key Polynomial Coefficients
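Interpretation: The largest‑magnitude coefficients, such as those on ridership_prev^2 or the ridership_prev × day_of_week interaction, indicate which terms contribute most to the predicted growth rate. Because the inputs are standardised before expansion, the magnitudes are roughly comparable across terms. Note that day_of_week enters as a single integer (0–6) here, so no separate day_of_week_6 column appears unless the weekday is one‑hot encoded first.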
# Reconstruct feature names after expansion
poly = gs.best_estimator_.named_steps['poly']
input_feats = ['ridership_prev','day_of_week','month']
feat_names = poly.get_feature_names_out(input_features=input_feats)
coefs = gs.best_estimator_.named_steps['ridge'].coef_
import pandas as pd
imp = pd.Series(coefs, index=feat_names).abs() \
.sort_values(ascending=False).head(10)
import matplotlib.pyplot as plt
plt.figure(figsize=(8,4))
imp.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Ridership Growth")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
Summary
This Polynomial Regression approach with Ridge regularisation gives a compact, inspectable model for daily train ridership growth:
- Captures nonlinear momentum effects from prior ridership (e.g., diminishing returns on busy days).
- Accounts for weekly/seasonal cycles via day‑of‑week and month terms and their interactions.
- Controls complexity through grid‑searched degree and α values; the test RMSE and R² from step 6 show how well the selected model generalises to the most recent days, and the coefficient plot shows which terms drive its predictions.