Traffic Congestion Cost Prediction with Lasso Regression in ML
Urban gridlock wastes fuel, pollutes the air, and drains productivity. To help city planners put a price tag on that waste, we will build a Lasso‑regularised linear model that:
- Predicts the economic cost (USD) of each congestion event based on delay length, traffic mix, weather, road type, and time‑of‑day features.
- Pinpoints the handful of variables that drive most of the cost—thanks to Lasso’s ℓ1 penalty—which enables budget‑conscious interventions such as retiming signals or adding bus lanes.
The target variable will be a derived congestion cost measure (delay minutes × value of time × vehicles affected).
Libraries Required
| Purpose | Library |
| --- | --- |
| Data wrangling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| Modelling pipeline | scikit‑learn → Lasso, Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, GridSearchCV |
| Metrics | mean_squared_error, r2_score |
Dataset Link
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
2. Download and load the dataset
The Kaggle file logs 5.9 million congestion events across 49 states, with timestamps, weather, road geometry, and delay metrics.
# One‑time shell command (needs Kaggle API credentials):
# kaggle datasets download -d sobhanmoosavi/us-traffic-congestions-2016-2022 -p data --unzip
data = pd.read_csv("data/traffic_congestion.csv") # adjust filename if needed
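Because the later steps assume particular column names, it may help to check the CSV header before loading the whole file. The sketch below is only an illustration: `missing_columns` is a hypothetical helper, and `Delay` and `VehiclesAffected` are the column names this tutorial assumes rather than a confirmed schema. It reads zero data rows, so it stays cheap even for a multi-gigabyte file.

```python
import os
import tempfile

import pandas as pd

# Column names assumed by the target-engineering step (not a confirmed schema)
REQUIRED = {"Delay", "VehiclesAffected"}


def missing_columns(csv_path, required=REQUIRED):
    """Return the required column names that are absent from the CSV header."""
    header = pd.read_csv(csv_path, nrows=0)  # header only, no data rows
    return sorted(set(required) - set(header.columns))


# Demo on a tiny throwaway file that lacks one of the assumed columns
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    pd.DataFrame({"Delay": [600], "State": ["OH"]}).to_csv(demo_path, index=False)
    demo_missing = missing_columns(demo_path)

print("Missing columns:", demo_missing)
```

If the list is non-empty for the real file, the column names in the next step need adjusting before anything else will run.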
3. Engineering a “congestion cost” target
Assigning a dollar value per vehicle‑minute converts abstract delay into a concrete cost that policy‑makers can act on. Multiplying by VehiclesAffected scales the impact from individual drivers to society‑level loss.
# Assume the raw dataset includes delay in seconds and an estimate of vehicle count
VALUE_OF_TIME = 0.30  # USD per vehicle‑minute (adjustable)
data['delay_minutes'] = data['Delay'] / 60  # convert to minutes
data['congestion_cost'] = data['delay_minutes'] * data['VehiclesAffected'] * VALUE_OF_TIME
y = data['congestion_cost']
X = data.drop(columns=['congestion_cost'])
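To make the arithmetic concrete, here is a small worked example on two made-up events. The numbers are invented for illustration, and the column names simply mirror the schema assumed above.

```python
import pandas as pd

VALUE_OF_TIME = 0.30  # USD per vehicle-minute, the same assumed constant as above

# Two invented congestion events (not taken from the real dataset)
toy = pd.DataFrame({
    "Delay": [600, 90],             # delay in seconds
    "VehiclesAffected": [150, 40],  # estimated vehicles caught in the event
})

toy["delay_minutes"] = toy["Delay"] / 60
toy["congestion_cost"] = (
    toy["delay_minutes"] * toy["VehiclesAffected"] * VALUE_OF_TIME
)
print(toy)
```

The first event works out to 10 minutes × 150 vehicles × $0.30 = $450, and the second to 1.5 minutes × 40 vehicles × $0.30 = $18. Changing `VALUE_OF_TIME` rescales every cost by the same factor.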
4. Pre‑processing recipe
ColumnTransformer one‑hot‑encodes categorical attributes (road class, weather condition, state) and z‑scales numeric ones (temperature, humidity, traffic speed). This standardisation ensures Lasso’s penalty treats each variable fairly.
cat_cols = X.select_dtypes('object').columns
num_cols = X.select_dtypes(exclude='object').columns
preprocess = ColumnTransformer([
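# sparse_output replaces the removed `sparse` argument (scikit-learn >= 1.2; use sparse=False on older versions)
# handle_unknown='ignore' keeps categories unseen during fitting from raising errors in CV folds
('cat', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False), cat_cols),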
('num', StandardScaler(), num_cols)
])
5. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
6. Build & tune Lasso pipeline
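Encapsulating pre‑processing and modelling in a single Pipeline prevents data leakage, because the encoder and scaler are refit on the training portion of every fold. A log‑spaced grid search over α balances sparsity and fit, and five‑fold CV averages the score over five different splits so that no single split decides the result. These folds are not time‑ordered, so they do not by themselves separate seasons or years in the seven‑year dataset.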
pipe = Pipeline([
('prep', preprocess),
('model', Lasso(max_iter=10_000, random_state=42))
])
param_grid = {'model__alpha': np.logspace(-3, 1, 25)} # 0.001 → 10
search = GridSearchCV(pipe, param_grid, cv=5,
scoring='neg_root_mean_squared_error', n_jobs=-1)
search.fit(X_train, y_train)
print("Optimal α:", search.best_params_['model__alpha'])
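The sparsity side of that trade-off can be seen on synthetic data. The sketch below is not the traffic model: it builds ten standardised random features of which only two actually drive the target, then counts how many coefficients survive at a few values of α.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 200, 10

X_syn = StandardScaler().fit_transform(rng.normal(size=(n_samples, n_features)))
# Only the first two columns influence the target; the other eight are noise
y_syn = 5.0 * X_syn[:, 0] - 3.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=n_samples)

nonzero_counts = {}
for alpha in [0.001, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_syn, y_syn)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha:<6} non-zero coefficients: {nonzero_counts[alpha]}")
```

Small α keeps nearly every column, while a sufficiently large α zeroes out even the true drivers, which is why the grid spans several orders of magnitude and lets cross-validation pick the point in between.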
7. Evaluate on the hold‑out set
RMSE expresses the typical prediction error in dollars; R² indicates the share of variance explained. A large α shrinks noisy columns to zero, leaving a concise set of interpretable cost drivers (e.g., a rush-hour indicator, heavy rain, or urban freeway class).
y_pred = search.predict(X_test)
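# squared=False was deprecated and later removed from mean_squared_error; taking the root works on every version
rmse = np.sqrt(mean_squared_error(y_test, y_pred))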
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
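For readers who want to see what the two metrics compute, the sketch below reproduces them by hand on four invented cost values and compares the results with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented true and predicted costs in USD, for illustration only
y_true = np.array([100.0, 250.0, 400.0, 50.0])
y_hat = np.array([110.0, 240.0, 380.0, 70.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

# R²: one minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE manual={rmse_manual:.2f} sklearn={rmse_sklearn:.2f}")
print(f"R²   manual={r2_manual:.4f} sklearn={r2_score(y_true, y_hat):.4f}")
```

Because RMSE squares the residuals before averaging, a few very expensive events that are badly predicted can dominate it, which is worth keeping in mind for a heavily skewed cost target.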
8. Inspect feature importance
The horizontal bar chart highlights the 20 most cost-sensitive factors, guiding planners toward high-leverage fixes—such as improving drainage on routes where “heavy rain” shows a strong positive coefficient.
# Recover one‑hot names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
coefs = search.best_estimator_.named_steps['model'].coef_
importance = (pd.Series(coefs, index=feature_names)
.sort_values(key=abs, ascending=False))
plt.figure(figsize=(9,6))
importance.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Congestion Cost (Lasso Coefficients)')
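# Numeric features were z-scaled, so their coefficients are USD per one standard deviation
plt.xlabel('Coefficient (USD per standardised unit)')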
plt.show()
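Beyond the chart, it can be useful to list which features Lasso kept and which it zeroed out. The sketch below uses a small Series of made-up coefficients standing in for `importance`; the feature names and values are hypothetical.

```python
import pandas as pd

# Made-up coefficients standing in for `importance`; names and values are hypothetical
toy_importance = pd.Series({
    "Weather_Heavy Rain": 42.0,
    "rush_hour": 35.5,
    "Humidity": 0.0,
    "RoadClass_Urban Freeway": -12.3,
    "Temperature": 0.0,
})

# Features with a non-zero coefficient, ordered by absolute size
kept = toy_importance[toy_importance != 0].sort_values(key=abs, ascending=False)
# Features whose coefficient was shrunk exactly to zero
dropped = toy_importance.index[toy_importance == 0].tolist()

print(f"Kept {len(kept)} of {len(toy_importance)} features")
print(kept)
print("Dropped:", dropped)
```

Applied to the real `importance` Series, the same two lines give the number of surviving cost drivers, which is a quick check on how aggressive the selected α turned out to be.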
Summary
This notebook demonstrates how Lasso regression can translate raw traffic logs into a precise dollar estimate of congestion loss and a ranked list of its root causes. Planners can rerun the pipeline quarterly with new data, adjust the value‑of‑time constant to local wage rates, and quickly detect whether recent infrastructure projects are actually cutting the economic burden of gridlock.