Traffic Congestion Cost Prediction with Lasso Regression in ML


Urban gridlock wastes fuel, pollutes the air, and drains productivity. To help city planners put a price tag on that waste, we will build a Lasso‑regularised linear model that:

Predicts the economic cost (USD) of each congestion event based on delay length, traffic mix, weather, road type, and time‑of‑day features.
Pinpoints the handful of variables that drive most of the cost—thanks to Lasso’s ℓ1 penalty—which enables budget‑conscious interventions such as retiming signals or adding bus lanes.

The target variable will be a derived congestion cost measure (delay minutes × value of time × vehicles affected).

Libraries Required

Purpose	Library
Data wrangling	pandas, numpy
Visualisation	matplotlib, seaborn
Modelling pipeline	scikit‑learn → Lasso, Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, GridSearchCV
Metrics	mean_squared_error, r2_score

Dataset Link

US Traffic Congestions (2016–2022), Kaggle dataset sobhanmoosavi/us-traffic-congestions-2016-2022

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

2. Download and load the dataset

The Kaggle file logs 5.9 million congestion events across 49 states, with timestamps, weather, road geometry, and delay metrics.

# One‑time shell command (needs Kaggle API credentials):
# kaggle datasets download -d sobhanmoosavi/us-traffic-congestions-2016-2022 -p data --unzip

data = pd.read_csv("data/traffic_congestion.csv")   # adjust filename if needed
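
Because the full file is large, it can help to parse only a few rows first and confirm the column names and dtypes that the later steps rely on. The sketch below uses an in-memory toy CSV so it runs on its own; the column names in it (Delay, VehiclesAffected, Weather) are placeholders for illustration, not a confirmed schema of the Kaggle file.

```python
import io
import pandas as pd

# Toy stand-in for the real file; column names are hypothetical.
toy_csv = io.StringIO(
    "Delay,VehiclesAffected,Weather\n"
    "600,120,Rain\n"
    "300,80,Clear\n"
    "900,,Snow\n"
    "1200,200,Rain\n"
)

# nrows limits how much is parsed, which keeps a first look at a big file fast.
sample = pd.read_csv(toy_csv, nrows=3)

print(sample.dtypes)        # which columns are numeric vs. text
print(sample.isna().sum())  # missing values per column
```

With the real data, pass the file path instead of toy_csv and raise nrows as needed.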

3. Engineering a “congestion cost” target

Assigning a dollar value per vehicle‑minute converts abstract delay into a concrete cost that policy‑makers can act on. Multiplying by VehiclesAffected scales the impact from individual drivers to society‑level loss.

# Assumes the raw dataset has a 'Delay' column in seconds and a
# 'VehiclesAffected' estimate; rename these to match your file.
VALUE_OF_TIME = 0.30  # USD per vehicle‑minute (adjustable)

data['delay_minutes']   = data['Delay'] / 60        # convert seconds to minutes
data['congestion_cost'] = data['delay_minutes'] * data['VehiclesAffected'] * VALUE_OF_TIME

y = data['congestion_cost']
# Drop the target, plus delay_minutes, which is only a rescaled copy of Delay
X = data.drop(columns=['congestion_cost', 'delay_minutes'])
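
To make the formula concrete, here is a minimal worked example for a single event. The event values (a 600-second delay affecting 150 vehicles) and the helper name congestion_cost are made up for illustration.

```python
VALUE_OF_TIME = 0.30  # USD per vehicle-minute, same constant as above

def congestion_cost(delay_seconds, vehicles_affected, value_of_time=VALUE_OF_TIME):
    """Cost in USD = delay in minutes x vehicles affected x value of time."""
    delay_minutes = delay_seconds / 60
    return delay_minutes * vehicles_affected * value_of_time

# Hypothetical event: a 10-minute (600 s) delay affecting 150 vehicles.
cost = congestion_cost(delay_seconds=600, vehicles_affected=150)
print(f"${cost:,.2f}")  # 10 min x 150 vehicles x $0.30 = $450.00
```

Doubling VALUE_OF_TIME doubles every cost, so the constant rescales the target without changing which events rank as most expensive.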

4. Pre‑processing recipe

ColumnTransformer one‑hot‑encodes categorical attributes (road class, weather condition, state) and z‑scales numeric ones (temperature, humidity, traffic speed). Because the ℓ1 penalty acts on coefficient magnitudes, putting the numeric features on a common scale stops a variable from being penalised more or less heavily just because of its units.

cat_cols = X.select_dtypes('object').columns
num_cols = X.select_dtypes(exclude='object').columns

preprocess = ColumnTransformer([
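    # sparse_output replaced the old `sparse` argument in scikit-learn 1.2;
    # handle_unknown='ignore' avoids errors on categories unseen during fitting
    ('cat', OneHotEncoder(drop='first', sparse_output=False,
                          handle_unknown='ignore'), cat_cols),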
    ('num', StandardScaler(), num_cols)
])
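
To see what the two transformers do to the data, the same steps can be reproduced by hand on a tiny frame. This is only a sketch with pandas, not the pipeline itself, and the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical mini-frame: one categorical and one numeric feature.
toy = pd.DataFrame({
    "Weather": ["Clear", "Rain", "Snow", "Rain"],
    "Temperature": [20.0, 10.0, 0.0, 10.0],
})

# One-hot encode, dropping the first (alphabetical) level as the baseline.
dummies = pd.get_dummies(toy["Weather"], prefix="Weather", drop_first=True).astype(float)

# Z-score with the population standard deviation (ddof=0), as StandardScaler does.
temp = toy["Temperature"]
temp_z = (temp - temp.mean()) / temp.std(ddof=0)

encoded = pd.concat([dummies, temp_z.rename("Temperature_z")], axis=1)
print(encoded)
```

"Clear" becomes the implicit baseline, so the Rain and Snow coefficients are read as differences from clear weather.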

5. Train/test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

6. Build & tune Lasso pipeline

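Wrapping pre‑processing and modelling in a single Pipeline prevents data leakage: within each cross‑validation fold, the encoder and scaler are fitted on the training portion only. A log‑spaced grid search over α trades sparsity against fit, and five‑fold CV averages the score over five different validation splits so the chosen α does not depend on a single split of the seven‑year dataset.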

pipe = Pipeline([
    ('prep', preprocess),
    ('model', Lasso(max_iter=10_000, random_state=42))
])

param_grid = {'model__alpha': np.logspace(-3, 1, 25)}   # 0.001 → 10
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error', n_jobs=-1)
search.fit(X_train, y_train)

print("Optimal α:", search.best_params_['model__alpha'])
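
Before launching the search on millions of rows, it may be worth smoke-testing the same Pipeline and GridSearchCV pattern on synthetic data where the answer is known. The sketch below is an assumption-laden toy: the data are random, only the first three of ten features matter, and all names prefixed with toy_ are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only the first three influence the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 10))
true_coefs = np.zeros(10)
true_coefs[:3] = [5.0, -3.0, 2.0]
y_toy = X_toy @ true_coefs + rng.normal(scale=0.5, size=300)

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(max_iter=10_000)),
])
toy_grid = {"model__alpha": np.logspace(-3, 1, 9)}  # coarser grid than above
toy_search = GridSearchCV(toy_pipe, toy_grid, cv=5,
                          scoring="neg_root_mean_squared_error")
toy_search.fit(X_toy, y_toy)

best_alpha = toy_search.best_params_["model__alpha"]
best_coefs = toy_search.best_estimator_.named_steps["model"].coef_
print("best alpha:", best_alpha)
print("non-zero at best alpha:", np.flatnonzero(best_coefs))

# Refit with a deliberately large alpha to see the penalty prune features.
heavy = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(alpha=1.0, max_iter=10_000)),
]).fit(X_toy, y_toy)
heavy_coefs = heavy.named_steps["model"].coef_
print("non-zero at alpha=1.0:", np.flatnonzero(heavy_coefs))
```

Comparing the two printed index lists shows the trade-off the grid search navigates: a small α keeps more features for a tighter fit, while a large α keeps fewer.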

7. Evaluate on the hold‑out set

RMSE expresses the typical size of the prediction error in dollars; R² indicates the share of variance explained. A larger α shrinks more coefficients to exactly zero, leaving a concise set of interpretable cost drivers (e.g., a rush-hour indicator, heavy rain, or urban freeway class).

y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # recent scikit-learn releases no longer accept squared=False
r2   = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
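
For readers who want to see what the two metrics compute, here is a small by-hand version with NumPy. The four cost values are made up purely for illustration.

```python
import numpy as np

# Made-up costs in USD, purely for illustration.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

residuals = y_true - y_hat
toy_rmse = np.sqrt(np.mean(residuals ** 2))      # sqrt(250), about 15.81
ss_res = np.sum(residuals ** 2)                  # 1000
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 50000
toy_r2 = 1 - ss_res / ss_tot                     # 0.98

print(f"RMSE: ${toy_rmse:,.2f} | R²: {toy_r2:.2f}")
```

Because the errors are squared before averaging, a few very expensive events that are predicted badly can dominate RMSE.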

8. Inspect feature importance

The horizontal bar chart shows the 20 features with the largest absolute coefficients, guiding planners toward high-leverage fixes, such as improving drainage on routes where “heavy rain” shows a strong positive coefficient.

# Recover one‑hot names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])

coefs = search.best_estimator_.named_steps['model'].coef_
importance = (pd.Series(coefs, index=feature_names)
                .sort_values(key=abs, ascending=False))

plt.figure(figsize=(9,6))
importance.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Congestion Cost (Lasso Coefficients)')
plt.xlabel('Coefficient (USD; numeric features are per 1 SD)')
plt.show()
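
Since the numeric features are z‑scaled before fitting, each numeric coefficient is in USD per one standard deviation of that feature. Dividing by the feature's standard deviation converts it to USD per raw unit. The sketch below uses made-up numbers; in the fitted pipeline the standard deviations are available on the StandardScaler's scale_ attribute.

```python
import numpy as np

# Made-up values for two numeric features; not results from the dataset.
numeric_names = np.array(["Temperature", "Humidity"])
coef_per_sd = np.array([120.0, -45.0])  # USD per 1 SD (what Lasso reports)
feature_sd = np.array([15.0, 30.0])     # SD of each raw feature

coef_per_unit = coef_per_sd / feature_sd  # USD per raw unit
for name, c in zip(numeric_names, coef_per_unit):
    print(f"{name}: {c:+.2f} USD per unit")
```

One-hot coefficients need no such conversion: each is the cost difference relative to the dropped baseline category.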

Summary

This notebook demonstrates how Lasso regression can translate raw traffic logs into a dollar estimate of congestion loss and a ranked list of its strongest predictors. Planners can rerun the pipeline quarterly with new data, adjust the value‑of‑time constant to local wage rates, and quickly detect whether recent infrastructure projects are actually cutting the economic burden of gridlock.
