Air Quality Cost Prediction using Bayesian Regression in ML
FREE Online Courses: Transform Your Career – Enroll for Free!
Municipal environmental agencies need to estimate the per‑day cost of air-quality mitigation measures—before allocating budgets for pollution control—using real-time pollutant readings, such as PM₂.₅, PM₁₀, NO₂, O₃, and SO₂. Mitigation costs (e.g., deploying scrubbers, adjusting traffic flows, issuing health advisories) scale nonlinearly with pollutant concentrations (e.g., costs escalate sharply once PM₂.₅ exceeds health thresholds). They are subject to uncertainty from sensor noise and weather variability. A single point‐estimate model hides this risk. By applying Bayesian Regression, we produce:
1. A point forecast of daily mitigation cost.
2. A credible interval quantifying uncertainty—enabling data‑driven budget allocation and response planning.
Libraries Required
import pandas as pd # data loading & manipulation import numpy as np # numerical operations import matplotlib.pyplot as plt # plotting import seaborn as sns # visualization import pymc3 as pm # Bayesian modeling import arviz as az # posterior analysis from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.metrics import mean_absolute_error
Dataset
Step-by-Step Code Implementation
Data Loading & Synthetic Cost Computation
We compute a daily mitigation cost by charging $20/µg m³ for PM₂.₅ exceedances above WHO thresholds and $5–$10/µg m³ for other pollutants, resulting in a continuous target that rises sharply as air quality worsens.
import pandas as pd
import numpy as np
# Load global air‐quality data
df = pd.read_csv("data/global_air_quality.csv")
# Select pollutant features and drop missing
pollutants = ['PM2.5', 'PM10', 'NO2', 'O3', 'SO2']
df = df[pollutants].dropna()
# Synthetic mitigation cost: assume $20 per µg/m³ above WHO threshold for PM2.5
# plus $10 per µg/m³ above threshold for other pollutants
thresholds = {'PM2.5':10, 'PM10':20, 'NO2':40, 'O3':100, 'SO2':20}
cost_factors = {'PM2.5':20, 'PM10':10, 'NO2':10, 'O3':5, 'SO2':5}
def compute_cost(row):
cost = 0
for p in pollutants:
exceed = max(row[p] - thresholds[p], 0)
cost += exceed * cost_factors[p]
return cost
df['mitigation_cost'] = df.apply(compute_cost, axis=1)
# Features & target
X = df[pollutants].values
y = df['mitigation_cost'].values # USD per day
Preprocessing & Train/Test Split
We z‑score pollutant features so that Normal(0,100) priors on slopes operate uniformly and the MCMC sampler converges stably.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# 80/20 random split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Standardize pollutant features for stable MCMC
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
Define & Fit Bayesian Regression Model
Priors:
- α ∼ Normal(0, 1 000) for baseline cost.
- β ∼ Normal(0, 100) to allow moderate sensitivity per standardised pollutant.
- σ ∼ HalfNormal(100) for measurement and model error.
Bayesian model: Observed cost ∼ Normal(α + β·X_std, σ).
Inference: We draw 2,000 posterior samples (after 1,000 burn‑in) with target_accept=0.9 for reliable convergence.
import pymc3 as pm
with pm.Model() as aq_cost_model:
# Priors
α = pm.Normal("α", mu=0, sigma=1000) # intercept
β = pm.Normal("β", mu=0, sigma=100, shape=X_train_s.shape[1]) # slopes
σ = pm.HalfNormal("σ", sigma=100) # residual noise
# Linear predictor
μ = α + pm.math.dot(X_train_s, β)
# Likelihood
Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)
# MCMC sampling
trace = pm.sample(
draws=2000, # posterior draws
tune=1000, # burn‑in
target_accept=0.9,
return_inferencedata=True
)
Posterior Analysis & Point Predictions
- Posterior predictive: Sampling yields full predictive distributions; we extract both the posterior mean forecast and the 94% Highest Posterior Density interval across PM₂.₅ values.
- Evaluation: The Mean Absolute Error (MAE) on held‑out test data quantifies the average deviation of our cost forecasts.
import arviz as az
from sklearn.metrics import mean_absolute_error
# Summarize posterior
az.summary(trace, round_to=2)
# Posterior predictive sampling
with aq_cost_model:
ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
# Point predictions on test set
y_pred = α_post + X_test_s.dot(β_post)
# Evaluate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f} per day")
Visualise Predictions & Credible Intervals
By sweeping PM₂.₅ concentration (the primary health concern) while holding other pollutants constant, we plot the expected mitigation cost curve and its credible band—showing how cost escalates with pollution and how uncertain that escalation is.
import numpy as np
import matplotlib.pyplot as plt
# Sweep PM2.5 while holding other pollutants at median
grid = np.median(X_train_s, axis=0)[None,:].repeat(100, axis=0)
pm25_vals = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid[:,0] = pm25_vals
with aq_cost_model:
ppc_grid = pm.sample_posterior_predictive(
trace, var_names=["Y_obs"], samples=1000
)
preds = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd = az.hdi(preds, hdi_prob=0.94)
# Back‑transform PM2.5 to original scale
pm25_orig = scaler.inverse_transform(grid)[:,0]
plt.figure(figsize=(8,5))
plt.plot(pm25_orig, mean_pred, label="Posterior mean")
plt.fill_between(pm25_orig, hpd[:,0], hpd[:,1], alpha=0.3,
label="94% credible interval")
plt.scatter(
scaler.inverse_transform(X_test_s)[:,0],
y_test, color="k", alpha=0.5, label="Test data"
)
plt.xlabel("PM2.5 (µg/m³)")
plt.ylabel("Mitigation Cost (USD/day)")
plt.title("Bayesian Regression: Cost vs. PM2.5 Concentration")
plt.legend()
plt.tight_layout()
plt.show()
Summary
This Bayesian Regression pipeline for Air Quality Cost Prediction delivers:
1. Point forecasts of daily pollution‐mitigation cost from real‐time sensor readings.
2. Credible intervals quantifying uncertainty from sensor noise and environmental variability.
3. Actionable insights: environmental agencies can allocate budgets, schedule interventions, and communicate risk bounds to stakeholders—optimising public‐health spending under uncertainty.