Pollution Control Cost Prediction using Bayesian Regression in ML
We offer you a brighter future with FREE online courses - Start Now!!
Environmental agencies and industrial facility managers need to forecast the daily cost of air‑pollution control measures—before deploying scrubbers, absorbers, or process adjustments—using early readings of PM₂.₅, PM₁₀, NO₂, SO₂, and O₃ concentrations. Control costs escalate nonlinearly once pollutant levels exceed regulatory thresholds (e.g., emergency retrofits for PM₂.₅ spikes) and vary with weather and fuel prices. A simple point forecast ignores this volatility, risking budget overruns or under‑preparedness. Applying Bayesian Regression lets us generate both a point estimate of the mitigation cost and a credible interval quantifying our uncertainty—enabling data‑driven budgeting and rapid response planning.
Libraries Required
import pandas as pd # data loading & manipulation import numpy as np # numerical operations import matplotlib.pyplot as plt # plotting import seaborn as sns # visualization import pymc3 as pm # Bayesian modeling import arviz as az # posterior analysis from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.metrics import mean_absolute_error
Dataset
Step-by-Step Code Implementation
Data Loading & Synthetic Cost Computation
We compute mitigation_cost by applying a unit rate per µg/m³ above regulatory thresholds.
import pandas as pd
# Load pollutant data and drop rows with any missing values
df = pd.read_csv("data/global_air_quality.csv")[['PM2.5','PM10','NO2','SO2','O3']].dropna()
# Define regulatory thresholds (µg/m³) and per-unit control cost ($/µg)
thresholds = {'PM2.5':12, 'PM10':50, 'NO2':100, 'SO2':75, 'O3':70}
unit_costs = {'PM2.5':0.50, 'PM10':0.30, 'NO2':0.40, 'SO2':0.35, 'O3':0.25}
# Compute daily mitigation cost: sum of (exceedance × unit cost)
def daily_cost(row):
cost = 0
for p in thresholds:
exceed = max(row[p] - thresholds[p], 0)
cost += exceed * unit_costs[p]
return cost
df['mitigation_cost'] = df.apply(daily_cost, axis=1)
# Prepare features and target
X = df[['PM2.5','PM10','NO2','SO2','O3']].values
y = df['mitigation_cost'].values
Preprocessing & Train/Test Split
Standardisation: Z‑scoring pollutant features ensures uniform priors and stable MCMC convergence.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split 80% train / 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=0
)
# Standardize pollutant concentrations for stable MCMC
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
Define & Fit Bayesian Regression Model
Priors:
- α centred on the empirical mean cost ±2 σ,
- β ∼ Normal(0,1) for each standardised pollutant,
- σ ∼ HalfNormal(std of cost) reflecting noise.
Model: We assume mitigation_cost ∼ Normal(α + β·X_std, σ).
Sampling: We draw 2,000 posterior samples after 1,000 tuning steps, with target_accept=0.9 to ensure robust convergence.
import pymc3 as pm
with pm.Model() as model:
# Priors
α = pm.Normal("α", mu=y_train.mean(), sigma=y_train.std()*2)
β = pm.Normal("β", mu=0, sigma=1, shape=X_train_s.shape[1])
σ = pm.HalfNormal("σ", sigma=y_train.std())
# Linear predictor
μ = α + pm.math.dot(X_train_s, β)
# Likelihood
Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)
# Sample posterior
trace = pm.sample(
draws=2000, tune=1000,
target_accept=0.9,
return_inferencedata=True
)
Posterior Analysis & Point Predictions
- Posterior Predictive: Sampling Y_obs yields full predictive distributions; we extract point forecasts (posterior mean) and 94% Highest Posterior Density intervals.
- Evaluation: Mean Absolute Error on held‑out data quantifies the average prediction error in USD/day.
import arviz as az
from sklearn.metrics import mean_absolute_error
# Summarize parameter posteriors
az.summary(trace, round_to=2)
# Posterior predictive sampling
with model:
ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Compute posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
# Point predictions
y_pred = α_post + X_test_s.dot(β_post)
# Evaluate performance
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f} per day")
Visualise Predictions & Credible Intervals
Sweeping PM₂.₅ concentration and holding other pollutants constant, we plot both the expected cost curve and its credible band—revealing how control costs escalate with pollution and the associated uncertainty.
import numpy as np
import matplotlib.pyplot as plt
# Sweep PM2.5; hold other pollutants at median
grid = np.median(X_train_s, axis=0)[None,:].repeat(100, axis=0)
pm25_vals = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid[:,0] = pm25_vals
with model:
ppc_grid = pm.sample_posterior_predictive(trace, var_names=["Y_obs"], samples=1000)
preds = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd = az.hdi(preds, hdi_prob=0.94)
# Inverse‐transform PM2.5
pm25_orig = scaler.inverse_transform(grid)[:,0]
plt.figure(figsize=(8,5))
plt.plot(pm25_orig, mean_pred, label="Posterior mean")
plt.fill_between(pm25_orig, hpd[:,0], hpd[:,1], alpha=0.3,
label="94% credible interval")
plt.scatter(
scaler.inverse_transform(X_test_s)[:,0],
y_test, color="k", alpha=0.5, label="Test data"
)
plt.xlabel("PM₂.₅ Concentration (µg/m³)")
plt.ylabel("Daily Mitigation Cost (USD)")
plt.title("Bayesian Regression: Cost vs. PM₂.₅")
plt.legend()
plt.tight_layout()
plt.show()
Summary
This Bayesian Regression workflow for Pollution Control Cost Prediction delivers:
1. Accurate point estimates of daily mitigation expense from real‑time pollutant readings.
2. Credible intervals that quantify forecasting uncertainty—critical for budget risk management.
3. Actionable insights: environmental agencies can allocate resources and schedule interventions with explicit cost‑risk bounds—optimising both public‑health outcomes and financial resilience.