Pollution Control Cost Prediction using Bayesian Regression in ML

FREE Online Courses: Transform Your Career – Enroll for Free!

Environmental agencies and industrial facility managers need to forecast the daily cost of air‑pollution control measures—before deploying scrubbers, absorbers, or process adjustments—using early readings of PM₂.₅, PM₁₀, NO₂, SO₂, and O₃ concentrations. Control costs escalate nonlinearly once pollutant levels exceed regulatory thresholds (e.g., emergency retrofits for PM₂.₅ spikes) and vary with weather and fuel prices. A simple point forecast ignores this volatility, risking budget overruns or under‑preparedness. Applying Bayesian Regression lets us generate both a point estimate of the mitigation cost and a credible interval quantifying our uncertainty—enabling data‑driven budgeting and rapid response planning.

Libraries Required

import pandas as pd                              # data loading & manipulation  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # visualization  

import pymc3 as pm                               # Bayesian modeling  
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error

Dataset

Global Air Quality Dataset

Step-by-Step Code Implementation

Data Loading & Synthetic Cost Computation

We compute mitigation_cost by applying a unit rate per µg/m³ above regulatory thresholds.

import pandas as pd

# Load pollutant data and drop rows with any missing values
df = pd.read_csv("data/global_air_quality.csv")[['PM2.5','PM10','NO2','SO2','O3']].dropna()

# Define regulatory thresholds (µg/m³) and per-unit control cost ($/µg)
thresholds = {'PM2.5':12, 'PM10':50, 'NO2':100, 'SO2':75, 'O3':70}
unit_costs =   {'PM2.5':0.50, 'PM10':0.30, 'NO2':0.40, 'SO2':0.35, 'O3':0.25}

# Compute daily mitigation cost: sum of (exceedance × unit cost)
def daily_cost(row):
    cost = 0
    for p in thresholds:
        exceed = max(row[p] - thresholds[p], 0)
        cost += exceed * unit_costs[p]
    return cost

df['mitigation_cost'] = df.apply(daily_cost, axis=1)

# Prepare features and target
X = df[['PM2.5','PM10','NO2','SO2','O3']].values
y = df['mitigation_cost'].values

 Preprocessing & Train/Test Split

Standardisation: Z‑scoring pollutant features ensures uniform priors and stable MCMC convergence.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split 80% train / 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Standardize pollutant concentrations for stable MCMC
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)

Define & Fit Bayesian Regression Model

Priors:

α centred on the empirical mean cost ±2 σ,
β ∼ Normal(0,1) for each standardised pollutant,
σ ∼ HalfNormal(std of cost) reflecting noise.

Model: We assume mitigation_cost ∼ Normal(α + β·X_std, σ).

Sampling: We draw 2,000 posterior samples after 1,000 tuning steps, with target_accept=0.9 to ensure robust convergence.

import pymc3 as pm

with pm.Model() as model:
    # Priors
    α = pm.Normal("α", mu=y_train.mean(), sigma=y_train.std()*2)
    β = pm.Normal("β", mu=0, sigma=1, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=y_train.std())

    # Linear predictor
    μ = α + pm.math.dot(X_train_s, β)

    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # Sample posterior
    trace = pm.sample(
        draws=2000, tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )

Posterior Analysis & Point Predictions

Posterior Predictive: Sampling Y_obs yields full predictive distributions; we extract point forecasts (posterior mean) and 94% Highest Posterior Density intervals.
Evaluation: Mean Absolute Error on held‑out data quantifies the average prediction error in USD/day.

import arviz as az
from sklearn.metrics import mean_absolute_error

# Summarize parameter posteriors
az.summary(trace, round_to=2)

# Posterior predictive sampling
with model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Compute posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values

# Point predictions
y_pred = α_post + X_test_s.dot(β_post)

# Evaluate performance
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f} per day")

Visualise Predictions & Credible Intervals

Sweeping PM₂.₅ concentration and holding other pollutants constant, we plot both the expected cost curve and its credible band—revealing how control costs escalate with pollution and the associated uncertainty.

import numpy as np
import matplotlib.pyplot as plt

# Sweep PM2.5; hold other pollutants at median
grid = np.median(X_train_s, axis=0)[None,:].repeat(100, axis=0)
pm25_vals = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid[:,0] = pm25_vals

with model:
    ppc_grid = pm.sample_posterior_predictive(trace, var_names=["Y_obs"], samples=1000)

preds     = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd       = az.hdi(preds, hdi_prob=0.94)

# Inverse‐transform PM2.5
pm25_orig = scaler.inverse_transform(grid)[:,0]

plt.figure(figsize=(8,5))
plt.plot(pm25_orig, mean_pred, label="Posterior mean")
plt.fill_between(pm25_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    scaler.inverse_transform(X_test_s)[:,0],
    y_test, color="k", alpha=0.5, label="Test data"
)
plt.xlabel("PM₂.₅ Concentration (µg/m³)")
plt.ylabel("Daily Mitigation Cost (USD)")
plt.title("Bayesian Regression: Cost vs. PM₂.₅")
plt.legend()
plt.tight_layout()
plt.show()

Summary

This Bayesian Regression workflow for Pollution Control Cost Prediction delivers:

1. Accurate point estimates of daily mitigation expense from real‑time pollutant readings.

2. Credible intervals that quantify forecasting uncertainty—critical for budget risk management.

3. Actionable insights: environmental agencies can allocate resources and schedule interventions with explicit cost‑risk bounds—optimising both public‑health outcomes and financial resilience.

Your opinion matters
Please write your valuable feedback about ProjectGurukul on Google | Facebook

Pollution Control Cost Prediction using Bayesian Regression in ML

Libraries Required

Dataset

Step-by-Step Code Implementation

Data Loading & Synthetic Cost Computation

Preprocessing & Train/Test Split

Define & Fit Bayesian Regression Model

Posterior Analysis & Point Predictions

Visualise Predictions & Credible Intervals