Air Quality Cost Prediction using Bayesian Regression in ML

FREE Online Courses: Transform Your Career – Enroll for Free!

Municipal environmental agencies need to estimate the per‑day cost of air-quality mitigation measures—before allocating budgets for pollution control—using real-time pollutant readings, such as PM₂.₅, PM₁₀, NO₂, O₃, and SO₂. Mitigation costs (e.g., deploying scrubbers, adjusting traffic flows, issuing health advisories) scale nonlinearly with pollutant concentrations (e.g., costs escalate sharply once PM₂.₅ exceeds health thresholds). They are subject to uncertainty from sensor noise and weather variability. A single point‐estimate model hides this risk. By applying Bayesian Regression, we produce:

1. A point forecast of daily mitigation cost.

2. A credible interval quantifying uncertainty—enabling data‑driven budget allocation and response planning.

Libraries Required

import pandas as pd                              # data loading & manipulation  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # visualization  

import pymc3 as pm                               # Bayesian modeling  
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error 

Dataset

Global Air Quality Dataset

Step-by-Step Code Implementation

Data Loading & Synthetic Cost Computation

We compute a daily mitigation cost by charging $20/µg m³ for PM₂.₅ exceedances above WHO thresholds and $5–$10/µg m³ for other pollutants, resulting in a continuous target that rises sharply as air quality worsens.

import pandas as pd
import numpy as np

# Load global air‐quality data
df = pd.read_csv("data/global_air_quality.csv")

# Select pollutant features and drop missing
pollutants = ['PM2.5', 'PM10', 'NO2', 'O3', 'SO2']
df = df[pollutants].dropna()

# Synthetic mitigation cost: assume $20 per µg/m³ above WHO threshold for PM2.5
# plus $10 per µg/m³ above threshold for other pollutants
thresholds = {'PM2.5':10, 'PM10':20, 'NO2':40, 'O3':100, 'SO2':20}
cost_factors = {'PM2.5':20, 'PM10':10, 'NO2':10, 'O3':5, 'SO2':5}

def compute_cost(row):
    cost = 0
    for p in pollutants:
        exceed = max(row[p] - thresholds[p], 0)
        cost += exceed * cost_factors[p]
    return cost

df['mitigation_cost'] = df.apply(compute_cost, axis=1)

# Features & target
X = df[pollutants].values
y = df['mitigation_cost'].values  # USD per day

Preprocessing & Train/Test Split

We z‑score pollutant features so that Normal(0,100) priors on slopes operate uniformly and the MCMC sampler converges stably.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 80/20 random split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize pollutant features for stable MCMC
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)

Define & Fit Bayesian Regression Model

Priors:

  • α ∼ Normal(0, 1 000) for baseline cost.
  • β ∼ Normal(0, 100) to allow moderate sensitivity per standardised pollutant.
  • σ ∼ HalfNormal(100) for measurement and model error.

Bayesian model: Observed cost ∼ Normal(α + β·X_std, σ).

Inference: We draw 2,000 posterior samples (after 1,000 burn‑in) with target_accept=0.9 for reliable convergence.

import pymc3 as pm

with pm.Model() as aq_cost_model:
    # Priors
    α = pm.Normal("α", mu=0, sigma=1000)                               # intercept
    β = pm.Normal("β", mu=0, sigma=100, shape=X_train_s.shape[1])      # slopes
    σ = pm.HalfNormal("σ", sigma=100)                                  # residual noise

    # Linear predictor
    μ = α + pm.math.dot(X_train_s, β)

    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # MCMC sampling
    trace = pm.sample(
        draws=2000,       # posterior draws
        tune=1000,        # burn‑in
        target_accept=0.9,
        return_inferencedata=True
    )

Posterior Analysis & Point Predictions

  • Posterior predictive: Sampling yields full predictive distributions; we extract both the posterior mean forecast and the 94% Highest Posterior Density interval across PM₂.₅ values.
  • Evaluation: The Mean Absolute Error (MAE) on held‑out test data quantifies the average deviation of our cost forecasts.
import arviz as az
from sklearn.metrics import mean_absolute_error

# Summarize posterior
az.summary(trace, round_to=2)

# Posterior predictive sampling
with aq_cost_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values

# Point predictions on test set
y_pred = α_post + X_test_s.dot(β_post)

# Evaluate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f} per day")

Visualise Predictions & Credible Intervals

By sweeping PM₂.₅ concentration (the primary health concern) while holding other pollutants constant, we plot the expected mitigation cost curve and its credible band—showing how cost escalates with pollution and how uncertain that escalation is.

import numpy as np
import matplotlib.pyplot as plt

# Sweep PM2.5 while holding other pollutants at median
grid = np.median(X_train_s, axis=0)[None,:].repeat(100, axis=0)
pm25_vals = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid[:,0] = pm25_vals

with aq_cost_model:
    ppc_grid = pm.sample_posterior_predictive(
        trace, var_names=["Y_obs"], samples=1000
    )

preds     = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd       = az.hdi(preds, hdi_prob=0.94)

# Back‑transform PM2.5 to original scale
pm25_orig = scaler.inverse_transform(grid)[:,0]

plt.figure(figsize=(8,5))
plt.plot(pm25_orig, mean_pred, label="Posterior mean")
plt.fill_between(pm25_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    scaler.inverse_transform(X_test_s)[:,0],
    y_test, color="k", alpha=0.5, label="Test data"
)
plt.xlabel("PM2.5 (µg/m³)")
plt.ylabel("Mitigation Cost (USD/day)")
plt.title("Bayesian Regression: Cost vs. PM2.5 Concentration")
plt.legend()
plt.tight_layout()
plt.show()

Summary

This Bayesian Regression pipeline for Air Quality Cost Prediction delivers:

1. Point forecasts of daily pollution‐mitigation cost from real‐time sensor readings.

2. Credible intervals quantifying uncertainty from sensor noise and environmental variability.

3. Actionable insights: environmental agencies can allocate budgets, schedule interventions, and communicate risk bounds to stakeholders—optimising public‐health spending under uncertainty.

You give me 15 seconds I promise you best tutorials
Please share your happy experience on Google | Facebook

ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for enginering students.

Leave a Reply

Your email address will not be published. Required fields are marked *