Emission Cost Prediction using Bayesian Regression in ML
FREE Online Courses: Elevate Your Skills, Zero Cost Attached - Enroll Now!
Environmental managers and fleet operators need to forecast the regulatory cost of greenhouse‑gas emissions—before budgeting for compliance—using vehicle or process features such as engine size, fuel type, vehicle weight, and annual mileage. Emission costs are computed as CO₂ emitted × carbon tax rate, so small changes in engine characteristics yield nonlinear cost effects (e.g. larger engines emit disproportionately more), and future costs carry uncertainty from fluctuating carbon prices. By applying Bayesian Regression, we not only predict a best estimate of emission cost per vehicle (or process) but also derive credible intervals that capture forecast uncertainty—enabling robust compliance budgeting and cap-and-trade strategy.
Libraries Required
import pandas as pd # data I/O import numpy as np # numerics import matplotlib.pyplot as plt # plotting import seaborn as sns # visualization import pymc3 as pm # Bayesian modeling import arviz as az # posterior analysis from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.metrics import mean_absolute_error
Dataset
Step-by-Step Code Implementation
Data Loading & Cost Computation
Cost Computation: We convert vehicle CO₂ emissions (g/km) into tons per year using a standard mileage assumption, then multiply by the carbon tax rate to obtain the annual cost.
Data Prep:
- Numeric features (Engine HP, Engine Cylinders) are standardised for stable sampling.
- Categorical features (E.g., Fuel Type) are one‑hot encoded to capture discrete effects.
import pandas as pd
# Load raw vehicle data
df = pd.read_csv("data/Car_data.csv")
# Compute annual emission cost: CO2 emissions (g/km) × annual_km × tax_rate ($ per ton)
# Assume average 15,000 km/year and carbon tax of $50 per metric ton (1 ton = 1e6 g)
annual_km = 15000
tax_rate = 50.0 # USD per ton
# Convert g/km to grams per year, then to tons
df["CO2_per_year_tons"] = (df["CO2 Emission g/km"] * annual_km) / 1e6
df["emission_cost"] = df["CO2_per_year_tons"] * tax_rate
# Select predictors and target
features = ["Engine HP", "Engine Cylinders", "Fuel Type", "Transmission", "Vehicle_Class"]
df = pd.get_dummies(df[features + ["emission_cost"]], drop_first=True)
X = df.drop(columns="emission_cost").values
y = df["emission_cost"].values
Train/Test Split & Scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Standardize numeric columns (first two: HP and Cylinders)
scaler = StandardScaler().fit(X_train[:, :2])
X_train_s = X_train.copy()
X_train_s[:, :2] = scaler.transform(X_train[:, :2])
X_test_s = X_test.copy()
X_test_s[:, :2] = scaler.transform(X_test[:, :2])
Define & Fit Bayesian Regression Model
Bayesian Model:
- α (Intercept) and β (Coefficients) have weakly informative Normal priors.
- σ (Residual noise) has a Half‑Normal before enforcing positivity.
- Observed emission_cost is modelled as Normal(μ, σ).
MCMC Sampling: We draw 2,000 posterior samples (plus 1,000 burn-in) with target_accept=0.9 to ensure good convergence.
import pymc3 as pm
with pm.Model() as emission_cost_model:
# Priors
α = pm.Normal("α", mu=0, sigma=100) # baseline cost
β = pm.Normal("β", mu=0, sigma=20, shape=X_train_s.shape[1]) # coefficients
σ = pm.HalfNormal("σ", sigma=50) # noise scale
# Linear predictor
μ = α + pm.math.dot(X_train_s, β)
# Likelihood
Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)
# Sample posterior
trace = pm.sample(
draws=2000, tune=1000,
target_accept=0.9,
return_inferencedata=True
)
Posterior Analysis & Predictions
- Posterior Predictive: Sampling Y_obs yields a full distribution over emission‐cost forecasts, from which we extract the posterior mean and 94% credible intervals.
- Evaluation: Mean Absolute Error (MAE) on held‑out data quantifies point‐forecast accuracy.
import arviz as az
from sklearn.metrics import mean_absolute_error
# Posterior summary
az.summary(trace, round_to=2)
# Posterior predictive sampling
with emission_cost_model:
ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Point predictions using posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
y_pred = α_post + X_test_s.dot(β_post)
# Evaluate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f} per vehicle per year")
Visualise Credible Interval vs. Engine HP
Sweeping Engine HP while holding other features constant, we plot the posterior mean cost curve with its 94% credible band—illustrating both expected cost scaling and uncertainty.
import numpy as np
import matplotlib.pyplot as plt
# Sweep Engine HP from 50 to 400, hold others at median
hp_vals = np.linspace(50, 400, 100)
grid = np.median(X_train_s, axis=0)[None,:].repeat(100, axis=0)
# back-transform hp_vals to standardized scale
hp_std = (hp_vals - scaler.mean_[0]) / scaler.scale_[0]
grid[:,0] = hp_std
with emission_cost_model:
ppc_grid = pm.sample_posterior_predictive(trace, var_names=["Y_obs"], samples=1000)
preds = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd = az.hdi(preds, hdi_prob=0.94)
# Plot
plt.figure(figsize=(8,5))
plt.plot(hp_vals, mean_pred, label="Posterior mean")
plt.fill_between(hp_vals, hpd[:,0], hpd[:,1], alpha=0.3, label="94% CI")
plt.xlabel("Engine HP")
plt.ylabel("Annual Emission Cost (USD)")
plt.title("Emission Cost vs. Engine HP")
plt.legend()
plt.tight_layout()
plt.show()
Summary
This Bayesian Regression pipeline for Emission Cost Prediction provides:
1. Point estimates of annual compliance cost per vehicle from early technical specs.
2. Credible intervals quantifying uncertainty from mileage assumptions and carbon‐price volatility.
3. Actionable insights: fleet managers can budget for carbon tax obligations using upper/lower bounds and evaluate technical trade-offs (e.g., downsizing engines) under uncertainty.