Water Treatment Cost Prediction using Bayesian Regression in ML
FREE Online Courses: Knowledge Awaits – Click for Free Access!
Municipal water‐treatment managers need to forecast monthly treatment costs—before setting utility rates or planning maintenance—using early‐month indicators such as raw‐water volume (m³), average influent turbidity (NTU), chemical dosing rates (kg/day), and operational days. Treatment cost per cubic meter is nonlinear in these drivers (e.g., high turbidity demands exponentially more coagulant) and is subject to uncertainty from fluctuations in raw‐water quality and chemical price volatility. By applying Bayesian Regression, we obtain:
1. A point estimate of the monthly treatment cost.
2. A credible interval that captures forecast uncertainty—enabling risk‑aware rate-setting and budget allocation.
Libraries Required
import pandas as pd # data I/O & manipulation import numpy as np # numerical operations import matplotlib.pyplot as plt # plotting import seaborn as sns # visualization import pymc3 as pm # Bayesian modeling import arviz as az # posterior analysis from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.metrics import mean_absolute_error
Dataset
Water Consumption and Cost (2013–2023)
Step-by-Step Code Implementation
Data Loading & Cost Computation
We load monthly consumption, days, ambient temperature, and billed cost (as a treatment‐cost proxy)
Features:
1. consumption (m³) captures scale,
2. days adjusts for billing‐cycle length,
3. avg_temp proxies energy influences on process cost.
import pandas as pd
# Load monthly data
df = pd.read_csv("data/water_consumption_and_cost.csv", parse_dates=["Month"])
df = df.rename(columns={
"Consumption_m3":"consumption", # raw‐water volume
"Days":"days", # billing days
"Cost_USD":"treatment_cost", # billed cost as proxy
"AvgTemp_C":"avg_temp" # ambient temp (proxy for process energy)
})
df.head()
Feature Engineering & Train/Test Split
We z‑score features so the Bayesian sampler converges efficiently.
from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler # Features & target X = df[["consumption","days","avg_temp"]].values y = df["treatment_cost"].values # USD # Chronological split: first 80% months train, last 20% test split = int(len(df) * 0.8) X_train, X_test = X[:split], X[split:] y_train, y_test = y[:split], y[split:] # Standardize features for stable MCMC sampling scaler = StandardScaler().fit(X_train) X_train_s = scaler.transform(X_train) X_test_s = scaler.transform(X_test)
Define & Fit Bayesian Regression Model
Priors:
- α ∼ Normal(0, 1 000), broad intercept for cost scale,
- β ∼ Normal(0, 500) for each standardized driver,
- σ ∼ HalfNormal(500) for residual variability.
Model: treatment_cost ∼ Normal(α + β·X_standardized, σ).
MCMC: 2,000 posterior draws (after 1,000 tuning) with target_accept=0.9 ensure stable inference.
import pymc3 as pm
with pm.Model() as model:
# Priors: intercept and slopes
α = pm.Normal("α", mu=0, sigma=1e3)
β = pm.Normal("β", mu=0, sigma=500, shape=X_train_s.shape[1])
σ = pm.HalfNormal("σ", sigma=500)
# Linear predictor
μ = α + pm.math.dot(X_train_s, β)
# Likelihood: observed treatment_cost
Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)
# MCMC sampling
trace = pm.sample(
draws=2000,
tune=1000,
target_accept=0.9,
return_inferencedata=True
)
Posterior Analysis & Point Predictions
Posterior Predictive: Sampling Y_obs yields full predictive distributions, from which we compute:
1. Posterior mean point forecasts,
2. 94% Highest Posterior Density intervals quantifying uncertainty.
Evaluation: Mean Absolute Error (MAE) on held‑out months quantifies point‑forecast accuracy.
import arviz as az
from sklearn.metrics import mean_absolute_error
# Posterior summary
az.summary(trace, round_to=2)
# Posterior predictive sampling
with model:
ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Extract posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
# Point predictions on test set
y_pred = α_post + X_test_s.dot(β_post)
# Evaluate
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
Visualise Predictions & Credible Intervals
Sweeping raw‐water volume while holding other features fixed, we plot the treatment‐cost curve with its 94% credible band—illuminating both expected scale effects and uncertainty.
import numpy as np
import matplotlib.pyplot as plt
# Vary consumption; fix days & avg_temp at median
cons_grid = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid = np.tile(np.median(X_train_s, axis=0), (100,1))
grid[:,0] = cons_grid
with model:
ppc_grid = pm.sample_posterior_predictive(trace, var_names=["Y_obs"], samples=1000)
preds = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd = az.hdi(preds, hdi_prob=0.94)
# Back‐transform consumption
cons_orig = scaler.inverse_transform(
np.column_stack([grid[:,0], grid[:,1], grid[:,2]])
)[:,0]
plt.figure(figsize=(8,5))
plt.plot(cons_orig, mean_pred, label="Posterior mean")
plt.fill_between(cons_orig, hpd[:,0], hpd[:,1], alpha=0.3,
label="94% credible interval")
plt.scatter(
scaler.inverse_transform(X_test_s)[:,0], y_test,
color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Raw‐Water Volume (m³)")
plt.ylabel("Monthly Treatment Cost (USD)")
plt.title("Bayesian Regression: Cost vs. Water Volume")
plt.legend()
plt.tight_layout()
plt.show()
Summary
This Bayesian Regression pipeline for Water Treatment Cost Prediction provides:
1. Point estimates of monthly treatment cost from early consumption and quality indicators.
2. Credible intervals quantifying forecast uncertainty from water‐quality fluctuations and cost volatility.
3. Actionable insights: water authorities can set rates, plan chemical procurement, and allocate maintenance budgets with both expected cost and uncertainty bounds—optimising both service reliability and financial resilience.