Water Treatment Cost Prediction using Bayesian Regression in ML

FREE Online Courses: Knowledge Awaits – Click for Free Access!

Municipal water‐treatment managers need to forecast monthly treatment costs—before setting utility rates or planning maintenance—using early‐month indicators such as raw‐water volume (m³), average influent turbidity (NTU), chemical dosing rates (kg/day), and operational days. Treatment cost per cubic meter is nonlinear in these drivers (e.g., high turbidity demands exponentially more coagulant) and is subject to uncertainty from fluctuations in raw‐water quality and chemical price volatility. By applying Bayesian Regression, we obtain:

1. A point estimate of the monthly treatment cost.

2. A credible interval that captures forecast uncertainty—enabling risk‑aware rate-setting and budget allocation.

Libraries Required

import pandas as pd                              # data I/O & manipulation  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # visualization  

import pymc3 as pm                               # Bayesian modeling  
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error  

Dataset

Water Consumption and Cost (2013–2023)

Step-by-Step Code Implementation

Data Loading & Cost Computation

We load monthly consumption, days, ambient temperature, and billed cost (as a treatment‐cost proxy)

Features:

1. consumption (m³) captures scale,

2. days adjusts for billing‐cycle length,

3. avg_temp proxies energy influences on process cost.

import pandas as pd

# Load monthly data
df = pd.read_csv("data/water_consumption_and_cost.csv", parse_dates=["Month"])
df = df.rename(columns={
    "Consumption_m3":"consumption",   # raw‐water volume
    "Days":"days",                    # billing days
    "Cost_USD":"treatment_cost",      # billed cost as proxy
    "AvgTemp_C":"avg_temp"            # ambient temp (proxy for process energy)
})
df.head()

Feature Engineering & Train/Test Split

We z‑score features so the Bayesian sampler converges efficiently.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features & target
X = df[["consumption","days","avg_temp"]].values
y = df["treatment_cost"].values  # USD

# Chronological split: first 80% months train, last 20% test
split = int(len(df) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Standardize features for stable MCMC sampling
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)

Define & Fit Bayesian Regression Model

Priors:

  • α ∼ Normal(0, 1 000), broad intercept for cost scale,
  • β ∼ Normal(0, 500) for each standardized driver,
  • σ ∼ HalfNormal(500) for residual variability.

Model: treatment_cost ∼ Normal(α + β·X_standardized, σ).

MCMC: 2,000 posterior draws (after 1,000 tuning) with target_accept=0.9 ensure stable inference.

import pymc3 as pm

with pm.Model() as model:
    # Priors: intercept and slopes
    α = pm.Normal("α", mu=0, sigma=1e3)
    β = pm.Normal("β", mu=0, sigma=500, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=500)

    # Linear predictor
    μ = α + pm.math.dot(X_train_s, β)

    # Likelihood: observed treatment_cost
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # MCMC sampling
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )

Posterior Analysis & Point Predictions

Posterior Predictive: Sampling Y_obs yields full predictive distributions, from which we compute:

1. Posterior mean point forecasts,

2. 94% Highest Posterior Density intervals quantifying uncertainty.

Evaluation: Mean Absolute Error (MAE) on held‑out months quantifies point‑forecast accuracy.

import arviz as az
from sklearn.metrics import mean_absolute_error

# Posterior summary
az.summary(trace, round_to=2)

# Posterior predictive sampling
with model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Extract posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values

# Point predictions on test set
y_pred = α_post + X_test_s.dot(β_post)

# Evaluate
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")

Visualise Predictions & Credible Intervals

Sweeping raw‐water volume while holding other features fixed, we plot the treatment‐cost curve with its 94% credible band—illuminating both expected scale effects and uncertainty.

import numpy as np
import matplotlib.pyplot as plt

# Vary consumption; fix days & avg_temp at median
cons_grid = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid = np.tile(np.median(X_train_s, axis=0), (100,1))
grid[:,0] = cons_grid

with model:
    ppc_grid = pm.sample_posterior_predictive(trace, var_names=["Y_obs"], samples=1000)

preds     = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd       = az.hdi(preds, hdi_prob=0.94)

# Back‐transform consumption
cons_orig = scaler.inverse_transform(
    np.column_stack([grid[:,0], grid[:,1], grid[:,2]])
)[:,0]

plt.figure(figsize=(8,5))
plt.plot(cons_orig, mean_pred, label="Posterior mean")
plt.fill_between(cons_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    scaler.inverse_transform(X_test_s)[:,0], y_test,
    color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Raw‐Water Volume (m³)")
plt.ylabel("Monthly Treatment Cost (USD)")
plt.title("Bayesian Regression: Cost vs. Water Volume")
plt.legend()
plt.tight_layout()
plt.show()

Summary

This Bayesian Regression pipeline for Water Treatment Cost Prediction provides:

1. Point estimates of monthly treatment cost from early consumption and quality indicators.

2. Credible intervals quantifying forecast uncertainty from water‐quality fluctuations and cost volatility.

3. Actionable insights: water authorities can set rates, plan chemical procurement, and allocate maintenance budgets with both expected cost and uncertainty bounds—optimising both service reliability and financial resilience.

Did you like our efforts? If Yes, please give ProjectGurukul 5 Stars on Google | Facebook

ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for enginering students.

Leave a Reply

Your email address will not be published. Required fields are marked *