Water Usage Cost Prediction using Bayesian Regression in ML
Utility managers and building operators need to forecast the monthly water‐usage cost—before the billing cycle closes—using early‐month indicators such as total water consumption (m³), average daily temperature (°C), number of billed days, and peak‐day usage. Water tariffs often include block rates (nonlinear per‑unit costs) and fixed service fees, and actual costs vary with weather and consumption patterns. A point‐estimate regression misses this uncertainty, risking budget overruns or shortfalls. By applying Bayesian Regression, we obtain:
1. A point estimate for the expected monthly cost.
2. A credible interval that quantifies forecast uncertainty—enabling data‐driven budgeting, rate‐setting, and demand‐management decisions.
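To make the nonlinearity of block rates concrete, here is a minimal sketch of an increasing block tariff. The block widths, per-unit rates, and fixed fee are hypothetical placeholders, not values from the dataset or from any real utility.

```python
def block_rate_cost(consumption_m3, fixed_fee=12.0,
                    blocks=((10.0, 1.50), (20.0, 2.25), (float("inf"), 3.40))):
    """Return one month's bill under an increasing block tariff.

    Each block is (width_m3, rate_per_m3). All numbers are hypothetical.
    """
    cost = fixed_fee
    remaining = consumption_m3
    for width, rate in blocks:
        used = min(remaining, width)   # volume charged at this block's rate
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return cost

# Cost grows faster than consumption once higher blocks are reached
for m3 in (5, 15, 40):
    print(m3, round(block_rate_cost(m3), 2))
```

Under a tariff like this the average cost per m³ rises with volume, so a linear model in consumption is only an approximation, which is one more reason to report an interval rather than a single number.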
Libraries Required
import pandas as pd                 # data loading & handling
import numpy as np                  # numerical operations
import matplotlib.pyplot as plt     # plotting
import seaborn as sns               # visualization
import pymc3 as pm                  # Bayesian modeling
import arviz as az                  # posterior analysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
Dataset
Water Consumption and Cost (2013–2023)
Step-by-Step Code Implementation
Import Libraries & Load Data
We load monthly water‐usage records with consumption (m³), billing days, average temperature, and billed cost.
import pandas as pd
# Load monthly data
df = pd.read_csv("data/water_consumption_and_cost.csv", parse_dates=["Month"])
df = df.rename(columns={
    "Consumption_m3": "consumption",
    "Cost_USD": "cost",
    "Days": "days",
    "AvgTemp_C": "avg_temp"
})
df.head()
Feature Engineering & Train/Test Split
Features:
- consumption (total m³) captures the scale of use.
- days (number of billed days) adjusts for the billing cycle length.
- avg_temp (°C) proxies for the influence of weather on water use.
StandardScaler: We z‑score features so priors on β are comparable across dimensions.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Define predictors and target
# Features: monthly consumption (m³), billing days, average temp
X = df[["consumption","days","avg_temp"]].values
y = df["cost"].values  # USD
# Chronological split: first 80% months train, last 20% test
split = int(len(df) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Standardize features for stable MCMC
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
Define & Fit Bayesian Regression Model
Priors:
- α ∼ Normal(0, 1 000): broad prior for baseline cost.
- β ∼ Normal(0, 500): reflects moderate uncertainty on each standardised coefficient.
- σ ∼ HalfNormal(500): encodes residual variability in cost.
Model: Linear predictor μ = α + β·X_standardized; observed cost ∼ Normal(μ, σ).
MCMC Sampling: We draw 2,000 posterior samples per chain after 1,000 tuning steps, with target_accept=0.9 for stable inference.
import pymc3 as pm
with pm.Model() as water_model:
    # Register predictors as a data container so pm.set_data can swap them later
    X_data = pm.Data("X", X_train_s)
    # Broad priors reflecting initial uncertainty
    α = pm.Normal("α", mu=0, sigma=1e3)
    β = pm.Normal("β", mu=0, sigma=500, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=500)
    # Linear predictor in standardized space
    μ = α + pm.math.dot(X_data, β)
    # Likelihood: observed monthly cost
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)
    # MCMC sampling
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
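Before fitting, it can help to check what the stated priors imply about cost. The sketch below draws from those priors using only the Python standard library and simulates costs for one standardized feature row. The feature row, the number of draws, and the seed are assumptions chosen for illustration; this is not part of the PyMC3 model itself.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, only for reproducibility

def prior_predictive_cost(x_std, n_draws=5000):
    """Simulate costs implied by the priors alone for one standardized row."""
    sims = []
    for _ in range(n_draws):
        alpha = random.gauss(0, 1000)                  # α ~ Normal(0, 1000)
        beta = [random.gauss(0, 500) for _ in x_std]   # β ~ Normal(0, 500)
        sigma = abs(random.gauss(0, 500))              # σ ~ HalfNormal(500)
        mu = alpha + sum(b * x for b, x in zip(beta, x_std))
        sims.append(random.gauss(mu, sigma))           # cost ~ Normal(μ, σ)
    return sims

# Hypothetical month: consumption 1 SD above average, typical days and temperature
sims = prior_predictive_cost([1.0, 0.0, 0.0])
print(f"prior predictive mean: {statistics.mean(sims):.0f}")
print(f"prior predictive SD:   {statistics.stdev(sims):.0f}")
print(f"share of negative costs: {sum(s < 0 for s in sims) / len(sims):.2f}")
```

Because every prior is centered at zero, about half of the simulated costs are negative. That is harmless when the data are informative, but it suggests one possible refinement: centering the prior for α near a typical monthly bill.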
Posterior Analysis & Point Predictions
- Posterior Predictive: Sampling from Y_obs yields predictive distributions for the training months. Point forecasts for the test months are computed from the posterior means of α and β; 94% Highest Posterior Density intervals are computed in the visualisation step.
- Evaluation: Mean Absolute Error (MAE) quantifies the average deviation of point predictions on held‑out months.
import arviz as az
# Summarize posterior distributions
az.summary(trace, round_to=2)
# Posterior predictive sampling
with water_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Extract posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
# Compute point predictions on the test set
y_pred = α_post + X_test_s.dot(β_post)
# Evaluate mean absolute error
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
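Because the model is fit on z-scored features, each β is the change in cost per one standard deviation of its feature, not per m³, day, or °C. A minimal sketch of converting such coefficients back to original units follows. The means, standard deviations, and coefficient values are made-up placeholders; in the workflow above they would come from scaler.mean_, scaler.scale_, α_post, and β_post.

```python
def to_original_units(alpha_std, beta_std, means, scales):
    """Map intercept and slopes from z-score space back to raw feature units.

    cost = alpha_std + sum(beta_std[j] * (x[j] - means[j]) / scales[j])
         = alpha_raw + sum(beta_raw[j] * x[j])
    """
    beta_raw = [b / s for b, s in zip(beta_std, scales)]
    alpha_raw = alpha_std - sum(b * m for b, m in zip(beta_raw, means))
    return alpha_raw, beta_raw

# Placeholder values for consumption (m³), days, avg_temp (°C)
means = [120.0, 30.0, 18.0]
scales = [40.0, 1.5, 6.0]
alpha_std, beta_std = 300.0, [80.0, 6.0, 12.0]
alpha_raw, beta_raw = to_original_units(alpha_std, beta_std, means, scales)

# Both parameterizations give the same prediction for any raw feature row
x = [150.0, 31.0, 25.0]
pred_std = alpha_std + sum(b * (xi - m) / s
                           for b, xi, m, s in zip(beta_std, x, means, scales))
pred_raw = alpha_raw + sum(b * xi for b, xi in zip(beta_raw, x))
print(round(pred_std, 2), round(pred_raw, 2))
print("USD per extra m³ (placeholder numbers):", beta_raw[0])
```

The same transformation can be applied draw by draw to the posterior samples, which would give credible intervals for the per-unit effects as well.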
Visualise Predictions & Credible Intervals
Varying consumption (holding other features fixed), we plot the posterior mean cost curve and its 94% credible band—illustrating both the expected cost scaling and uncertainty due to model and data variability.
import numpy as np
import matplotlib.pyplot as plt
# Vary consumption; fix days and avg_temp at median
cons_grid = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid = np.tile(np.median(X_train_s, axis=0), (100,1))
grid[:,0] = cons_grid
with water_model:
    pm.set_data({"X": grid})
    ppc_grid = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
preds = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd = az.hdi(preds, hdi_prob=0.94)
# Convert consumption back to original scale
cons_orig = scaler.inverse_transform(grid)[:,0]
plt.figure(figsize=(8,5))
plt.plot(cons_orig, mean_pred, label="Posterior mean")
plt.fill_between(cons_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    scaler.inverse_transform(X_test_s)[:,0], y_test,
    color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Monthly Consumption (m³)")
plt.ylabel("Monthly Cost (USD)")
plt.title("Bayesian Regression: Cost vs. Consumption")
plt.legend()
plt.tight_layout()
plt.show()
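The 94% band is a highest-density interval: the narrowest interval that contains 94% of the draws. The article relies on az.hdi for this. The functions below are only an illustrative standard-library sketch of the idea for a single set of draws, with synthetic right-skewed samples standing in for posterior predictive draws of cost.

```python
import random

def narrowest_interval(samples, prob=0.94):
    """Return the shortest interval spanning a fraction `prob` of the samples."""
    if not 0 < prob < 1:
        raise ValueError("prob must be strictly between 0 and 1")
    xs = sorted(samples)
    span = int(prob * len(xs))         # index offset between the two endpoints
    widths = [xs[i + span] - xs[i] for i in range(len(xs) - span)]
    i_min = widths.index(min(widths))  # window with the smallest width
    return xs[i_min], xs[i_min + span]

def central_interval(samples, prob=0.94):
    """Return the equal-tailed interval, for comparison."""
    xs = sorted(samples)
    span = int(prob * len(xs))
    start = (len(xs) - span) // 2      # cut the same number of draws from each tail
    return xs[start], xs[start + span]

random.seed(1)  # arbitrary seed, only for reproducibility
# Synthetic, right-skewed stand-in for predictive draws of monthly cost
draws = [random.lognormvariate(5.0, 0.4) for _ in range(4000)]

lo, hi = narrowest_interval(draws)
c_lo, c_hi = central_interval(draws)
print(f"highest-density interval: {lo:.1f} to {hi:.1f} (width {hi - lo:.1f})")
print(f"equal-tailed interval:    {c_lo:.1f} to {c_hi:.1f} (width {c_hi - c_lo:.1f})")
```

By construction the highest-density interval is never wider than the equal-tailed one. For symmetric distributions the two nearly coincide; for skewed ones the highest-density interval shifts toward the bulk of the draws.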
Summary
This Bayesian Regression workflow for water‐usage cost forecasting provides:
1. Point estimates of monthly water cost from early‐month consumption, days, and temperature data.
2. Credible intervals that quantify forecast uncertainty—critical for risk‐aware budgeting and demand management.
3. Actionable insights: utility managers and building operators gain both the expected cost and its uncertainty bounds, enabling proactive rate decisions, conservation incentives, and capital planning.