Water Usage Cost Prediction using Bayesian Regression in ML


Utility managers and building operators need to forecast the monthly water‐usage cost—before the billing cycle closes—using early‐month indicators such as total water consumption (m³), average daily temperature (°C), number of billed days, and peak‐day usage. Water tariffs often include block rates (nonlinear per‑unit costs) and fixed service fees, and actual costs vary with weather and consumption patterns. A point‐estimate regression misses this uncertainty, risking budget overruns or shortfalls. By applying Bayesian Regression, we obtain:

1. A point estimate for the expected monthly cost.

2. A credible interval that quantifies forecast uncertainty—enabling data‐driven budgeting, rate‐setting, and demand‐management decisions.
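
To make the block-rate idea concrete, here is a minimal sketch of how a tiered tariff with a fixed service fee turns consumption into cost. The function name `block_rate_cost`, the block limits, the rates, and the fee are all hypothetical values chosen for illustration; they do not come from the dataset or from any real utility.

```python
def block_rate_cost(consumption_m3, fixed_fee, blocks):
    """Cost under an increasing block tariff (illustrative only).

    blocks: list of (upper_limit_m3, rate_per_m3) pairs in ascending order;
            the last upper limit may be float("inf").
    """
    cost = fixed_fee
    lower = 0.0
    for upper, rate in blocks:
        if consumption_m3 <= lower:
            break  # nothing left to bill in this or higher blocks
        billed = min(consumption_m3, upper) - lower  # volume falling in this block
        cost += billed * rate
        lower = upper
    return cost

# Hypothetical tariff: 5 USD fixed fee, then 1, 2, and 4 USD per m³
HYPOTHETICAL_FIXED_FEE = 5.0
HYPOTHETICAL_BLOCKS = [(10.0, 1.0), (30.0, 2.0), (float("inf"), 4.0)]

cost_25 = block_rate_cost(25.0, HYPOTHETICAL_FIXED_FEE, HYPOTHETICAL_BLOCKS)
cost_40 = block_rate_cost(40.0, HYPOTHETICAL_FIXED_FEE, HYPOTHETICAL_BLOCKS)
print(cost_25, cost_40)
```

Because the marginal rate changes at each block boundary, cost is piecewise linear in consumption. A single linear term can only approximate that shape, which is one more reason to report an interval alongside the point forecast.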

Libraries Required

import pandas as pd                              # data loading & handling  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # visualization  

import pymc3 as pm                               # Bayesian modeling  
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error

Dataset

Water Consumption and Cost (2013–2023)

Step-by-Step Code Implementation

Import Libraries & Load Data

We load monthly water‐usage records with consumption (m³), billing days, average temperature, and billed cost.

import pandas as pd

# Load monthly data
df = pd.read_csv("data/water_consumption_and_cost.csv", parse_dates=["Month"])
df = df.rename(columns={
    "Consumption_m3":"consumption",
    "Cost_USD":"cost",
    "Days":"days",
    "AvgTemp_C":"avg_temp"
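})
# Sort chronologically; the train/test split below relies on row order
df = df.sort_values("Month").reset_index(drop=True)
df.head()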

Feature Engineering & Train/Test Split

Features:

consumption (total m³) captures the scale of use.
days (number of billed days) adjusts for the billing cycle length.
avg_temp (°C) proxies for the influence of weather on water use.

StandardScaler: We z‑score features so priors on β are comparable across dimensions.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define predictors and target
# Features: monthly consumption (m³), billing days, average temp
X = df[["consumption","days","avg_temp"]].values
y = df["cost"].values  # USD

# Chronological split: first 80% months train, last 20% test
split = int(len(df) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Standardize features for stable MCMC
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)
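
The z-scoring step can be reproduced by hand, which makes it clear that the test set is scaled with the training statistics only. The sketch below uses made-up numbers, not the article's dataset; the `_demo` names are hypothetical. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for consumption (m³), billed days, avg temp (°C)
X_train_demo = np.column_stack([
    rng.normal(500, 120, size=40),
    rng.integers(28, 32, size=40),
    rng.normal(18, 6, size=40),
])
X_test_demo = np.column_stack([
    rng.normal(520, 120, size=10),
    rng.integers(28, 32, size=10),
    rng.normal(20, 6, size=10),
])

# Fit the scaling on the training rows only
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # ddof=0

# Apply the same mean and std to both splits
X_train_z = (X_train_demo - train_mean) / train_std
X_test_z = (X_test_demo - train_mean) / train_std

print(X_train_z.mean(axis=0).round(6), X_train_z.std(axis=0).round(6))
```

The training columns end up with mean 0 and standard deviation 1. The test columns generally do not, because they are expressed relative to the training distribution.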

Define & Fit Bayesian Regression Model

Priors:

α ∼ Normal(0, 1000): broad prior on the baseline cost (the second argument is the standard deviation, in USD).
β ∼ Normal(0, 500): moderate uncertainty on each standardised coefficient (USD per one standard deviation of the feature).
σ ∼ HalfNormal(500): residual variability in cost.

Model: Linear predictor μ = α + β·X_standardized; observed cost ∼ Normal(μ, σ).
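
One way to see what these priors imply before any data is used is to simulate from them. The sketch below draws α, β, and σ from the stated priors with plain NumPy and pushes them through the linear predictor for a single hypothetical standardized month (`x_new` is an assumption, not a row from the dataset).

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws, n_features = 1000, 3

# Draws from the priors listed above
alpha = rng.normal(0.0, 1000.0, size=n_draws)
beta = rng.normal(0.0, 500.0, size=(n_draws, n_features))
sigma = np.abs(rng.normal(0.0, 500.0, size=n_draws))  # HalfNormal(500) = |Normal(0, 500)|

# Hypothetical standardized features: consumption, days, avg_temp
x_new = np.array([1.0, 0.0, -0.5])

mu = alpha + beta @ x_new          # linear predictor, one value per draw
prior_cost = rng.normal(mu, sigma)  # simulated cost under the priors alone

low, high = np.percentile(prior_cost, [3, 97])
print(f"Prior 94% range: {low:.0f} to {high:.0f} USD")
print(f"Share of negative simulated costs: {(prior_cost < 0).mean():.2f}")
```

The simulated range is very wide and includes negative costs, which shows how weakly informative these priors are. If typical bill sizes are known in advance, centering α near that level would be a reasonable refinement.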

MCMC Sampling: We draw 2,000 posterior samples per chain after 1,000 tuning steps. Setting target_accept=0.9 makes the sampler take smaller steps, which tends to reduce divergences.

import pymc3 as pm

with pm.Model() as water_model:
    # Broad priors reflecting initial uncertainty
    α = pm.Normal("α", mu=0, sigma=1e3)
    β = pm.Normal("β", mu=0, sigma=500, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=500)

    # Linear predictor in standardized space
    μ = α + pm.math.dot(X_train_s, β)

    # Likelihood: observed monthly cost
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # MCMC sampling
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
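
Once sampling finishes, it is worth checking that the chains agree with each other before trusting the posterior. The sketch below computes the classic Gelman-Rubin R-hat by hand on synthetic chains, purely to show what the statistic measures. The helper name `gelman_rubin_rhat` is hypothetical, and ArviZ reports a rank-normalized split variant by default, so its numbers will differ slightly from this simplified formula.

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Classic (non-split) R-hat for one scalar parameter.

    chains: array of shape (n_chains, n_draws).
    """
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    within = chains.var(axis=1, ddof=1).mean()            # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)   # B: scaled variance of chain means
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(var_hat / within))

rng = np.random.default_rng(7)

# Four synthetic chains that all sample the same distribution
mixed = rng.normal(0.0, 1.0, size=(4, 2000))
# Same chains, but two of them shifted to a different region
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print(f"well-mixed: {rhat_mixed:.3f}, disagreeing chains: {rhat_stuck:.3f}")
```

Values close to 1.0 indicate that between-chain and within-chain variability match. Values well above 1.0, as in the shifted example, suggest the chains have not converged to the same distribution.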

Posterior Analysis & Point Predictions

Posterior Predictive: Sampling from Y_obs yields predictive distributions for the training months, which serves as a check on model fit. Point forecasts for the test set use the posterior means of α and β; the 94% Highest Density Intervals (HDI) are computed in the visualisation step.
Evaluation: Mean Absolute Error (MAE) quantifies the average deviation of point predictions on held‑out months.

import arviz as az

# Summarize posterior distributions
az.summary(trace, round_to=2)

# Posterior predictive sampling on the training months (a fit check;
# the test-set forecasts below use posterior means instead)
with water_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Extract posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values

# Compute point predictions on the test set
y_pred = α_post + X_test_s.dot(β_post)

# Evaluate mean absolute error
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
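
The point predictions above carry no interval on their own. As a rough illustration of how a 94% HDI and the MAE are obtained from predictive draws, the sketch below works on synthetic draws for three held-out months. The helper `hdi_1d` is hypothetical and assumes a unimodal sample; all numbers are made up.

```python
import numpy as np

def hdi_1d(samples, prob=0.94):
    """Narrowest interval containing roughly `prob` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.floor(prob * n))     # index offset spanned by the interval
    widths = x[k:] - x[: n - k]     # width of every candidate interval
    i = int(np.argmin(widths))      # start of the narrowest one
    return x[i], x[i + k]

rng = np.random.default_rng(1)

# Synthetic predictive draws: 4000 draws for each of 3 held-out months
draws = rng.normal([120.0, 150.0, 180.0], 10.0, size=(4000, 3))
y_true_demo = np.array([118.0, 155.0, 171.0])  # made-up billed costs

point = draws.mean(axis=0)  # point forecast per month
intervals = np.array([hdi_1d(draws[:, j]) for j in range(draws.shape[1])])
mae_demo = np.mean(np.abs(y_true_demo - point))

for p, (lo, hi) in zip(point, intervals):
    print(f"forecast {p:.1f} USD, 94% HDI [{lo:.1f}, {hi:.1f}]")
print(f"MAE: {mae_demo:.2f} USD")
```

Reporting the interval next to the MAE helps separate two questions: how far off the point forecast is on average, and how wide a budget margin the model itself suggests.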

Visualise Predictions & Credible Intervals

Varying consumption while holding the other features at their training medians, we plot the posterior mean cost curve and its 94% predictive band. The band reflects both parameter uncertainty and the residual noise σ.

import numpy as np
import matplotlib.pyplot as plt

# Vary consumption; fix days and avg_temp at median
cons_grid = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid = np.tile(np.median(X_train_s, axis=0), (100,1))
grid[:,0] = cons_grid

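# The model holds X_train_s as a constant (no pm.Data container named "X"),
# so pm.set_data cannot swap in the grid; simulate predictive draws directly
# from the posterior samples instead.
post  = trace.posterior
α_s   = post["α"].values.reshape(-1)                  # (n_samples,)
β_s   = post["β"].values.reshape(-1, grid.shape[1])   # (n_samples, 3)
σ_s   = post["σ"].values.reshape(-1)                  # (n_samples,)
μ_s   = α_s[:, None] + β_s @ grid.T                   # (n_samples, 100)
preds = np.random.default_rng(0).normal(μ_s, σ_s[:, None])

mean_pred = preds.mean(axis=0)
hpd       = az.hdi(preds, hdi_prob=0.94)  # 2-D input is read as (draw, point)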

# Convert consumption back to original scale
cons_orig = scaler.inverse_transform(grid)[:,0]

plt.figure(figsize=(8,5))
plt.plot(cons_orig, mean_pred, label="Posterior mean")
plt.fill_between(cons_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    scaler.inverse_transform(X_test_s)[:,0], y_test,
    color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Monthly Consumption (m³)")
plt.ylabel("Monthly Cost (USD)")
plt.title("Bayesian Regression: Cost vs. Consumption")
plt.legend()
plt.tight_layout()
plt.show()

Summary

This Bayesian Regression workflow for water‐usage cost forecasting provides:

1. Point estimates of monthly water cost from early‐month consumption, days, and temperature data.

2. Credible intervals that quantify forecast uncertainty—critical for risk‐aware budgeting and demand management.

3. Actionable insights: utility managers and building operators gain both the expected cost and its uncertainty bounds, enabling proactive rate decisions, conservation incentives, and capital planning.
