Library Usage Cost Prediction using Bayesian Regression in ML
FREE Online Courses: Elevate Skills, Zero Cost. Enroll Now!
Public library systems incur variable operational costs (staff time, materials, utilities) that correlate with patron activity, such as checkouts, computer station sessions, in‑branch visits, and program attendance. Administrators need to forecast the monthly cost of patron services—before finalising their next year’s budget—using early‐month usage metrics. Usage‐cost curves are nonlinear (e.g., high‐volume checkouts trigger overtime staffing, peak computer traffic may require extra licenses) and uncertain due to seasonal demand fluctuations. By applying Bayesian Regression, we obtain both a point estimate of library usage costs and a credible interval quantifying uncertainty—enabling data‑driven budgeting and resource planning.
Libraries Required
import pandas as pd # data loading & manipulation import numpy as np # numerical operations import matplotlib.pyplot as plt # plotting import seaborn as sns # visualization import pymc3 as pm # Bayesian modeling import arviz as az # posterior analysis from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.metrics import mean_absolute_error
Dataset
Step-by-Step Code Implementation
Data Loading & Synthetic Cost Computation
- We assign unit costs to each usage metric—checkouts, visits, computer sessions, program attendance—and sum to form monthly_cost.
import pandas as pd
# Load monthly usage data
df = pd.read_csv("data/sf_library_usage_data.csv")
# Select usage columns and the month index
# (example columns; adjust to actual names)
df = df[['YEAR','MONTH',
'CHECKOUTS','IN_BRANCH_VISITS',
'COMPUTER_SESSIONS','PROGRAM_ATTENDANCE']]
# Compute a synthetic cost per activity (USD)
# - $0.10 per checkout
# - $0.05 per in‐branch visit
# - $0.20 per computer session
# - $1.00 per program attendee
df['monthly_cost'] = (
df['CHECKOUTS'] * 0.10 +
df['IN_BRANCH_VISITS'] * 0.05 +
df['COMPUTER_SESSIONS'] * 0.20 +
df['PROGRAM_ATTENDANCE'] * 1.00
)
# Create a datetime index for time ordering
df['DATE'] = pd.to_datetime(df[['YEAR','MONTH']].assign(DAY=1))
df = df.sort_values('DATE').reset_index(drop=True)
Train/Test Split & Scaling
- A chronological split ensures training on earlier months and testing on the most recent 20% of the data, mirroring real forecasting tasks.
- Z‑scores on the usage features stabilize MCMC sampling and make priors on β comparable across metrics.
from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler # Features & target X = df[['CHECKOUTS','IN_BRANCH_VISITS','COMPUTER_SESSIONS','PROGRAM_ATTENDANCE']].values y = df['monthly_cost'].values # Chronological split: train on first 80% months, test on last 20% split = int(len(df)*0.8) X_train, X_test = X[:split], X[split:] y_train, y_test = y[:split], y[split:] # Standardize features for stable MCMC sampling scaler = StandardScaler().fit(X_train) X_train_s = scaler.transform(X_train) X_test_s = scaler.transform(X_test)
Define & Fit Bayesian Regression Model
Model specification
- α (intercept) and β (slopes) have Normal priors reflecting broad initial uncertainty.
- σ has a HalfNormal prior enforcing positive residual variation.
- Observed costs follow Normal(α + β·X_standardized, σ).
Sampling
We draw 2,000 posterior samples after 1,000 tuning iterations, with target_accept=0.9 to ensure robust convergence.
Posterior predictive
Sampling Y_obs yields full predictive distributions; we extract posterior means for point forecasts and 94% Highest Posterior Density intervals for uncertainty bounds.
import pymc3 as pm
with pm.Model() as lib_cost_model:
# Priors
α = pm.Normal("α", mu=0, sigma=100) # intercept prior
β = pm.Normal("β", mu=0, sigma=50, shape=X_train_s.shape[1])# slopes prior
σ = pm.HalfNormal("σ", sigma=20) # noise scale prior
# Expected monthly cost
μ = α + pm.math.dot(X_train_s, β)
# Likelihood
Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)
# Sample posterior
trace = pm.sample(
draws=2000, # number of posterior draws
tune=1000, # burn‑in (tuning) draws
target_accept=0.9,
return_inferencedata=True
)
Posterior Analysis & Point Predictions
- Mean Absolute Error (MAE) on held‑out months quantifies average forecast error.
- Sweeping one feature (checkouts) while holding others fixed, we plot both the posterior mean cost curve and its credible band, enabling intuitive understanding of how usage drives cost—with quantified uncertainty.
import arviz as az
from sklearn.metrics import mean_absolute_error
# Summarize posterior distributions
az.summary(trace, round_to=2)
# Posterior predictive sampling
with lib_cost_model:
ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Extract posterior means for α and β
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
# Point‐estimate predictions on test set
y_pred = α_post + X_test_s.dot(β_post)
# Evaluate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
Visualise Predictions & Credible Intervals
import numpy as np
import matplotlib.pyplot as plt
# Vary CHECKOUTS while holding other features at median
checkout_grid = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid = np.median(X_train_s, axis=0)[None,:].repeat(100, axis=0)
grid[:,0] = checkout_grid
with lib_cost_model:
ppc_grid = pm.sample_posterior_predictive(trace,
var_names=["Y_obs"],
samples=1000)
preds = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd = az.hdi(preds, hdi_prob=0.94)
# Back‑transform checkout counts
checkout_orig = scaler.inverse_transform(grid)[:,0]
plt.figure(figsize=(8,5))
plt.plot(checkout_orig, mean_pred, label="Posterior mean")
plt.fill_between(checkout_orig, hpd[:,0], hpd[:,1], alpha=0.3,
label="94% credible interval")
plt.scatter(
scaler.inverse_transform(X_test_s)[:,0],
y_test, color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Monthly Checkouts")
plt.ylabel("Monthly Cost (USD)")
plt.title("Bayesian Regression: Cost vs. Checkouts")
plt.legend()
plt.tight_layout()
plt.show()
Summary
This Bayesian Regression workflow for Library Usage Cost Prediction provides:
1. Accurate point forecasts of monthly service costs from early usage metrics.
2. Credible intervals that capture uncertainty from seasonal volatility and data noise.
3. Actionable insights: library administrators can budget for staffing and materials with confidence bounds and understand which usage drivers (e.g., checkouts vs. computer sessions) most affect their operational costs.