Library Usage Cost Prediction using Bayesian Regression in ML

FREE Online Courses: Elevate Skills, Zero Cost. Enroll Now!

Public library systems incur variable operational costs (staff time, materials, utilities) that correlate with patron activity, such as checkouts, computer station sessions, in‑branch visits, and program attendance. Administrators need to forecast the monthly cost of patron services—before finalising their next year’s budget—using early‐month usage metrics. Usage‐cost curves are nonlinear (e.g., high‐volume checkouts trigger overtime staffing, peak computer traffic may require extra licenses) and uncertain due to seasonal demand fluctuations. By applying Bayesian Regression, we obtain both a point estimate of library usage costs and a credible interval quantifying uncertainty—enabling data‑driven budgeting and resource planning.

Libraries Required

import pandas as pd                              # data loading & manipulation  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # visualization  

import pymc3 as pm                               # Bayesian modeling  
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error

Dataset

San Francisco Library Usage

Step-by-Step Code Implementation

Data Loading & Synthetic Cost Computation

  • We assign unit costs to each usage metric—checkouts, visits, computer sessions, program attendance—and sum to form monthly_cost.
import pandas as pd

# Load monthly usage data
df = pd.read_csv("data/sf_library_usage_data.csv")

# Select usage columns and the month index
# (example columns; adjust to actual names)
df = df[['YEAR','MONTH',
         'CHECKOUTS','IN_BRANCH_VISITS',
         'COMPUTER_SESSIONS','PROGRAM_ATTENDANCE']]

# Compute a synthetic cost per activity (USD)
# - $0.10 per checkout
# - $0.05 per in‐branch visit
# - $0.20 per computer session
# - $1.00 per program attendee
df['monthly_cost'] = (
    df['CHECKOUTS'] * 0.10 +
    df['IN_BRANCH_VISITS'] * 0.05 +
    df['COMPUTER_SESSIONS'] * 0.20 +
    df['PROGRAM_ATTENDANCE'] * 1.00
)

# Create a datetime index for time ordering
df['DATE'] = pd.to_datetime(df[['YEAR','MONTH']].assign(DAY=1))
df = df.sort_values('DATE').reset_index(drop=True)

Train/Test Split & Scaling

  • A chronological split ensures training on earlier months and testing on the most recent 20% of the data, mirroring real forecasting tasks.
  • Z‑scores on the usage features stabilize MCMC sampling and make priors on β comparable across metrics.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features & target
X = df[['CHECKOUTS','IN_BRANCH_VISITS','COMPUTER_SESSIONS','PROGRAM_ATTENDANCE']].values
y = df['monthly_cost'].values

# Chronological split: train on first 80% months, test on last 20%
split = int(len(df)*0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Standardize features for stable MCMC sampling
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)

Define & Fit Bayesian Regression Model

Model specification

  • α (intercept) and β (slopes) have Normal priors reflecting broad initial uncertainty.
  • σ has a HalfNormal prior enforcing positive residual variation.
  • Observed costs follow Normal(α + β·X_standardized, σ).

Sampling

We draw 2,000 posterior samples after 1,000 tuning iterations, with target_accept=0.9 to ensure robust convergence.

Posterior predictive

Sampling Y_obs yields full predictive distributions; we extract posterior means for point forecasts and 94% Highest Posterior Density intervals for uncertainty bounds.

import pymc3 as pm

with pm.Model() as lib_cost_model:
    # Priors
    α = pm.Normal("α", mu=0, sigma=100)                         # intercept prior
    β = pm.Normal("β", mu=0, sigma=50, shape=X_train_s.shape[1])# slopes prior
    σ = pm.HalfNormal("σ", sigma=20)                            # noise scale prior

    # Expected monthly cost
    μ = α + pm.math.dot(X_train_s, β)

    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # Sample posterior
    trace = pm.sample(
        draws=2000,      # number of posterior draws
        tune=1000,       # burn‑in (tuning) draws
        target_accept=0.9,
        return_inferencedata=True
    )

Posterior Analysis & Point Predictions

  • Mean Absolute Error (MAE) on held‑out months quantifies average forecast error.
  • Sweeping one feature (checkouts) while holding others fixed, we plot both the posterior mean cost curve and its credible band, enabling intuitive understanding of how usage drives cost—with quantified uncertainty.
import arviz as az
from sklearn.metrics import mean_absolute_error

# Summarize posterior distributions
az.summary(trace, round_to=2)

# Posterior predictive sampling
with lib_cost_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Extract posterior means for α and β
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values

# Point‐estimate predictions on test set
y_pred = α_post + X_test_s.dot(β_post)

# Evaluate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")

Visualise Predictions & Credible Intervals

import numpy as np
import matplotlib.pyplot as plt

# Vary CHECKOUTS while holding other features at median
checkout_grid = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid = np.median(X_train_s, axis=0)[None,:].repeat(100, axis=0)
grid[:,0] = checkout_grid

with lib_cost_model:
    ppc_grid = pm.sample_posterior_predictive(trace,
                                              var_names=["Y_obs"],
                                              samples=1000)

preds     = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd       = az.hdi(preds, hdi_prob=0.94)

# Back‑transform checkout counts
checkout_orig = scaler.inverse_transform(grid)[:,0]

plt.figure(figsize=(8,5))
plt.plot(checkout_orig, mean_pred, label="Posterior mean")
plt.fill_between(checkout_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    scaler.inverse_transform(X_test_s)[:,0],
    y_test, color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Monthly Checkouts")
plt.ylabel("Monthly Cost (USD)")
plt.title("Bayesian Regression: Cost vs. Checkouts")
plt.legend()
plt.tight_layout()
plt.show()

Summary

This Bayesian Regression workflow for Library Usage Cost Prediction provides:

1. Accurate point forecasts of monthly service costs from early usage metrics.

2. Credible intervals that capture uncertainty from seasonal volatility and data noise.

3. Actionable insights: library administrators can budget for staffing and materials with confidence bounds and understand which usage drivers (e.g., checkouts vs. computer sessions) most affect their operational costs.

Did you like our efforts? If Yes, please give ProjectGurukul 5 Stars on Google | Facebook

ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for enginering students.

Leave a Reply

Your email address will not be published. Required fields are marked *