Clinic Operation Cost Prediction using Bayesian Regression in ML
Healthcare administrators need to forecast the daily operational costs of an outpatient clinic—before staffing and supply decisions are made—using early‑day indicators such as patient arrival rate, average consultation time, number of procedures, staff count, and supply usage. Clinic costs scale nonlinearly (e.g., high patient volume may trigger overtime pay, and procedure mix drives supply spikes) and are subject to uncertainty from no‑shows, emergency walk‑ins, and variable supply prices. A single point‑estimate hides this risk. By applying Bayesian Regression, we derive both:
1. A point forecast of expected daily clinic cost.
2. A credible interval quantifying our uncertainty—enabling risk‑aware staffing, supply ordering, and budget planning.
Libraries Required
import pandas as pd                # data loading & manipulation
import numpy as np                 # numerical operations
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # visualization
import pymc3 as pm                 # Bayesian modeling
import arviz as az                 # posterior analysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
Dataset
Medical Facility Operational Data
Step-by-Step Code Implementation
Data Loading & Synthetic Cost Computation
Synthetic target: We translate operational drivers into a daily_cost that combines labour (total consultation minutes billed at $50/hr), procedures ($200 each), staff ($300 per person per day), and supplies ($5/unit).
import pandas as pd
# Load operational data
df = pd.read_csv("data/medical-facility-operational-data.csv")
# Select and rename relevant fields
# Assume columns: 'Date','WalkIn_Count','Scheduled_Count','Avg_Consult_Min',
# 'Procedure_Count','Staff_Count','Supply_Units_Used'
df = df.rename(columns={
    'WalkIn_Count': 'walkins',
    'Scheduled_Count': 'scheduled',
    'Avg_Consult_Min': 'consult_time',
    'Procedure_Count': 'procedures',
    'Staff_Count': 'staff',
    'Supply_Units_Used': 'supplies'
})
# Compute synthetic daily cost:
# - $50 per hour of consultation time (labor): patients x minutes x 50/60
# - $200 per procedure (equipment & supplies)
# - $300 per staff member per day (salary/fringe)
# - $5 per supply unit used
df['daily_cost'] = (
    (df['walkins'] + df['scheduled']) * df['consult_time'] * 50 / 60
    + df['procedures'] * 200
    + df['staff'] * 300
    + df['supplies'] * 5
)
# Features & target
features = ['walkins','scheduled','consult_time','procedures','staff','supplies']
X = df[features].values
y = df['daily_cost'].values # USD/day
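As a quick sanity check on the cost formula, here is the same arithmetic applied to a single day. The input values are hypothetical and chosen only to make the calculation easy to follow; `daily_cost` here is an illustrative helper, not part of the original pipeline.

```python
def daily_cost(walkins, scheduled, consult_time, procedures, staff, supplies):
    """Synthetic daily cost in USD, using the same rates as the DataFrame formula."""
    # Labour: total patients x average minutes per patient, billed at $50/hour
    labour = (walkins + scheduled) * consult_time * 50 / 60
    return labour + procedures * 200 + staff * 300 + supplies * 5

# Hypothetical day: 60 patients at 15 minutes each, 10 procedures,
# 8 staff on shift, 120 supply units
cost = daily_cost(walkins=20, scheduled=40, consult_time=15,
                  procedures=10, staff=8, supplies=120)
print(cost)  # 750 labour + 2000 procedures + 2400 staff + 600 supplies = 5750.0
```

Note that labour depends on the product of patient count and consultation time, so the target is not a purely additive function of the six features.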
Preprocessing & Train/Test Split
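Scaling: Each feature is centred to zero mean and scaled to unit variance, so that a single prior scale on β applies uniformly to every driver and MCMC converges more stably. The scaler is fit on the training split only, so no test-set statistics leak into the model.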
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Random 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Standardize features for stable MCMC
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
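To make the scaling step concrete, the sketch below reproduces by hand what StandardScaler does: it learns the mean and (population) standard deviation from the training rows and applies those same statistics to the test rows. The feature matrix is randomly generated for illustration and is not the clinic dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical raw features: 100 days x 3 drivers on very different scales
X = rng.normal(loc=[30, 15, 8], scale=[10, 4, 2], size=(100, 3))
X_train, X_test = X[:80], X[80:]

# Statistics come from the training rows only
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)          # ddof=0, the convention StandardScaler uses

X_train_s = (X_train - mu) / sd   # exactly zero mean, unit variance
X_test_s = (X_test - mu) / sd     # close to, but not exactly, zero mean

print(X_train_s.mean(axis=0).round(6), X_train_s.std(axis=0).round(6))
```

The test split will generally not have exactly zero mean after this transform, which is expected: it is being measured on the training split's scale.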
Define & Fit Bayesian Regression Model
Priors:
- α ∼ Normal(mean_cost, sd_cost) centers the intercept on observed costs;
- β ∼ Normal(0, sd_cost) is a Normal(0,1) prior on each standardised driver's effect, rescaled to dollars because daily_cost itself is not standardised (a sigma of 1 would allow only about $1 of effect per standard deviation and shrink every coefficient to near zero);
- σ ∼ HalfNormal(sd_cost) keeps the residual noise scale positive.
Model: Observed cost ∼ Normal(α + β·X_std, σ).
Inference: 2,000 posterior draws per chain after 1,000 tuning steps; target_accept=0.9 makes NUTS take smaller steps, which reduces divergences.
import pymc3 as pm
with pm.Model() as clinic_cost_model:
    # Priors (coefficient scale matches the dollar scale of the target)
    α = pm.Normal("α", mu=y_train.mean(), sigma=y_train.std())
    β = pm.Normal("β", mu=0, sigma=y_train.std(), shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=y_train.std())
    # Linear predictor
    μ = α + pm.math.dot(X_train_s, β)
    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)
    # MCMC sampling
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
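To build intuition for what the sampler approximates, the following sketch computes a Gaussian posterior in closed form for a simplified version of the model. It swaps in a different technique from the MCMC fit above: the noise scale is assumed known, which makes the posterior over the intercept and slopes exactly Gaussian. The data, the "true" weights, and the prior widths (all set to the target's standard deviation) are assumptions of this sketch.

```python
import numpy as np

def gaussian_posterior(X, y, prior_mean, prior_sd, noise_sd):
    """Posterior mean and covariance of [intercept, slopes] for a known noise_sd."""
    A = np.column_stack([np.ones(len(X)), X])            # design matrix with intercept
    prior_prec = np.diag(1.0 / np.asarray(prior_sd) ** 2)
    post_cov = np.linalg.inv(prior_prec + A.T @ A / noise_sd**2)
    post_mean = post_cov @ (prior_prec @ prior_mean + A.T @ y / noise_sd**2)
    return post_mean, post_cov

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # stand-in for two standardised drivers
true_w = np.array([5000.0, 800.0, -300.0])     # hypothetical intercept and slopes (USD)
noise_sd = 250.0                               # assumed known in this sketch
y = true_w[0] + X @ true_w[1:] + rng.normal(scale=noise_sd, size=200)

sd_cost = y.std()
prior_mean = np.array([y.mean(), 0.0, 0.0])    # intercept centred on observed costs
prior_sd = np.array([sd_cost, sd_cost, sd_cost])

post_mean, post_cov = gaussian_posterior(X, y, prior_mean, prior_sd, noise_sd)
post_sd = np.sqrt(np.diag(post_cov))
print(post_mean.round(1), post_sd.round(1))
```

With 200 observations the posterior standard deviations are far smaller than the prior widths, which shows how the data dominates a weakly informative prior. MCMC is needed in the full model because σ is unknown and gets its own prior.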
Posterior Analysis & Point Predictions
- Posterior summary: az.summary reports posterior means and 94% Highest Posterior Density intervals for α, β, and σ, along with convergence diagnostics.
- Posterior predictive: Sampling Y_obs yields full predictive distributions for the training days; point forecasts for held-out days use the posterior means of α and β.
- Evaluation: MAE on held-out days quantifies average cost-forecast error.
import arviz as az
from sklearn.metrics import mean_absolute_error
# Summarize the posterior (means, 94% HDIs, r_hat, effective sample size)
print(az.summary(trace, round_to=2))
# Posterior predictive sampling on the training days
with clinic_cost_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Point predictions from the posterior means of α and β
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain", "draw"]).values
y_pred = α_post + X_test_s.dot(β_post)
# Evaluate MAE on the held-out days
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f} per day")
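MAE scores only the point forecast. A complementary check is whether the 94% intervals actually contain about 94% of held-out costs. The sketch below shows one way to compute that with NumPy alone; `hdi_1d` is a hypothetical helper written for this example, and the predictive draws are simulated stand-ins, not output from the clinic model.

```python
import numpy as np

def hdi_1d(samples, prob=0.94):
    """Shortest interval that contains a fraction `prob` of the samples."""
    s = np.sort(np.asarray(samples))
    n = len(s)
    k = int(np.ceil(prob * n))              # number of samples inside the interval
    widths = s[k - 1:] - s[: n - k + 1]     # width of each window of k sorted samples
    i = int(np.argmin(widths))
    return s[i], s[i + k - 1]

rng = np.random.default_rng(1)
n_test, n_draws = 500, 2000
mu = rng.normal(5000, 800, size=n_test)     # hypothetical predictive means (USD)
sigma = 300.0                               # hypothetical predictive noise scale
y_test = mu + rng.normal(0, sigma, size=n_test)
draws = mu + rng.normal(0, sigma, size=(n_draws, n_test))  # stand-in predictive draws

# One interval per held-out day, then the share of days that fall inside it
bounds = np.array([hdi_1d(draws[:, j]) for j in range(n_test)])
covered = (y_test >= bounds[:, 0]) & (y_test <= bounds[:, 1])
coverage = covered.mean()
mae = np.abs(y_test - draws.mean(axis=0)).mean()
print(f"coverage: {coverage:.3f}, MAE: ${mae:.2f}")
```

Coverage well below the nominal level suggests the model is overconfident; coverage well above it suggests the intervals are wider than they need to be.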
Visualise Predictions & Credible Intervals
By varying walk‑in count and holding other features fixed, we plot both the expected cost curve and its credible band—revealing how volume drives cost and our uncertainty around it.
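import numpy as np
import matplotlib.pyplot as plt
import arviz as az
# Sweep walk-ins (column 0) while holding the other features at their training median
grid = np.median(X_train_s, axis=0)[None, :].repeat(100, axis=0)
walkin_vals = np.linspace(X_train_s[:, 0].min(), X_train_s[:, 0].max(), 100)
grid[:, 0] = walkin_vals
# The model was fit on fixed training arrays, so predictions for the grid are
# built directly from the posterior draws of α, β and σ
post = trace.posterior
α_s = post["α"].values.reshape(-1)                  # (n_samples,)
β_s = post["β"].values.reshape(-1, grid.shape[1])   # (n_samples, n_features)
σ_s = post["σ"].values.reshape(-1)                  # (n_samples,)
μ_grid = α_s[:, None] + β_s @ grid.T                # (n_samples, 100)
rng = np.random.default_rng(42)
preds = rng.normal(μ_grid, σ_s[:, None])            # posterior predictive draws
mean_pred = preds.mean(axis=0)
# 94% HDI for each grid point -> array of shape (100, 2)
hpd = np.array([az.hdi(preds[:, j], hdi_prob=0.94) for j in range(preds.shape[1])])
# Back-transform walk-in counts to their original scale
walkin_orig = scaler.inverse_transform(grid)[:, 0]
plt.figure(figsize=(8, 5))
plt.plot(walkin_orig, mean_pred, label="Posterior mean")
plt.fill_between(walkin_orig, hpd[:, 0], hpd[:, 1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    scaler.inverse_transform(X_test_s)[:, 0],
    y_test, color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Daily Walk-In Count")
plt.ylabel("Daily Clinic Cost (USD)")
plt.title("Bayesian Regression: Cost vs. Walk-Ins")
plt.legend()
plt.tight_layout()
plt.show()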
Summary
This Bayesian Regression workflow for Clinic Operation Cost Prediction provides:
1. Point estimates of daily clinic cost from early operational metrics.
2. Credible intervals that quantify forecast uncertainty—crucial for staffing, supply ordering, and budget risk management.
3. Actionable insights: healthcare administrators can allocate resources, set fee schedules, and plan contingencies with explicit cost‑risk bounds—optimising both patient service and financial stewardship.
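As one illustration of how the cost-risk bounds might feed a budgeting decision, the sketch below turns a set of predictive draws for a single day into an expected cost, a budget level that covers 90% of scenarios, and the probability of exceeding a spending cap. The draws, the 90% level, and the $6,500 cap are all hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
# Stand-in for posterior predictive draws of one day's cost (USD)
draws = rng.normal(loc=5750.0, scale=400.0, size=4000)

expected = draws.mean()                   # point forecast
budget_90 = np.quantile(draws, 0.90)      # budget sufficient in 90% of scenarios
overrun_prob = (draws > 6500.0).mean()    # chance of exceeding a hypothetical cap

print(f"expected: ${expected:.0f}, 90% budget: ${budget_90:.0f}, "
      f"P(cost > $6500): {overrun_prob:.3f}")
```

The gap between the expected cost and the 90% budget is the contingency an administrator would hold back for that day under these assumptions.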