Athlete Training Cost Prediction using Bayesian Regression in ML
Sports program directors and athletic departments need to forecast each athlete’s monthly training cost—before the training cycle begins—using early indicators such as planned training hours, average session intensity, coach‑to‑athlete ratio, and nutrition support level. Training costs per athlete grow nonlinearly with training load (e.g., bulk‑hour discounts at high volume, surcharges for high‑intensity coaching). They are subject to uncertainty from scheduling changes and variable resource rates. A single point‐estimate model masks this uncertainty, risking budget overruns or underutilization. By applying Bayesian Regression, we obtain:
1. A point estimate of the expected monthly training cost.
2. A credible interval that quantifies forecast uncertainty—enabling data‐driven budgeting, staffing, and resource allocation.
Libraries Required
import pandas as pd                                   # data loading & manipulation
import numpy as np                                    # numerical operations
import matplotlib.pyplot as plt                       # plotting
import seaborn as sns                                 # visualization
import pymc3 as pm                                    # Bayesian modeling
import arviz as az                                    # posterior analysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
Dataset
Step-by-Step Code Implementation
Import Libraries & Load Data
import pandas as pd
# Load the training‐cost data
df = pd.read_csv("data/sports-training-dataset/training_costs.csv")
# Preview columns
df[['planned_hours', 'avg_intensity', 'coach_ratio',
    'nutrition_level', 'training_cost']].head()
Preprocessing & Train/Test Split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Define features and target
feature_cols = ['planned_hours','avg_intensity','coach_ratio','nutrition_level']
X = df[feature_cols].values
y = df['training_cost'].values # monthly cost in USD
# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Standardize features for stable MCMC sampling
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
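To make the scaling step concrete, here is a minimal NumPy sketch of what the scaler does. It assumes synthetic feature values (the numbers are invented for illustration and are not taken from the dataset). The mean and standard deviation are estimated on the training rows only and then reused on the test rows, so no information from the test set leaks into preprocessing.

```python
import numpy as np

# Synthetic stand-in for the four features (values are invented for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[40.0, 6.0, 0.2, 2.0],
                    scale=[10.0, 1.5, 0.05, 0.8],
                    size=(200, 4))
X_tr, X_te = X_demo[:160], X_demo[160:]

# Statistics are estimated on the training rows only
mu_tr = X_tr.mean(axis=0)
sd_tr = X_tr.std(axis=0)          # population std (ddof=0), as StandardScaler uses

X_tr_s = (X_tr - mu_tr) / sd_tr
X_te_s = (X_te - mu_tr) / sd_tr   # test rows reuse the training statistics

print(X_tr_s.mean(axis=0).round(6))  # approximately 0 for every column
print(X_tr_s.std(axis=0).round(6))   # approximately 1 for every column
```

The test columns will generally not have mean exactly 0 or standard deviation exactly 1, which is expected.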
Define & Fit Bayesian Regression Model
Priors:
- α ∼ Normal(0, 1000): broad intercept prior for base cost.
- β ∼ Normal(0, 500): moderate uncertainty on each standardised feature effect.
- σ ∼ HalfNormal(500): positive residual noise scale.
Model: Linear predictor: μ = α + β·X_standardized captures how features drive cost.
Likelihood: Observed training_cost ∼ Normal(μ, σ).
Inference: We draw 2,000 posterior samples per chain (after 1,000 tuning steps) with target_accept=0.9; the higher acceptance target makes the sampler take smaller steps, which tends to reduce divergences.
import pymc3 as pm
with pm.Model() as model:
    # Register the features as a data container named "X" so they can be
    # swapped out later with pm.set_data for predictions on new inputs
    X_data = pm.Data("X", X_train_s)

    # Priors for intercept and coefficients
    α = pm.Normal("α", mu=0, sigma=1e3)
    β = pm.Normal("β", mu=0, sigma=500, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=500)

    # Linear predictor
    μ = α + pm.math.dot(X_data, β)

    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # Sample the posterior
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
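Before trusting a fit, it can help to see what the stated priors imply about costs. The sketch below is a prior predictive simulation in plain NumPy rather than PyMC3. It assumes hypothetical standardized features (`X_fake` is a made-up stand-in for `X_train_s`) and uses the prior scales listed above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_obs, n_feat = 500, 50, 4

# Hypothetical standardized features (stand-in for X_train_s)
X_fake = rng.normal(size=(n_obs, n_feat))

# Draw parameters from the priors stated above
alpha = rng.normal(0.0, 1000.0, size=n_sims)           # α ~ Normal(0, 1000)
beta = rng.normal(0.0, 500.0, size=(n_sims, n_feat))   # β ~ Normal(0, 500)
sigma = np.abs(rng.normal(0.0, 500.0, size=n_sims))    # σ ~ HalfNormal(500)

# Simulate costs: one row per prior draw, one column per observation
mu = alpha[:, None] + beta @ X_fake.T
y_sim = rng.normal(mu, sigma[:, None])

lo, hi = np.percentile(y_sim, [3, 97])
print(f"94% of prior-simulated costs fall between {lo:.0f} and {hi:.0f} USD")
```

Because every prior here is symmetric around zero, about half of the simulated costs are negative. If domain knowledge allows, centering the intercept prior nearer a typical monthly cost is one way to address that.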
Posterior Analysis & Point Predictions
- Posterior prediction: Posterior predictive sampling provides full predictive distributions, enabling quantification of cost uncertainty.
- Evaluation: Posterior means of α and β yield point predictions; mean absolute error (MAE) on the test set quantifies average forecast error.
import arviz as az
from sklearn.metrics import mean_absolute_error
# Summarize posterior distributions
az.summary(trace, round_to=2)
# Posterior predictive sampling
with model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Compute posterior means of parameters
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
# Point predictions on test set
y_pred = α_post + X_test_s.dot(β_post)
# Evaluate with MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
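MAE scores only the point predictions. A complementary check is whether predictive intervals cover the held-out costs about as often as their nominal level suggests. The sketch below illustrates the idea in NumPy using synthetic stand-ins for the posterior draws and test data (all values are invented). It uses an equal-tailed percentile interval, a simpler substitute for the HDI.

```python
import numpy as np

rng = np.random.default_rng(2)
n_draws, n_test, n_feat = 1000, 40, 4
true_beta = np.array([400.0, 250.0, 100.0, 50.0])

# Synthetic stand-ins for flattened posterior draws (invented values).
# With the fitted model these could come from, e.g.,
# trace.posterior["β"].values.reshape(-1, n_feat)
alpha_draws = rng.normal(3000.0, 20.0, size=n_draws)
beta_draws = rng.normal(true_beta, 15.0, size=(n_draws, n_feat))
sigma_draws = np.abs(rng.normal(200.0, 10.0, size=n_draws))

# Synthetic standardized test features and observed costs
X_te = rng.normal(size=(n_test, n_feat))
y_te = 3000.0 + X_te @ true_beta + rng.normal(0.0, 200.0, size=n_test)

# Predictive draws: parameter uncertainty plus observation noise
mu_draws = alpha_draws[:, None] + beta_draws @ X_te.T   # shape (n_draws, n_test)
y_draws = rng.normal(mu_draws, sigma_draws[:, None])

# Equal-tailed 94% interval per test athlete, and its empirical coverage
lower, upper = np.percentile(y_draws, [3, 97], axis=0)
coverage = np.mean((y_te >= lower) & (y_te <= upper))
print(f"Empirical coverage of the 94% interval: {coverage:.2f}")
```

Coverage far below the nominal level would suggest the model is overconfident; coverage far above it would suggest intervals that are wider than necessary.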
Visualise Predictions & Credible Intervals
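By sweeping one key feature (planned_hours) and holding the others at their median, we plot the posterior predictive mean cost curve and its 94% highest density interval (HDI). Because the interval is computed from posterior predictive samples, it reflects both parameter uncertainty and residual noise, so it shows how expected cost scales with hours and how uncertain an individual forecast is.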
import numpy as np
import matplotlib.pyplot as plt
# Vary planned_hours; hold other features at their median
hours_grid = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid = np.tile(np.median(X_train_s, axis=0), (100,1))
grid[:,0] = hours_grid
with model:
    pm.set_data({"X": grid})
    ppc_grid = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
preds = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd = az.hdi(preds, hdi_prob=0.94)
# Convert planned_hours back to original scale
hours_orig = scaler.inverse_transform(grid)[:, 0]
plt.figure(figsize=(8,5))
plt.plot(hours_orig, mean_pred, label="Posterior mean")
plt.fill_between(hours_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    scaler.inverse_transform(X_test_s)[:,0], y_test,
    color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Planned Training Hours")
plt.ylabel("Training Cost (USD)")
plt.title("Bayesian Regression: Cost vs. Planned Hours")
plt.legend()
plt.tight_layout()
plt.show()
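Since the model is fit on standardized features, each β is a cost change per one standard deviation of a feature, not per raw unit. The sketch below shows one way to convert coefficients back to original units, for example USD per planned hour. All numbers are hypothetical stand-ins; in the workflow above the corresponding values would be `scaler.mean_`, `scaler.scale_`, `α_post`, and `β_post`.

```python
import numpy as np

# Hypothetical values for illustration only
means = np.array([40.0, 6.0, 0.2, 2.0])            # stand-in for scaler.mean_
scales = np.array([10.0, 1.5, 0.05, 0.8])          # stand-in for scaler.scale_
alpha_std = 3000.0                                 # stand-in for α_post
beta_std = np.array([400.0, 250.0, 100.0, 50.0])   # stand-in for β_post

# μ = α + Σ β_j (x_j - mean_j) / scale_j, so the slope per raw unit is β_j / scale_j
beta_raw = beta_std / scales
alpha_raw = alpha_std - np.sum(beta_std * means / scales)

# Check: both parameterizations give the same prediction for a raw feature row
x_raw = np.array([55.0, 7.0, 0.25, 3.0])
pred_std = alpha_std + beta_std @ ((x_raw - means) / scales)
pred_raw = alpha_raw + beta_raw @ x_raw

print(beta_raw[0])                     # 40.0 USD per extra planned hour (made-up numbers)
print(np.isclose(pred_std, pred_raw))  # True
```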
Summary
This Bayesian Regression workflow for athlete training‐cost forecasting provides:
1. Point estimates of expected monthly training cost from planned hours, intensity, coach ratio, and nutrition level.
2. Credible intervals that quantify forecast uncertainty—enabling risk‐aware budgeting and resource allocation.
3. Actionable insights: program directors can set training budgets and staffing with confidence bounds, optimise session planning, and negotiate coaching rates under uncertainty.
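As an illustration of budgeting with confidence bounds, the sketch below turns per-athlete predictive draws into a distribution of total monthly cost and reads off a budget level. It assumes synthetic draws (all values invented). In practice `cost_draws` would be posterior predictive samples for the planned roster, and the 90th percentile is an arbitrary choice of risk tolerance.

```python
import numpy as np

rng = np.random.default_rng(3)
n_draws, n_athletes = 2000, 25

# Synthetic predictive draws of monthly cost per athlete (invented values)
expected_cost = rng.uniform(2000.0, 5000.0, size=n_athletes)
cost_draws = rng.normal(expected_cost, 300.0, size=(n_draws, n_athletes))

# Sum across athletes within each draw to get a distribution of total cost
total_draws = cost_draws.sum(axis=1)

expected_total = total_draws.mean()
budget_p90 = np.percentile(total_draws, 90)   # covers 90% of simulated outcomes

print(f"Expected total: {expected_total:,.0f} USD")
print(f"90th-percentile budget: {budget_p90:,.0f} USD")
```

The synthetic draws here are independent across athletes. With real posterior predictive samples, summing within each draw also carries over the shared parameter uncertainty, which typically widens the spread of the total.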