Fitness Program Cost Prediction using Bayesian Regression in ML


Wellness centres and corporate fitness providers need to forecast the per-participant cost of a multi-week fitness program before launching the next cohort, using early-enrollment metrics such as age, body mass index (BMI), initial fitness score, program duration, and attendance commitment level. Delivery costs can scale nonlinearly: older or higher-BMI participants may require more individualised coaching (increasing labour cost), and longer programs often earn volume discounts on facility usage. Moreover, uncertainty in actual attendance rates and staff overtime means that simple point estimates risk budget overruns. Bayesian Regression gives both a best-estimate cost per person and a credible interval that quantifies the uncertainty, which supports data-driven pricing, staffing, and resource allocation.

Libraries Required

import pandas as pd                              # data loading & manipulation  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # visualization  

import pymc3 as pm                               # Bayesian modeling  
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error  

Dataset

Gym Membership Dataset

Step-by-Step Code Implementation

Data Loading & Feature Engineering

  • We convert a MonthlyFee into a per-program cost (Program_Cost ≈ MonthlyFee × weeks × 7/30).
  • Predictors: Age, BMI, initial fitness level, program length, and attendance commitment.

import pandas as pd

# Load simulated gym membership data
df = pd.read_csv("data/gym-membership-dataset/Gym_Membership_Data.csv")

# Assume the dataset includes:
#   Age, BMI, Initial_Fitness_Score, Program_Duration_Weeks, Attendance_Commitment (%),
#   MonthlyFee (USD)
df = df[['Age','BMI','Initial_Fitness_Score',
         'Program_Duration_Weeks','Attendance_Commitment','MonthlyFee']].dropna()

# Convert MonthlyFee to a per-program cost (a month is approximated as 30 days)
df['Program_Cost'] = df['MonthlyFee'] * (df['Program_Duration_Weeks'] * 7 / 30)

# Select predictors and target
X = df[['Age','BMI','Initial_Fitness_Score',
        'Program_Duration_Weeks','Attendance_Commitment']].values
y = df['Program_Cost'].values  # USD per participant
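
As a quick sanity check on the conversion above, the snippet below works one example by hand. The helper name `program_cost` and the fee and duration values are made up for illustration; they are not taken from the dataset.

```python
def program_cost(monthly_fee: float, weeks: float) -> float:
    """Approximate cost of a program from a monthly fee.

    Treats a month as 30 days, so a program of `weeks` weeks
    spans weeks * 7 / 30 months.
    """
    months = weeks * 7 / 30
    return monthly_fee * months


# Hypothetical participant: $60 per month, 12-week program (2.8 months)
cost = program_cost(60.0, 12)
print(f"Program cost: ${cost:.2f}")  # prints "Program cost: $168.00"
```

Note that Program_Duration_Weeks enters both the target (through this formula) and the predictor matrix, so program length is likely to dominate the fitted coefficients.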

Preprocessing & Train/Test Split

Zero-mean and unit-scale each predictor so the Bayesian sampler converges reliably and the priors apply uniformly.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split data (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize predictors for stable MCMC
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)
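
To make the scaling step concrete, here is a minimal NumPy sketch of what standardisation does, assuming synthetic data in place of the real predictors. The means and spreads used to generate the data are arbitrary. The key point is that the mean and standard deviation are learned from the training split only and then reused for the test split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for two predictors (e.g. Age and BMI); not real data
X_train = rng.normal(loc=[40.0, 27.0], scale=[12.0, 4.0], size=(200, 2))
X_test = rng.normal(loc=[40.0, 27.0], scale=[12.0, 4.0], size=(50, 2))

# Learn the scaling from the training split only
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

X_train_s = (X_train - mu) / sd
X_test_s = (X_test - mu) / sd  # reuse training statistics: no test leakage

# Training columns are exactly centred and scaled; test columns only roughly
print(X_train_s.mean(axis=0).round(6), X_train_s.std(axis=0).round(6))
print(X_test_s.mean(axis=0).round(2), X_test_s.std(axis=0).round(2))
```

After this step a coefficient of 1 means "one USD per standard deviation of the predictor", which is why a single prior width can reasonably be shared across all slopes.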

Define & Fit Bayesian Regression Model

Model Priors:

  • α ∼ Normal(0, 50): broad intercept prior reflecting program‐cost scale.
  • β ∼ Normal(0, 20): moderate uncertainty on each standardised coefficient.
  • σ ∼ HalfNormal(20): positive residual‐noise scale.

Likelihood: Observed Program_Cost ∼ Normal(μ, σ), with μ=α+β·X_standardized.

Sampling: We draw 2,000 posterior samples per chain (after 1,000 tuning steps). Setting target_accept=0.9 makes the sampler take smaller steps, which tends to reduce divergences.

import pymc3 as pm

with pm.Model() as fitness_cost_model:
    # Priors
    α = pm.Normal("α", mu=0, sigma=50)                                # intercept
    β = pm.Normal("β", mu=0, sigma=20, shape=X_train_s.shape[1])      # slopes
    σ = pm.HalfNormal("σ", sigma=20)                                  # residual noise

    # Linear predictor
    μ = α + pm.math.dot(X_train_s, β)

    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # MCMC sampling
    trace = pm.sample(
        draws=2000, tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
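
The sampler explores a posterior that is proportional to prior times likelihood. The sketch below writes that quantity out in plain NumPy for the priors and likelihood listed above, purely as an illustration. The function name `log_posterior` and the synthetic data are assumptions, and this is not how PyMC3 evaluates the model internally.

```python
import numpy as np


def log_posterior(alpha, beta, sigma, X, y):
    """Unnormalised log posterior of the linear model described above."""
    if sigma <= 0:
        return -np.inf  # HalfNormal has no mass at or below zero

    def normal_logpdf(x, mu, sd):
        return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - mu) / sd) ** 2

    # Priors: alpha ~ Normal(0, 50), beta ~ Normal(0, 20), sigma ~ HalfNormal(20)
    lp = normal_logpdf(alpha, 0.0, 50.0)
    lp += normal_logpdf(beta, 0.0, 20.0).sum()
    lp += np.log(2.0) + normal_logpdf(sigma, 0.0, 20.0)  # half-normal density

    # Likelihood: y ~ Normal(alpha + X @ beta, sigma)
    mu = alpha + X @ beta
    lp += normal_logpdf(y, mu, sigma).sum()
    return float(lp)


# Synthetic, already-standardised data with known parameters (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
true_beta = np.array([5.0, 3.0, -2.0, 10.0, -4.0])
y = 30.0 + X @ true_beta + rng.normal(scale=8.0, size=100)

lp_true = log_posterior(30.0, true_beta, 8.0, X, y)
lp_zero = log_posterior(0.0, np.zeros(5), 8.0, X, y)
print(lp_true > lp_zero)  # parameters near the truth score higher
```

MCMC never needs the normalising constant: it only compares values of this function at different parameter settings, which is what makes sampling feasible.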

Posterior Analysis & Point Predictions

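  • Posterior Predictive: Sampling Y_obs yields predictive distributions for the training observations (the model was built on X_train_s), from which point forecasts (posterior means) and 94% highest-density intervals (HDI) can be computed as a check on model fit. Test-set point forecasts below use the posterior means of α and β.
  • Evaluation: Mean Absolute Error (MAE) on held-out test data quantifies the average point-forecast error.

import arviz as az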
from sklearn.metrics import mean_absolute_error

# Summarize posterior distributions
print(az.summary(trace, round_to=2))

# Posterior predictive sampling for the training observations (model-fit check)
with fitness_cost_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Compute posterior means of parameters
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values

# Point predictions on test set
y_pred = α_post + X_test_s.dot(β_post)

# Evaluate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
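
MAE only scores the point forecast. A complementary check is how often the held-out costs fall inside the predictive interval. The sketch below shows one way to compute that from posterior draws, assuming simulated stand-ins for the draws and the test data (none of the numbers come from the fitted model), and using an equal-tailed percentile interval rather than an HDI for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-ins for flattened posterior draws (not real model output)
n_draws, n_features, n_test = 2000, 5, 40
beta_center = np.array([2.0, 1.0, -1.0, 60.0, -3.0])
alpha_s = rng.normal(170.0, 1.0, size=n_draws)
beta_s = rng.normal(beta_center, 1.0, size=(n_draws, n_features))
sigma_s = np.abs(rng.normal(15.0, 1.0, size=n_draws))

# Hypothetical standardised test features and observed costs
X_test_s = rng.normal(size=(n_test, n_features))
y_test = 170.0 + X_test_s @ beta_center + rng.normal(0.0, 15.0, size=n_test)

# One mean prediction per draw and test row: shape (n_draws, n_test)
mu = alpha_s[:, None] + beta_s @ X_test_s.T
# Add observation noise to get posterior predictive draws
y_rep = rng.normal(mu, sigma_s[:, None])

point = mu.mean(axis=0)                         # posterior-mean forecast
lo, hi = np.percentile(y_rep, [3, 97], axis=0)  # equal-tailed 94% interval

mae = np.mean(np.abs(y_test - point))
coverage = np.mean((y_test >= lo) & (y_test <= hi))
print(f"MAE: ${mae:.2f}  |  94% interval coverage: {coverage:.0%}")
```

With the real model, the stand-in arrays would be replaced by the flattened draws in trace.posterior. Coverage far below 94% would suggest the intervals are too narrow to budget against.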

Visualise Predictions & Credible Intervals

By sweeping attendance commitment while holding the other features at their medians, we plot the posterior mean program-cost curve and its 94% credible band, showing how the predicted per-participant cost changes with commitment and how much uncertainty surrounds that estimate.

import numpy as np
import matplotlib.pyplot as plt

# Vary Attendance Commitment; hold other features at their median
commit_grid = np.linspace(X_train_s[:,4].min(), X_train_s[:,4].max(), 100)
grid = np.median(X_train_s, axis=0)[None,:].repeat(100, axis=0)
grid[:,4] = commit_grid

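# The model was built on X_train_s, so predictions for the grid are computed
# directly from the posterior draws of α, β, and σ
post = trace.posterior
α_s  = post["α"].values.reshape(-1)                  # (n_samples,)
β_s  = post["β"].values.reshape(-1, grid.shape[1])   # (n_samples, n_features)
σ_s  = post["σ"].values.reshape(-1)                  # (n_samples,)

mu_grid = α_s[:, None] + β_s @ grid.T                # (n_samples, 100)
rng     = np.random.default_rng(42)
preds   = rng.normal(mu_grid, σ_s[:, None])          # posterior predictive draws

mean_pred = mu_grid.mean(axis=0)
hpd       = az.hdi(preds[None, ...], hdi_prob=0.94)  # add a chain axis -> (100, 2)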

# Back-transform attendance commitment
commit_orig = scaler.inverse_transform(grid)[:,4]

plt.figure(figsize=(8,5))
plt.plot(commit_orig, mean_pred, label="Posterior mean")
plt.fill_between(commit_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    X_test[:,4],  # unscaled attendance commitment
    y_test, color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Attendance Commitment (%)")
plt.ylabel("Program Cost per Participant (USD)")
plt.title("Bayesian Regression: Cost vs. Attendance Commitment")
plt.legend()
plt.tight_layout()
plt.show()
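
For readers unfamiliar with the term, a highest-density interval is the narrowest interval that contains the requested share of the samples. The sketch below is a simplified, single-interval version for illustration only. The function name `hdi` and the skewed synthetic draws are assumptions, and `az.hdi` remains the tool used in this tutorial.

```python
import numpy as np


def hdi(samples, prob=0.94):
    """Narrowest interval containing `prob` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(prob * n))            # number of samples inside the interval
    widths = x[k - 1:] - x[: n - k + 1]   # width of every window of k samples
    i = int(np.argmin(widths))            # left edge of the narrowest window
    return x[i], x[i + k - 1]


rng = np.random.default_rng(3)
draws = rng.gamma(shape=2.0, scale=50.0, size=5000)  # skewed, illustrative only

low, high = hdi(draws)
eq_low, eq_high = np.percentile(draws, [3, 97])
print(f"HDI:          [{low:.1f}, {high:.1f}]  width {high - low:.1f}")
print(f"Equal-tailed: [{eq_low:.1f}, {eq_high:.1f}]  width {eq_high - eq_low:.1f}")
```

For a symmetric posterior the two intervals nearly coincide. For a skewed one, the HDI shifts toward the dense region and is never wider than the equal-tailed interval.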

Summary

This Bayesian Regression workflow for Fitness Program Cost Prediction delivers:

1. Point estimates of participant-level program cost from early enrollment metrics, with accuracy measured by MAE on held-out data.

2. Credible intervals that quantify forecasting uncertainty, which is crucial for budget risk management.

3. Actionable insights: fitness operators can set program prices, allocate coaching staff, and negotiate facility contracts with an explicit view of the likely cost range, supporting both profitability and participant outcomes.


ProjectGurukul Team
