Fitness Session Cost Prediction using Linear Regression in ML


Personal-training studios, boutique group fitness brands, and franchise gyms all face the same pricing puzzle: “What should we charge for the next session?”

An intelligent guess depends on many moving parts—session length, format (e.g., HIIT vs. yoga), trainer seniority, class size, time of day, the client’s membership tier, and even day-of-week demand. In this hands-on walkthrough, we build a linear regression baseline that predicts a single session’s revenue (SessionCostUSD) from readily logged attributes.

Although most chains eventually adopt segmented or dynamic‑pricing engines, a transparent linear fit reveals exactly how strongly each lever tugs price and supplies the yardstick every fancier model must beat.

Libraries Required

pandas # data wrangling
numpy # numerical helpers
matplotlib.pyplot # quick EDA plots (optional)
scikit‑learn # preprocessing, model, metrics
joblib # persist the trained pipeline

Dataset Link

Gym Members Exercise Dataset

Step-by-Step Code Implementation

Why linear regression first? Studios typically price sessions by adding surcharges to a base rate—longer workouts, smaller groups, peak‑time slots, and premium membership tiers all add dollars in a largely additive fashion. OLS captures this cleanly and spits out coefficients that managers can sanity‑check.
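To make the additive intuition concrete, here is a minimal sketch on synthetic data. The pricing rule ($20 base, $0.40 per minute, $8 peak surcharge) and the variable names are invented for illustration and are not taken from the dataset; the point is only that OLS recovers each surcharge as a separate coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical pricing rule (illustrative numbers, not from the dataset):
# $20 base + $0.40 per minute + $8 peak-slot surcharge, plus small noise.
duration = rng.integers(30, 91, size=n)   # session length in minutes
peak = rng.integers(0, 2, size=n)         # 1 = peak-time slot, 0 = off-peak
price = 20 + 0.40 * duration + 8 * peak + rng.normal(0, 1.0, size=n)

X_demo = np.column_stack([duration, peak])
ols = LinearRegression().fit(X_demo, price)

# Each fitted number should land close to the surcharge that generated it.
print("base rate      :", round(float(ols.intercept_), 2))
print("per minute     :", round(float(ols.coef_[0]), 2))
print("peak surcharge :", round(float(ols.coef_[1]), 2))
```

If real prices were strongly non-additive (say, the peak surcharge doubling for small groups), these coefficients would blur together, which is one reason to treat the linear fit as a baseline rather than the final model.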

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the dataset

Download gym_members_exercise_dataset.csv from Kaggle (see the Dataset Link section above) and point to your local path:

df = pd.read_csv("gym_members_exercise_dataset.csv")
print(df.head())

Relevant columns (the dataset already contains a synthetic SessionCostUSD field):

Column	Example value
SessionCostUSD	42.50
SessionType	Cardio / Yoga / HIIT …
DurationMin	55
TrainerExperience	4 (years)
GroupSize	10
TimeOfDay	Morning / Afternoon / Eve
MemberTier	Pay‑As‑You‑Go / Gold / VIP
Weekday	Mon … Sun

3. Minimal cleaning

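This step only drops rows that are missing any of the core fields. Scaling comes later, inside the pipeline: StandardScaler puts duration, group size, and trainer seniority on a common unit‑variance scale so their coefficients are directly comparable.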

core = ['SessionCostUSD','SessionType','DurationMin','TrainerExperience',
        'GroupSize','TimeOfDay','MemberTier','Weekday']
df   = df.dropna(subset=core).copy()
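Before dropping rows, it can help to see how many values are actually missing per column. The sketch below uses a tiny made-up frame (the Notes column and all values are hypothetical) to show that dropna(subset=...) only looks at the listed columns and leaves gaps in other columns alone.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; "Notes" is a hypothetical non-core column.
toy = pd.DataFrame({
    "SessionCostUSD": [42.5, np.nan, 38.0, 55.0],
    "DurationMin":    [55, 45, np.nan, 60],
    "Notes":          [None, "walk-in", None, None],
})
core_demo = ["SessionCostUSD", "DurationMin"]

# Count missing values in the columns the model depends on.
print(toy[core_demo].isna().sum())

# Drop rows missing any core field; missing Notes do not trigger a drop.
cleaned = toy.dropna(subset=core_demo).copy()
print(f"rows kept: {len(cleaned)} of {len(toy)}")
```

If a large share of rows disappears at this step, imputation may be worth considering instead of dropping, since the rows that remain may no longer be representative.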

4. Define predictors & label

num_cols = ['DurationMin', 'TrainerExperience', 'GroupSize']
cat_cols = ['SessionType', 'TimeOfDay', 'MemberTier', 'Weekday']
target   = 'SessionCostUSD'

X = df[num_cols + cat_cols]
y = df[target]

5. Pre‑processing & model pipeline

ColumnTransformer applies scaling to numeric inputs and one‑hot encoding to categoricals inside a single pipeline, so exactly the same preprocessing runs at training time and at inference.

preprocess = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg   = LinearRegression()

pipe = Pipeline(steps=[
        ('prep',  preprocess),
        ('model', linreg)
])

6. Train‑test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)

7. Evaluation

R² and MAE together show both the share of price variance we capture and the average absolute pricing error (e.g., “we’re typically off by about $4.10 on a $45 session”).

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):.2f} per session")
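To see what the two metrics measure, the following sketch computes them by hand on five made-up prices and predictions and checks the result against scikit-learn. It also adds a "predict the mean price" baseline, which is an assumption added here for comparison and not part of the original walkthrough.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up session prices and predictions, purely for illustration.
y_true = np.array([40.0, 45.0, 50.0, 55.0, 60.0])
y_hat = np.array([42.0, 44.0, 49.0, 58.0, 57.0])

# MAE: average absolute dollar miss per session.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Naive baseline: always quote the mean price.
baseline_mae = np.mean(np.abs(y_true - y_true.mean()))

print(f"MAE  manual={mae_manual:.2f}  sklearn={mean_absolute_error(y_true, y_hat):.2f}")
print(f"R²   manual={r2_manual:.3f}  sklearn={r2_score(y_true, y_hat):.3f}")
print(f"Mean-price baseline MAE: {baseline_mae:.2f}")
```

On these numbers the model's MAE is 2.00 against 6.00 for the mean-price baseline, and R² is 0.904. Reporting the baseline alongside the model makes it clear how much of the accuracy comes from the features.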

8. Interpret cost drivers

The coefficient table turns the fit into actionable levers: if SessionType_HIIT adds $6 while TimeOfDay_Morning subtracts $2, the revenue and scheduling teams can rebalance the timetable accordingly.

# recover encoded feature names
ohe_names = (pipe.named_steps['prep']
                  .named_transformers_['cat']
                  .get_feature_names_out(cat_cols))

all_feats = list(ohe_names) + num_cols
coefs     = pd.Series(pipe.named_steps['model'].coef_,
                      index=all_feats).sort_values()

print("\nDiscount factors (negative coefficients):")
print(coefs.head(6))

print("\nPremium factors (positive coefficients):")
print(coefs.tail(6))

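Because numeric inputs are z‑scored, each numeric coefficient reads as the $/session change for a 1 σ shift in that feature. The one‑hot coefficients need more care: OneHotEncoder as configured here keeps every level, so there is no reference level, and alongside the intercept the individual values are not uniquely determined. Compare levels within a feature (e.g., HIIT minus Yoga) rather than reading each one as an absolute bump. Passing drop='first' to OneHotEncoder would make each coefficient a dollar difference versus the dropped level.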
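A per-σ coefficient can be converted back to a per-unit figure (dollars per minute, say) by dividing by the standard deviation the scaler learned. Here is a minimal sketch on synthetic data; the $0.50-per-minute rule and the step names "scale" and "model" are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical rule: $15 base + $0.50 per minute, plus noise.
duration = rng.integers(30, 91, size=300).astype(float)
price = 15 + 0.5 * duration + rng.normal(0, 0.5, size=300)

demo = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
demo.fit(duration.reshape(-1, 1), price)

per_sigma = demo.named_steps["model"].coef_[0]   # $ per 1 σ of duration
sigma = demo.named_steps["scale"].scale_[0]      # σ of duration, in minutes
per_minute = per_sigma / sigma                   # $ per minute

print(f"$ per 1 σ   : {per_sigma:.2f}")
print(f"σ (minutes) : {sigma:.2f}")
print(f"$ per minute: {per_minute:.3f}")
```

The per-σ view is better for ranking features against each other; the per-unit view is usually what a manager wants when setting an actual surcharge.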

9. Persist the pipeline

Persisting the pipeline with joblib means tomorrow’s booking system or quoting API can call joblib.load("fitness_session_cost_linreg.pkl"), feed in the next customer’s session details, and return a live price in milliseconds.

joblib.dump(pipe, "fitness_session_cost_linreg.pkl")
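The round trip can be sketched end to end as follows. To stay self-contained, the example fits a tiny two-feature pipeline on invented rows and writes it to a temporary file; the file name demo_linreg.pkl, the data, and the variable names are all hypothetical stand-ins for the real pipeline and path.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented booking history, just enough rows to fit a toy pipeline.
hist = pd.DataFrame({
    "SessionType": ["HIIT", "Yoga", "HIIT", "Yoga", "Cardio", "Cardio"],
    "DurationMin": [45, 60, 60, 45, 30, 60],
    "SessionCostUSD": [40.0, 38.0, 46.0, 32.0, 25.0, 37.0],
})
features = ["SessionType", "DurationMin"]

toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["SessionType"]),
        ("num", StandardScaler(), ["DurationMin"]),
    ])),
    ("model", LinearRegression()),
])
toy_pipe.fit(hist[features], hist["SessionCostUSD"])

# Save to a temporary location, then reload as a separate service would.
path = os.path.join(tempfile.mkdtemp(), "demo_linreg.pkl")
joblib.dump(toy_pipe, path)
loaded = joblib.load(path)

# New sessions must arrive as a DataFrame with the training column names.
new_session = pd.DataFrame([{"SessionType": "HIIT", "DurationMin": 50}])
quote = float(loaded.predict(new_session)[0])
print(f"Quoted price: ${quote:.2f}")
```

Two practical notes: joblib files are pickles, so only load files from a source you trust, and load them with the same scikit-learn version that saved them where possible.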

Summary

With fewer than 120 lines of Python, we turned raw gym‑booking logs into an explainable session‑pricing engine:

Instant quotes—sales reps and booking widgets give customers a fair, data‑backed price on the spot.
Crystal-clear price levers—owners see exactly how duration, intensity, trainer seniority, and membership tier affect the dollars.

Keep this interpretable baseline on file; when you test regularised regression, gradient‑boosted trees, or even reinforcement‑learning price optimisers, you’ll know precisely how much predictive (and dollar) uplift each layer of complexity delivers.
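One way to measure that uplift is to score every candidate with the same cross-validated MAE as the baseline. The sketch below does this for plain OLS and a Ridge model on synthetic data; the data-generating coefficients, the alpha value, and the helper cv_mae are illustrative assumptions, not results from the gym dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Synthetic stand-in for the preprocessed feature matrix and prices.
X_syn = rng.normal(size=(400, 5))
true_effects = np.array([6.0, -2.0, 3.0, 0.0, 1.5])   # invented dollar effects
y_syn = 45 + X_syn @ true_effects + rng.normal(0, 2.0, size=400)

def cv_mae(model):
    """Mean absolute error in dollars, averaged over 5 cross-validation folds."""
    scores = cross_val_score(model, X_syn, y_syn, cv=5,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()

baseline_mae = cv_mae(LinearRegression())
ridge_mae = cv_mae(Ridge(alpha=1.0))

print(f"OLS baseline MAE : ${baseline_mae:.2f}")
print(f"Ridge MAE        : ${ridge_mae:.2f}")
print(f"Uplift vs. OLS   : ${baseline_mae - ridge_mae:.2f} per session")
```

On data that really is linear, as here, a more complex model should show little or no uplift; a large gap on the real dataset would suggest interactions or non-linear effects that the baseline misses.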
