Fitness Session Cost Prediction using Linear Regression in ML
Personal-training studios, boutique group fitness brands, and franchise gyms all face the same pricing puzzle: “What should we charge for the next session?”
An intelligent guess depends on many moving parts—session length, format (e.g., HIIT vs. yoga), trainer seniority, class size, time of day, the client’s membership tier, and even day-of-week demand. In this hands-on walkthrough, we build a linear regression baseline that predicts a single session’s revenue (SessionCostUSD) from readily logged attributes.
Although most chains eventually adopt segmented or dynamic‑pricing engines, a transparent linear fit reveals exactly how strongly each lever tugs price and supplies the yardstick every fancier model must beat.
Libraries Required
- pandas # data wrangling
- numpy # numerical helpers
- matplotlib.pyplot # quick EDA plots (optional)
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained pipeline
Dataset Link
Step-by-Step Code Implementation
Why linear regression first? Studios typically price sessions by adding surcharges to a base rate—longer workouts, smaller groups, peak‑time slots, and premium membership tiers all add dollars in a largely additive fashion. OLS captures this cleanly and spits out coefficients that managers can sanity‑check.
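To make the additive-surcharge idea concrete, here is a minimal sketch with made-up numbers: a hypothetical $20 base rate, $0.30 per minute, and a $5 peak-time surcharge, none of which come from the dataset. When prices really are built this way, ordinary least squares should recover each surcharge:

```python
import numpy as np

# Hypothetical pricing rule, invented for illustration only:
# price = 20 base + 0.30 per minute + 5 if the slot is at peak time.
duration_min = np.array([30, 45, 60, 45, 60, 90], dtype=float)
is_peak = np.array([0, 1, 0, 0, 1, 1], dtype=float)
price = 20.0 + 0.30 * duration_min + 5.0 * is_peak

# Design matrix with an explicit intercept column.
X_toy = np.column_stack([np.ones_like(duration_min), duration_min, is_peak])

# Ordinary least squares via NumPy's least-squares solver.
coef, *_ = np.linalg.lstsq(X_toy, price, rcond=None)
base_rate, per_minute, peak_surcharge = coef

print(f"base rate      : {base_rate:.2f}")
print(f"per minute     : {per_minute:.2f}")
print(f"peak surcharge : {peak_surcharge:.2f}")
```

Real booking data is noisy, so the fitted coefficients will only approximate whatever surcharges a studio actually applies.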
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the dataset
Download gym_members_exercise_dataset.csv from Kaggle (see the Dataset Link section above) and point pd.read_csv at your local path:
df = pd.read_csv("gym_members_exercise_dataset.csv")
print(df.head())
Relevant columns (the dataset already contains a synthetic SessionCostUSD field):
| Column | Example value or levels |
| --- | --- |
| SessionCostUSD | 42.50 |
| SessionType | Cardio / Yoga / HIIT … |
| DurationMin | 55 |
| TrainerExperience | 4 (years) |
| GroupSize | 10 |
| TimeOfDay | Morning / Afternoon / Eve |
| MemberTier | Pay‑As‑You‑Go / Gold / VIP |
| Weekday | Mon … Sun |
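Before modeling, it can help to confirm that the file you downloaded actually contains these columns. The sketch below is one way to do that; the helper name missing_columns and the toy frame are hypothetical and stand in for the real CSV:

```python
import pandas as pd

# Column names taken from the list of relevant columns above.
EXPECTED_COLUMNS = [
    "SessionCostUSD", "SessionType", "DurationMin", "TrainerExperience",
    "GroupSize", "TimeOfDay", "MemberTier", "Weekday",
]

def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected columns that are absent from `frame`."""
    return [col for col in expected if col not in frame.columns]

# Toy frame standing in for the real CSV; it deliberately lacks `Weekday`.
toy = pd.DataFrame({
    "SessionCostUSD": [42.50],
    "SessionType": ["HIIT"],
    "DurationMin": [55],
    "TrainerExperience": [4],
    "GroupSize": [10],
    "TimeOfDay": ["Morning"],
    "MemberTier": ["Gold"],
})

absent = missing_columns(toy)
if absent:
    print(f"Missing columns: {absent}")
```

On the real data you would call missing_columns(df) right after pd.read_csv and stop early if the list is not empty.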
3. Minimal cleaning
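The only cleaning needed here is dropping rows with a missing value in any modeled column. Scaling happens later, inside the pipeline, where StandardScaler puts duration, group size, and trainer seniority on a common scale so their coefficients can be compared.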
core = ['SessionCostUSD','SessionType','DurationMin','TrainerExperience',
'GroupSize','TimeOfDay','MemberTier','Weekday']
df = df.dropna(subset=core).copy()
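As a small illustration of what dropna(subset=...) does, the sketch below uses a made-up three-row frame. The Notes column is hypothetical; it shows that missing values outside the modeled columns do not cause a row to be dropped:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; `Notes` is not part of the modeled columns.
toy = pd.DataFrame({
    "SessionCostUSD": [42.5, np.nan, 38.0],
    "DurationMin": [55, 45, np.nan],
    "Notes": [None, "late start", None],
})

core_toy = ["SessionCostUSD", "DurationMin"]

# Only rows with a missing value in one of `core_toy` are removed.
cleaned = toy.dropna(subset=core_toy).copy()

print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Printing the before and after counts on the real data is a cheap way to notice if the cleaning step discards more rows than expected.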
4. Define predictors & label
num_cols = ['DurationMin', 'TrainerExperience', 'GroupSize']
cat_cols = ['SessionType', 'TimeOfDay', 'MemberTier', 'Weekday']
target = 'SessionCostUSD'
X = df[num_cols + cat_cols]
y = df[target]
5. Pre‑processing & model pipeline
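ColumnTransformer applies scaling to the numeric inputs and one‑hot encoding to the categoricals in a single step. Because it lives inside the pipeline, the preprocessing fitted on the training data is reused unchanged at inference, so there is no risk of mismatched preprocessing.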
preprocess = ColumnTransformer([
('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline(steps=[
('prep', preprocess),
('model', linreg)
])
6. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
7. Evaluation
R² and MAE together show the share of price variance the model captures and its average absolute pricing error (for example, a hypothetical “we’re typically within ±$4.10 on a $45 session”).
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):.2f} per session")
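For readers who want to see what the two metrics measure, the sketch below computes them by hand with NumPy on four made-up prices. The numbers are purely illustrative and unrelated to the dataset:

```python
import numpy as np

# Made-up actual and predicted prices for four sessions.
y_true = np.array([40.0, 45.0, 50.0, 55.0])
y_hat = np.array([42.0, 44.0, 49.0, 58.0])

# MAE: the average size of the pricing error, in dollars.
mae = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus the ratio of residual variation to total variation.
ss_res = np.sum((y_true - y_hat) ** 2)          # 4 + 1 + 1 + 9 = 15
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 125 around a mean of 47.5
r2 = 1.0 - ss_res / ss_tot

print(f"MAE: ${mae:.2f}")  # MAE: $1.75
print(f"R² : {r2:.3f}")    # R² : 0.880
```

These are the same quantities that mean_absolute_error and r2_score return when called with their default arguments.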
8. Interpret cost drivers
The coefficient table points to concrete actions: if, say, SessionType_HIIT adds $6 while TimeOfDay_Morning subtracts $2, the revenue and scheduling teams can rebalance the timetable accordingly.
# recover encoded feature names
ohe_names = (pipe.named_steps['prep']
.named_transformers_['cat']
.get_feature_names_out(cat_cols))
all_feats = list(ohe_names) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_,
index=all_feats).sort_values()
print("\nLowest coefficients (discount factors where negative):")
print(coefs.head(6))
print("\nHighest coefficients (premium factors where positive):")
print(coefs.tail(6))
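Because numeric inputs are z‑scored, each numeric coefficient reads as the $/session change for a 1 σ shift in that feature. The one‑hot coefficients need more care: as configured here, OneHotEncoder keeps every level, so there is no reference level, and the columns of each categorical feature sum to one and are collinear with the intercept. Only differences between levels of the same feature (say, HIIT minus Yoga) are well defined. To read each coefficient as a dollar bump versus a reference level, drop one level per feature, for example with the encoder's drop='first' option.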
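A per-σ coefficient can be converted back to natural units by dividing it by the standard deviation the scaler learned. The sketch below assumes a made-up rule of $0.50 per minute on top of a $20 base, and the toy_pipe name is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rule for illustration: price = 20 + 0.50 per minute.
duration = np.array([[30.0], [45.0], [60.0], [75.0], [90.0]])
price = 20.0 + 0.50 * duration.ravel()

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
toy_pipe.fit(duration, price)

# Coefficient on the z-scored feature: dollars per 1 sigma of duration.
per_sigma = toy_pipe.named_steps["model"].coef_[0]
# Standard deviation of duration as learned by the scaler, in minutes.
sigma = toy_pipe.named_steps["scale"].scale_[0]
# Dividing the two gives dollars per minute.
per_minute = per_sigma / sigma

print(f"${per_sigma:.2f} per sigma, sigma = {sigma:.2f} min, "
      f"${per_minute:.2f} per minute")
```

Reporting both versions can be useful: the per-σ figure compares levers against each other, while the per-unit figure is what a manager would put on a price sheet.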
9. Persist the pipeline
Joblib persistence means tomorrow’s booking system or quoting API can call joblib.load("fitness_session_cost_linreg.pkl"), feed in the next customer’s session details, and return a price without retraining the model.
joblib.dump(pipe, "fitness_session_cost_linreg.pkl")
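The round trip from saved file to quote might look like the sketch below. It trains a toy two-feature pipeline on made-up rows and writes to a temporary directory, so the file name, data, and toy_pipe are all hypothetical stand-ins for the real pipeline:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows for illustration only.
train = pd.DataFrame({
    "SessionType": ["HIIT", "Yoga", "HIIT", "Cardio", "Yoga", "Cardio"],
    "DurationMin": [45, 60, 30, 45, 90, 60],
    "SessionCostUSD": [48.0, 40.0, 41.0, 38.0, 52.0, 44.0],
})
features = ["SessionType", "DurationMin"]

toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["SessionType"]),
        ("num", StandardScaler(), ["DurationMin"]),
    ])),
    ("model", LinearRegression()),
])
toy_pipe.fit(train[features], train["SessionCostUSD"])

# Save and reload, as a separate quoting service would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_session_cost.pkl")
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

# A new booking arrives as a one-row frame with the training column names.
new_session = pd.DataFrame([{"SessionType": "HIIT", "DurationMin": 60}])
quote = restored.predict(new_session)[0]
print(f"quoted price: ${quote:.2f}")
```

Two practical cautions: joblib files are pickles, so only load files from a source you trust, and load them with the same scikit-learn version that produced them where possible.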
Summary
With fewer than 120 lines of Python, we turned raw gym‑booking logs into an explainable session‑pricing engine:
- Instant quotes—sales reps and booking widgets give customers a fair, data‑backed price on the spot.
- Crystal-clear price levers—owners see exactly how duration, intensity, trainer seniority, and membership tier affect the dollars.
Keep this interpretable baseline on file; when you test regularised regression, gradient‑boosted trees, or even reinforcement‑learning price optimisers, you’ll know precisely how much predictive (and dollar) uplift each layer of complexity delivers.