Athlete Training Score Prediction using Linear Regression in ML
Sports scientists regularly collect physiology and workload metrics—training hours, average heart rate, sleep quality, VO₂ max—to gauge how well an athlete is adapting to a programme. A quick, numeric “training score” (e.g., 0–100) makes it easy to spot under‑ or over‑training at a glance.
In this project, we develop a linear regression baseline that predicts an athlete’s next training session score based on yesterday’s workload and recovery data. While high-performance teams eventually graduate to non-linear or longitudinal models, a transparent linear fit reveals the first-order levers that raise or sink readiness.
Libraries Required
- pandas # data wrangling
- numpy # numerical helpers
- matplotlib.pyplot # sanity‑check visuals
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained pipeline
Dataset Link
Multimodal Athlete Performance Sensor Dataset
Step-by-Step Code Implementation
Why linear regression first? A reasonable first assumption is that training score rises roughly in proportion to restorative factors (sleep, protein) and falls roughly linearly with fatigue markers (long hours, high resting HR). A straight‑line fit then puts numbers on the trade‑offs the model implies, e.g., “one extra hour’s sleep offsets 45 minutes of extra workload.”
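The trade-off arithmetic behind a statement like that can be sketched in a few lines. The two coefficients below are hypothetical, chosen only so the numbers match the quoted example; they are not fitted values, and "hours of sleep" is an illustrative feature rather than a column in the dataset.

```python
# Hypothetical raw-unit coefficients (score points per unit); NOT fitted values.
beta_sleep_per_hour = 3.0      # points gained per extra hour of sleep
beta_workload_per_hour = -4.0  # points lost per extra hour of training


def offsetting_workload_hours(extra_sleep_hours: float) -> float:
    """Extra training hours whose penalty exactly cancels the gain from extra sleep.

    Setting beta_sleep * d_sleep + beta_workload * d_workload = 0 and solving
    for d_workload gives d_workload = -beta_sleep * d_sleep / beta_workload.
    """
    return -beta_sleep_per_hour * extra_sleep_hours / beta_workload_per_hour


minutes = offsetting_workload_hours(1.0) * 60
print(f"1 h of extra sleep offsets about {minutes:.0f} min of extra workload")
```

In a linear model the exchange rate between two features is simply the ratio of their coefficients, which is what makes this kind of briefing possible.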
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
df = pd.read_csv("athlete_performance.csv")
print(df.head())
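Before going further, it can help to confirm that the file really contains the columns this walkthrough relies on. The following is a minimal sketch: `check_schema` is a hypothetical helper (not part of pandas), and the small in-memory frame with made-up values stands in for `athlete_performance.csv`.

```python
import pandas as pd

# Column names taken from the "Example columns" table in this walkthrough.
EXPECTED_COLUMNS = [
    "session_date", "training_hours", "avg_heart_rate", "sleep_quality",
    "protein_intake_g", "resting_hr", "vo2_max", "training_score_next",
]


def check_schema(frame: pd.DataFrame) -> pd.Series:
    """Raise if an expected column is absent; otherwise return NaN counts per column."""
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame[EXPECTED_COLUMNS].isna().sum()


# Tiny in-memory stand-in for the CSV (made-up values, one missing cell).
sample = pd.DataFrame({
    "session_date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "training_hours": [1.5, 2.0, 0.0],
    "avg_heart_rate": [142, 150, 110],
    "sleep_quality": [4, None, 5],
    "protein_intake_g": [120, 95, 130],
    "resting_hr": [52, 55, 50],
    "vo2_max": [58.1, 58.0, 58.3],
    "training_score_next": [78, 64, 85],
})

na_counts = check_schema(sample)
print(na_counts[na_counts > 0])  # columns that contain missing cells
```

With the real file, the same call would be `check_schema(df)` right after `pd.read_csv`.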
Example columns
| column | description |
| --- | --- |
| session_date | ISO date of training session |
| training_hours | total on‑feet time (h) |
| avg_heart_rate | average heart rate during the session (bpm) |
| sleep_quality | 1–5 subjective score |
| protein_intake_g | grams in the previous 24 h |
| resting_hr | morning resting heart rate (bpm) |
| vo2_max | VO₂ max (mL kg⁻¹ min⁻¹) |
| training_score_next | label – coach‑graded readiness 0–100 |
3. Feature engineering
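A calendar cue (dayofweek) lets the model pick up weekly micro-cycles (e.g., Sunday as a rest day) without any extra data feed. Note that as a single numeric column it can only express a straight-line trend from Monday to Sunday; an effect specific to one day needs a categorical encoding.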
# convert date and derive calendar cues (optional)
df['session_date'] = pd.to_datetime(df['session_date'])
df['dayofweek'] = df['session_date'].dt.dayofweek # 0‑Mon … 6‑Sun
# drop rows with missing critical cells
core = ['training_hours', 'avg_heart_rate', 'sleep_quality',
        'protein_intake_g', 'resting_hr', 'vo2_max',
        'training_score_next']
df = df.dropna(subset=core)
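If a particular day should get its own effect, one option is to one-hot encode `dayofweek` instead of scaling it as a number. The sketch below is an alternative to, not a description of, the pipeline used in this walkthrough; the 28-day frame is synthetic and `preproc_onehot` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic four-week log; 2024-01-01 is a Monday.
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=28, freq="D")
demo = pd.DataFrame({
    "session_date": dates,
    "training_hours": rng.uniform(0.5, 3.0, size=len(dates)),
})
demo["dayofweek"] = demo["session_date"].dt.dayofweek  # 0 = Mon ... 6 = Sun

# Scale the continuous feature, but give every weekday its own indicator column.
preproc_onehot = ColumnTransformer([
    ("scale", StandardScaler(), ["training_hours"]),
    ("dow", OneHotEncoder(handle_unknown="ignore"), ["dayofweek"]),
])

encoded = preproc_onehot.fit_transform(demo[["training_hours", "dayofweek"]])
print(encoded.shape)  # one scaled column plus seven weekday indicators
```

The cost is six extra coefficients, so this is mainly worthwhile when there is enough data per weekday to estimate them.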
4. Define predictors & label
Once the model is fitted, its coefficient table will highlight the key levers: if training_hours carries the heaviest negative weight, coaches may cap volume; if sleep_quality yields a positive coefficient, recovery protocols receive greater emphasis.
num_cols = ['training_hours', 'avg_heart_rate', 'sleep_quality',
            'protein_intake_g', 'resting_hr', 'vo2_max', 'dayofweek']
X = df[num_cols]
y = df['training_score_next']
5. Pre‑processing & model pipeline
Standard scaling puts every numeric variable on equal footing, so coefficients read as score points per 1 σ change—a clean way to brief coaches.
preproc = ColumnTransformer([
    ('scale', StandardScaler(), num_cols)
])  # only numeric features, scaled for coefficient comparison
linreg = LinearRegression()
pipe = Pipeline(steps=[
    ('prep', preproc),
    ('model', linreg)
])
6. Train‑test split & training
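A shuffled train–test split is acceptable only if each row can be treated as an independent session. If the same athlete contributes consecutive days, shuffling can place neighbouring sessions on both sides of the split and make the test score look better than it is; in that case, split chronologically (or by athlete) instead.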
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
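When sessions are ordered in time, a forward-chaining split is one way to avoid training on the future. The sketch below uses scikit-learn's `TimeSeriesSplit` on synthetic data; the data-generating formula and the reduced two-feature set are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic daily log for one athlete (made-up relationship plus noise).
rng = np.random.default_rng(42)
n = 120
demo = pd.DataFrame({
    "session_date": pd.date_range("2024-01-01", periods=n, freq="D"),
    "training_hours": rng.uniform(0.5, 3.0, n),
    "sleep_quality": rng.integers(1, 6, n),
})
demo["training_score_next"] = (
    70 - 5 * demo["training_hours"] + 4 * demo["sleep_quality"]
    + rng.normal(0, 2, n)
)

# Rows must be in chronological order for the split to be meaningful.
demo = demo.sort_values("session_date")
X_demo = demo[["training_hours", "sleep_quality"]].to_numpy()
y_demo = demo["training_score_next"].to_numpy()

fold_mae = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X_demo):
    # Every training row precedes every test row in each fold.
    assert train_idx.max() < test_idx.min()
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_mae.append(
        mean_absolute_error(y_demo[test_idx], model.predict(X_demo[test_idx]))
    )

print([round(m, 2) for m in fold_mae])
```

For logs that mix several athletes, a grouped split such as `GroupKFold` keyed on an athlete identifier would be the analogous choice; the example dataset schema above does not list such a column, so that is left as an assumption.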
7. Evaluation
Performance metrics – R² tells us how much variance we capture; MAE in points answers “How far off is our average guess?”—handy when 5 points equals “one difficulty zone.”
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} score points")
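R² and MAE are easier to judge next to a naive reference. One common check, sketched here on synthetic data, is to compare the linear model with a `DummyRegressor` that always predicts the training-set mean. The feature matrix and weights below are made up for the demonstration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic standardized features with a known linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 60 + X_demo @ np.array([-4.0, 3.0, 1.0]) + rng.normal(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)  # always predicts mean
linear = LinearRegression().fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_linear = mean_absolute_error(y_te, linear.predict(X_te))
r2_linear = r2_score(y_te, linear.predict(X_te))

print(f"MAE baseline: {mae_baseline:.2f}  MAE linear: {mae_linear:.2f}")
print(f"R² linear   : {r2_linear:.3f}")
```

If the linear model's MAE is not clearly below the baseline's, the features carry little usable signal and the coefficients should not be over-interpreted.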
8. Interpret coefficients
coef_series = pd.Series(pipe.named_steps['model'].coef_,
                        index=num_cols).sort_values()
print("\nFactors lowering next‑session score:")
print(coef_series.head(3))
print("\nBoosters raising next‑session score:")
print(coef_series.tail(3))
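A one‑standard‑deviation bump in sleep_quality, for instance, might lift tomorrow’s score by ~4 points, while a one‑standard‑deviation rise in resting_hr might shave off ~0.8 points. Because the features were standardized, these weights are per standard deviation, not per raw unit (per bpm, say); dividing a coefficient by its feature’s standard deviation expresses it in raw units.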
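Because the pipeline standardizes every feature, the fitted weights are in score points per standard deviation. A minimal sketch of converting them to raw units divides each weight by the `scale_` value stored on the fitted `StandardScaler`. The data here is synthetic, generated with known raw-unit effects (+3.0 per sleep_quality point, −0.8 per bpm) so the conversion can be checked; `demo_pipe` is a stand-in for the walkthrough's `pipe`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with known per-unit effects (assumed values for the demo).
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "sleep_quality": rng.integers(1, 6, n).astype(float),
    "resting_hr": rng.normal(55, 5, n),
})
y_demo = (80 + 3.0 * demo["sleep_quality"] - 0.8 * demo["resting_hr"]
          + rng.normal(0, 1, n))

cols = list(demo.columns)
demo_pipe = Pipeline([
    ("prep", ColumnTransformer([("scale", StandardScaler(), cols)])),
    ("model", LinearRegression()),
])
demo_pipe.fit(demo, y_demo)

# Fitted scaler inside the ColumnTransformer; scale_ holds each feature's std.
scaler = demo_pipe.named_steps["prep"].named_transformers_["scale"]
per_sigma = pd.Series(demo_pipe.named_steps["model"].coef_, index=cols)
per_unit = per_sigma / scaler.scale_  # points per raw unit (per point, per bpm)

print(pd.DataFrame({"per_sigma": per_sigma, "per_unit": per_unit}).round(2))
```

The per-sigma column is the fair one for ranking features against each other; the per-unit column is the one to quote when someone asks what one extra bpm costs.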
9. Persist the trained pipeline
Model persistence with joblib allows real‑time dashboards to load the .pkl, scale today’s data identically, and spit out an actionable score in milliseconds.
joblib.dump(pipe, "athlete_training_score_linreg.pkl")
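On the consuming side, a dashboard would load the file and call `predict` on a one-row frame with the same columns used in training. The sketch below is self-contained: it trains a small stand-in pipeline on synthetic data and writes it to a temporary path, so the file name, the data, and the example values for "today" are all assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_cols = ["training_hours", "avg_heart_rate", "sleep_quality",
                "protein_intake_g", "resting_hr", "vo2_max", "dayofweek"]

# Synthetic stand-in for the training data (made-up distributions).
rng = np.random.default_rng(7)
train = pd.DataFrame({
    "training_hours": rng.uniform(0.5, 3.0, 60),
    "avg_heart_rate": rng.normal(145, 10, 60),
    "sleep_quality": rng.integers(1, 6, 60),
    "protein_intake_g": rng.normal(110, 15, 60),
    "resting_hr": rng.normal(54, 4, 60),
    "vo2_max": rng.normal(57, 3, 60),
    "dayofweek": rng.integers(0, 7, 60),
})[feature_cols]
target = (70 - 5 * train["training_hours"] + 4 * train["sleep_quality"]
          + rng.normal(0, 2, 60))

demo_pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
demo_pipe.fit(train, target)

# Persist to a temporary location, then reload as a separate process would.
model_path = os.path.join(tempfile.mkdtemp(), "demo_linreg.pkl")
joblib.dump(demo_pipe, model_path)
loaded = joblib.load(model_path)

# One row of today's metrics, with columns in the training order.
today = pd.DataFrame([{
    "training_hours": 1.5, "avg_heart_rate": 148, "sleep_quality": 4,
    "protein_intake_g": 115, "resting_hr": 53, "vo2_max": 57.5, "dayofweek": 2,
}])[feature_cols]

score = float(loaded.predict(today)[0])
print(f"Predicted next-session score: {score:.1f}")
```

Since joblib files are pickle-based, they should be loaded only from trusted sources and with a scikit-learn version compatible with the one used for training.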
Summary
With little more than pandas and scikit‑learn, we transformed raw workload and recovery logs into an explainable training‑score forecaster. The linear model delivers:
- Instant feedback on expected readiness for tomorrow’s session.
- Transparent trade‑offs—showing exactly how sleep, nutrition, and workload tug performance up or down.
Keep this interpretable baseline as your compass; when you later add heart‑rate variability, GPS micro‑loads, or switch to gradient‑boosted trees, you’ll know precisely how much extra predictive punch the sophistication buys.