Athlete Training Score Prediction using Linear Regression in ML

Sports scientists regularly collect physiology and workload metrics—training hours, average heart rate, sleep quality, VO₂ max—to gauge how well an athlete is adapting to a programme. A quick, numeric “training score” (e.g., 0–100) makes it easy to spot under‑ or over‑training at a glance.

In this project, we develop a linear regression baseline that predicts an athlete’s next training session score based on yesterday’s workload and recovery data. While high-performance teams eventually graduate to non-linear or longitudinal models, a transparent linear fit reveals the first-order levers that raise or sink readiness.

Libraries Required

pandas # data wrangling
numpy # numerical helpers
matplotlib.pyplot # sanity‑check visuals
scikit‑learn # preprocessing, model, metrics
joblib # persist the trained pipeline

Dataset Link

Multimodal Athlete Performance Sensor Dataset

Step-by-Step Code Implementation

Why linear regression first? As a first approximation, training score tends to rise with restorative factors (sleep, protein) and fall with fatigue markers (long hours, high resting HR). A straight‑line fit puts a number on each trade‑off, e.g., “one extra hour’s sleep offsets 45 minutes of extra workload.”
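
To make the trade‑off arithmetic concrete, here is a minimal sketch. The two coefficients are hypothetical values in raw units (score points per hour), chosen only for illustration and not fitted from the dataset; note also that the dataset records sleep quality on a 1–5 scale rather than hours of sleep.

```python
# Hypothetical raw-unit coefficients (score points per hour), chosen only
# to illustrate the arithmetic; they are not fitted from the dataset.
beta_sleep_hours = 3.0      # +3 points per extra hour of sleep
beta_training_hours = -4.0  # -4 points per extra hour of workload

# Workload (in hours) whose penalty equals the gain from one extra hour of sleep.
offset_hours = beta_sleep_hours / abs(beta_training_hours)
offset_minutes = offset_hours * 60

print(f"1 h extra sleep offsets {offset_minutes:.0f} min of extra workload")
```

The ratio of two coefficients is what turns a fitted line into a statement a coach can act on.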

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the data

df = pd.read_csv("athlete_performance.csv")
print(df.head())

Example columns

column	description
session_date	ISO date of training session
training_hours	total on‑feet time (h)
avg_heart_rate	average heart rate (bpm)
sleep_quality	1–5 subjective score
protein_intake_g	grams in the previous 24 h
resting_hr	morning resting heart rate (bpm)
vo2_max	mL kg⁻¹ min⁻¹
training_score_next	label – coach‑graded readiness 0–100
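
Before going further, it may help to confirm that the file really has the columns listed in the table above and that the bounded fields stay in range. The sketch below is one way to do that; the three rows are made‑up values standing in for the real CSV.

```python
import pandas as pd

# Column names follow the table above; the three rows are made-up values
# standing in for pd.read_csv("athlete_performance.csv").
df = pd.DataFrame({
    "session_date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "training_hours": [1.5, 2.0, 0.5],
    "avg_heart_rate": [142, 155, 120],
    "sleep_quality": [4, 3, 5],
    "protein_intake_g": [120, 95, 130],
    "resting_hr": [52, 56, 50],
    "vo2_max": [55.1, 54.8, 55.3],
    "training_score_next": [78, 64, 85],
})

expected = ["session_date", "training_hours", "avg_heart_rate", "sleep_quality",
            "protein_intake_g", "resting_hr", "vo2_max", "training_score_next"]

# Columns the rest of the walkthrough relies on but the file does not provide.
missing_cols = [c for c in expected if c not in df.columns]

# Rows whose values fall outside the documented ranges.
bad_sleep = df[~df["sleep_quality"].between(1, 5)]
bad_score = df[~df["training_score_next"].between(0, 100)]

print("missing columns:", missing_cols)
print("out-of-range rows:", len(bad_sleep) + len(bad_score))
```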

3. Feature engineering

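A calendar cue (dayofweek) helps capture micro-cycles (e.g., Sunday as a rest day) with zero additional data feeds. Note that feeding it to the model as a single number from 0 to 6 only lets the regression fit a straight-line trend across the week; isolating the effect of one specific day requires one-hot encoding.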

# convert date and derive calendar cues (optional)
df['session_date'] = pd.to_datetime(df['session_date'])
df['dayofweek']    = df['session_date'].dt.dayofweek  # 0‑Mon … 6‑Sun

# drop rows with missing critical cells
core = ['training_hours', 'avg_heart_rate', 'sleep_quality',
        'protein_intake_g', 'resting_hr', 'vo2_max',
        'training_score_next']
df = df.dropna(subset=core)
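
Dropping rows silently can hide a data-collection problem, so it can be worth counting what is lost. A small sketch on made‑up rows (only three of the core columns are used to keep it short):

```python
import numpy as np
import pandas as pd

# Made-up rows with a few gaps, standing in for the real file.
df = pd.DataFrame({
    "training_hours": [1.5, np.nan, 0.5, 2.0],
    "sleep_quality": [4, 3, np.nan, 5],
    "training_score_next": [78, 64, 85, np.nan],
})
core = ["training_hours", "sleep_quality", "training_score_next"]

# Missing cells per column, so it is visible which field drives the loss.
missing_per_col = df[core].isna().sum()

rows_before = len(df)
df_clean = df.dropna(subset=core)
rows_dropped = rows_before - len(df_clean)

print(missing_per_col)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

If one column accounts for most of the dropped rows, imputing it or leaving it out of the model may be preferable to discarding the sessions.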

4. Define predictors & label

Once the model is fitted, the coefficients on these predictors highlight the key levers: if training_hours carries the heaviest negative weight, coaches may cap volume; if sleep_quality yields a positive coefficient, recovery protocols receive greater emphasis.

num_cols = ['training_hours', 'avg_heart_rate', 'sleep_quality',
            'protein_intake_g', 'resting_hr', 'vo2_max', 'dayofweek']

X = df[num_cols]
y = df['training_score_next']

5. Pre‑processing & model pipeline

Standard scaling puts every numeric variable on equal footing, so coefficients read as score points per 1 σ change—a clean way to brief coaches.

preproc = ColumnTransformer([
        ('scale', StandardScaler(), num_cols)
])   # only numeric features, scaled for coefficient comparison

linreg = LinearRegression()

pipe = Pipeline(steps=[
        ('prep',  preproc),
        ('model', linreg)
])

6. Train‑test split & training

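Shuffling before the split is acceptable only if each row can be treated as an independent session. Because the label is the next session's score, consecutive rows from the same athlete are likely to be correlated; when that matters, split chronologically (train on earlier sessions, test on later ones) or, if an athlete identifier is available, by athlete instead of shuffling.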

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
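
If sessions turn out to depend on each other over time, a chronological hold-out is a simple alternative to the shuffled split. A minimal sketch on synthetic data (the values are made up; only the column names follow the walkthrough):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100
# Synthetic stand-in for the session log; values are made up.
df = pd.DataFrame({
    "session_date": pd.date_range("2024-01-01", periods=n, freq="D"),
    "training_hours": rng.uniform(0.5, 3.0, n),
    "training_score_next": rng.uniform(40, 95, n),
}).sample(frac=1.0, random_state=0)  # shuffle rows to mimic an unsorted file

# Sort by date, then hold out the most recent 20 % of sessions.
df = df.sort_values("session_date").reset_index(drop=True)
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

X_train, y_train = train[["training_hours"]], train["training_score_next"]
X_test, y_test = test[["training_hours"]], test["training_score_next"]

# Every training session precedes every test session.
print(train["session_date"].max(), "<", test["session_date"].min())
```

For cross-validation along the same lines, scikit‑learn's TimeSeriesSplit produces successive train/test folds that always test on later rows.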

7. Evaluation

Performance metrics – R² tells us how much variance we capture; MAE in points answers “How far off is our average guess?”—handy when 5 points equals “one difficulty zone.”

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} score points")
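
To show what the two metrics measure, the sketch below recomputes them by hand on five made‑up scores and compares the MAE with a naive "always predict the mean" baseline.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true and predicted scores, purely to show the arithmetic.
y_true = np.array([70.0, 82.0, 65.0, 90.0, 78.0])
y_hat = np.array([72.0, 80.0, 70.0, 85.0, 79.0])

# MAE: mean of the absolute errors, in score points.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Naive baseline: always predict the mean score (here the mean of y_true itself;
# in practice the training-set mean would be used).
mae_baseline = np.mean(np.abs(y_true - y_true.mean()))

print(f"MAE manual  : {mae_manual:.2f}  sklearn: {mean_absolute_error(y_true, y_hat):.2f}")
print(f"R² manual   : {r2_manual:.3f}  sklearn: {r2_score(y_true, y_hat):.3f}")
print(f"MAE baseline: {mae_baseline:.2f}")
```

A model is only earning its keep if its MAE sits clearly below the baseline's.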

8. Interpret coefficients

coef_series = pd.Series(pipe.named_steps['model'].coef_,
                        index=num_cols).sort_values()

print("\nFactors lowering next‑session score:")
print(coef_series.head(3))

print("\nBoosters raising next‑session score:")
print(coef_series.tail(3))

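A one‑standard‑deviation bump in sleep_quality, for instance, might lift tomorrow’s score by ~4 points, while a one‑standard‑deviation rise in resting HR might shave off ~0.8 points. Because the features are standardized, these are per‑σ effects; to express a coefficient per raw unit (e.g., per bpm), divide it by that feature’s standard deviation.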
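
The conversion from per‑σ coefficients to raw units can be sketched as follows. The data are synthetic, and the "true" effects (+4 points per sleep‑quality point, −0.8 points per bpm) are invented so the result can be checked.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 200
# Synthetic data; the raw-unit effects below are invented for the demonstration.
X = pd.DataFrame({
    "sleep_quality": rng.integers(1, 6, n).astype(float),
    "resting_hr": rng.normal(55, 5, n),
})
y = 100 + 4.0 * X["sleep_quality"] - 0.8 * X["resting_hr"] + rng.normal(0, 1, n)

pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X, y)

coef_per_sigma = pd.Series(pipe.named_steps["model"].coef_, index=X.columns)
sigma = pd.Series(pipe.named_steps["scale"].scale_, index=X.columns)

# Points per raw unit = points per sigma / (raw units per sigma).
coef_per_unit = coef_per_sigma / sigma
print(pd.DataFrame({"per_sigma": coef_per_sigma, "per_unit": coef_per_unit}).round(2))
```

In the walkthrough's pipeline the scaler sits inside the ColumnTransformer, so its standard deviations would be read from pipe.named_steps['prep'].named_transformers_['scale'].scale_ instead.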

9. Persist the trained pipeline

Model persistence with joblib allows real‑time dashboards to load the .pkl, scale today’s data identically, and spit out an actionable score in milliseconds.

joblib.dump(pipe, "athlete_training_score_linreg.pkl")
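
On the consuming side, a dashboard would load the file and score a new row. The sketch below is self-contained: it trains a small stand‑in pipeline on synthetic data with three invented columns, saves it to a temporary folder, reloads it, and predicts for one hypothetical set of readings.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
cols = ["training_hours", "sleep_quality", "resting_hr"]
# Synthetic training data, a stand-in for the fitted pipeline in the walkthrough.
X = pd.DataFrame({
    "training_hours": rng.uniform(0.5, 3.0, 50),
    "sleep_quality": rng.integers(1, 6, 50).astype(float),
    "resting_hr": rng.normal(55, 5, 50),
})
y = 80 - 5 * X["training_hours"] + 3 * X["sleep_quality"] - 0.5 * X["resting_hr"]

pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())]).fit(X, y)

# Save and reload, as a dashboard process would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "athlete_training_score_linreg.pkl")
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

# Today's readings must use the same column names and order as at fit time.
today = pd.DataFrame(
    [{"training_hours": 1.5, "sleep_quality": 4.0, "resting_hr": 54.0}]
)[cols]
score = float(loaded.predict(today)[0])
print(f"predicted next-session score: {score:.1f}")
```

Because the scaler is stored inside the pipeline, the new row is standardized with the training statistics automatically.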

Summary

With little more than pandas and scikit‑learn, we transformed raw workload and recovery logs into an explainable training‑score forecaster. The linear model delivers:

Instant feedback on expected readiness for tomorrow’s session.
Transparent trade‑offs—showing exactly how sleep, nutrition, and workload tug performance up or down.

Keep this interpretable baseline as your compass; when you later add heart‑rate variability, GPS micro‑loads, or switch to gradient‑boosted trees, you’ll know precisely how much extra predictive punch the sophistication buys.
