Worker Productivity Curve Prediction with Polynomial Regression in ML
Operations managers and HR analysts need to forecast individual worker productivity scores based on early‑week indicators—hours logged, task completion counts, collaboration events, and digital‑tool usage—before the week ends so that they can adjust staffing and support in real time. Empirical data show that productivity growth over the week follows a nonlinear curve: gains may plateau or even dip due to fatigue or task complexity. A simple linear model underfits these dynamics, while an unconstrained high‑degree polynomial overfits noise. By applying Polynomial Regression (linear regression on polynomially expanded features) with Ridge regularisation, we can capture smooth productivity curves and deliver reliable forecasts for proactive workforce management.
Dataset
Remote Worker Productivity Dataset
Step-by-Step Code Implementation
1. Libraries Required
import pandas as pd                # data manipulation
import numpy as np                 # numerical operations
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # enhanced visualisation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
2. Load Libraries & Data
import pandas as pd
# Adjust filename as needed
df = pd.read_csv("data/remote_worker_productivity.csv")
# Preview relevant columns
df[['day_of_week','hours_worked','tasks_completed','meetings_count',
'tool_usage_minutes','productivity_score']].head()
3. Exploratory Analysis
import seaborn as sns
import matplotlib.pyplot as plt
# Productivity vs hours shows curvature
sns.scatterplot(x='hours_worked', y='productivity_score', data=df, alpha=0.5)
plt.title("Hours Worked vs Productivity")
plt.xlabel("Hours Worked")
plt.ylabel("Productivity Score")
plt.show()
4. Feature Engineering & Target
PolynomialFeatures expands inputs to include squares and interactions (e.g., hours_worked², hours_worked×tasks_completed), capturing curvature and synergy effects in productivity growth.
# Use day‑of‑week as numeric (1=Mon…7=Sun)
df['day_num'] = df['day_of_week'].map({
'Monday':1,'Tuesday':2,'Wednesday':3,
'Thursday':4,'Friday':5,'Saturday':6,'Sunday':7
})
# Predictor matrix and target vector
X = df[['day_num','hours_worked','tasks_completed',
'meetings_count','tool_usage_minutes']]
y = df['productivity_score']
5. Build Polynomial Regression Pipeline
Ridge regression applies shrinkage to control overfitting from high‑dimensional polynomial terms.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipe = Pipeline([
('scale', StandardScaler()),
('poly', PolynomialFeatures(include_bias=False)),
('ridge', Ridge(random_state=42))
])
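The shrinkage effect can be seen on a small synthetic example. This is only a sketch: the data, the true coefficients, and the three alpha values are assumptions chosen for illustration, not part of the productivity dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption): three features with known linear effects plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

# Fit Ridge at increasing penalty strengths and record the coefficient norm
norms = []
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>6}: coefficients={np.round(model.coef_, 3)}")

# Larger alpha pulls the coefficient vector closer to zero
print("Coefficient norms:", np.round(norms, 3))
```

As alpha grows, the coefficient norm falls. This is the mechanism that keeps the many polynomial terms from fitting noise.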
6. Train/Test Split & Hyperparameter Search
- GridSearchCV tunes the polynomial degree (1–3) and Ridge α (10⁻³ to 10³) across 5‑fold CV, optimising for the lowest RMSE on held‑out folds.
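- StandardScaler standardises each input feature to zero mean and unit variance before the polynomial expansion, so Ridge's ℓ2 penalty is not dominated by features measured on larger scales (for example, tool_usage_minutes versus meetings_count).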
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
'poly__degree' : [1, 2, 3],
'ridge__alpha' : np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
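Beyond the single best setting, it is often useful to see how every degree and alpha combination performed. The sketch below runs the same kind of search on synthetic stand-in data (an assumption, so that the block runs on its own) and tabulates cv_results_. The names X_demo, y_demo, demo_pipe, and demo_gs are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): a quadratic signal plus noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = (5 * X_demo[:, 0] - 0.4 * X_demo[:, 0] ** 2
          + X_demo[:, 1] + rng.normal(scale=1.0, size=200))

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])
demo_gs = GridSearchCV(
    demo_pipe,
    {"poly__degree": [1, 2, 3], "ridge__alpha": np.logspace(-3, 3, 7)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_gs.fit(X_demo, y_demo)

# The scorer is negated RMSE, so flip the sign to read it as an error
results = pd.DataFrame(demo_gs.cv_results_)
results["cv_rmse"] = -results["mean_test_score"]
summary = results[["param_poly__degree", "param_ridge__alpha",
                   "cv_rmse", "std_test_score"]].sort_values("cv_rmse")
print(summary.head())
```

On the real data, replacing demo_gs with the fitted gs gives the same table. If several settings have a similar cv_rmse, the lower degree or larger alpha is usually the safer choice.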
7. Evaluate Model
from sklearn.metrics import mean_squared_error, r2_score
y_pred = gs.predict(X_test)
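# Take the square root of MSE explicitly; the squared=False argument
# is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))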
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f} points")
print(f"Test R² : {r2:.3f}")
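For readers who want to see what the two metrics measure, here is a small worked example that computes RMSE and R² by hand and checks them against scikit-learn. The four score pairs are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted productivity scores (illustrative values only)
y_true = np.array([70.0, 80.0, 90.0, 100.0])
y_hat = np.array([72.0, 78.0, 93.0, 97.0])

residuals = y_true - y_hat                        # [-2, 2, -3, 3]

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))    # sqrt(26 / 4) = sqrt(6.5)

# R²: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum(residuals ** 2)                   # 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 500
r2_manual = 1 - ss_res / ss_tot                   # 0.948

print(f"RMSE: {rmse_manual:.3f}")                 # RMSE: 2.550
print(f"R²  : {r2_manual:.3f}")                   # R²  : 0.948

# Cross-check against scikit-learn
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
```

RMSE is in the same units as the productivity score, so it reads as a typical prediction error in points. R² is the share of variance in the score that the model explains.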
8. Inspect Key Polynomial Coefficients
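Inspecting the coefficients shows which nonlinear or interaction terms most influence predicted productivity, which can guide operational interventions (e.g., finding a good balance of hours and tasks). Because the pipeline standardises the inputs before expanding them, each coefficient applies to the standardised features rather than to the raw units.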
# Retrieve feature names after polynomial expansion
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=X.columns)
# Get Ridge coefficients
coefs = gs.best_estimator_.named_steps['ridge'].coef_
# Present top 10 by absolute value
import pandas as pd
coef_series = pd.Series(coefs, index=feat_names)
top10 = coef_series.abs().sort_values(ascending=False).head(10)
plt.figure(figsize=(8,5))
top10.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Productivity")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
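The plot above ranks terms by absolute value, which hides whether a term raises or lowers the prediction. One way to keep the sign while still ranking by magnitude is sketched below. The coefficient values are hypothetical and stand in for coef_series.

```python
import pandas as pd

# Hypothetical coefficients (made up for illustration, not fitted results)
coef_demo = pd.Series({
    "hours_worked": 4.2,
    "hours_worked^2": -1.8,
    "tasks_completed": 2.5,
    "hours_worked tasks_completed": 0.9,
    "meetings_count": -0.3,
})

# Rank by magnitude, then look the original signed values back up in that order
order = coef_demo.abs().sort_values(ascending=False).index
top_signed = coef_demo.loc[order].head(3)
print(top_signed)
```

In this made-up example, a negative coefficient on hours_worked² alongside a positive one on hours_worked would point to diminishing returns from extra hours.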
Summary
By integrating polynomial feature engineering with Ridge regularisation in a concise pipeline, we achieve:
- Nonlinear forecasts of worker productivity, with accuracy quantified by test-set RMSE and R².
- Controlled complexity to avoid overfitting while capturing essential curve effects (diminishing returns, synergy).
- Interpretability: the most influential polynomial terms (e.g., hours_worked², hours_worked×tasks_completed) reveal actionable levers for workforce planning and real‑time support.