Worker Output Prediction with Ridge Regression in ML
Factory supervisors live and die by the daily output target stuck on the whiteboard. If a sewing line or assembly cell misses quota, late orders stack up, overtime balloons, and morale sinks; if it over‑produces, excess WIP ties up cash.
We want a pragmatic way to estimate tomorrow’s worker‑level output before the shift bell rings, using information that is already keyed into the MES at the end of each day:
- minutes spent on the production task (actual_productive_minutes)
- number of style changes handled (style_change_count)
- average WIP on the line
- day of week (some lines work half‑day on Fridays)
- team identifier (different group dynamics)
- quarter‑hour absenteeism rate
- operator skill band assigned by HR (A / B / C)
- whether today was an incentive‑bonus day
Because many of these predictors are correlated—lines with frequent style changes also suffer higher WIP—we’ll use Ridge regression, a linear model with L2 regularisation that keeps coefficients stable and readable.
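To make the stability claim concrete, here is a minimal sketch on synthetic data (not the garment dataset). It assumes two nearly identical predictors and solves the penalised least-squares problem in closed form with NumPy; the helper name `fit_linear` and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly identical predictors, mimicking style changes and WIP moving together.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)


def fit_linear(X, y, alpha):
    """Closed-form minimiser of ||y - Xw||^2 + alpha * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


w_ols = fit_linear(X, y, alpha=0.0)    # ordinary least squares
w_ridge = fit_linear(X, y, alpha=1.0)  # ridge with a modest penalty

print("OLS coefficients  :", w_ols)
print("Ridge coefficients:", w_ridge)
```

With predictors this collinear, the unpenalised solution is free to split the effect between the two columns in extreme ways, while the penalised one tends to share it more evenly. The sum of the two coefficients stays close to the true effect in both cases.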
Libraries Required
- pandas # reading & wrangling CSV
- numpy # numeric helpers
- matplotlib.pyplot # quick plots (optional)
- scikit‑learn # preprocessing, RidgeCV, metrics
- joblib # model persistence
Dataset Link
Productivity Prediction of Garment Employees
Step by Step Code Implementation
1. Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
import joblib
2. Load the Kaggle dataset
Factory metrics tend to be collinear: styles with a longer SMV often carry higher WIP as well. Under collinearity, plain OLS can produce large, unstable coefficients; Ridge shrinks them while keeping the model linear.
df = pd.read_csv("garments_worker_productivity.csv") # file after unzip
print(df.head())
Columns you should see
| field | sample | note |
| --- | --- | --- |
| productivity | 0.78 | label: fraction of quota hit |
| observation_date | 2015-01-04 | |
| quarter | Q1 | |
| department | sewing | |
| team | 3 | |
| no_of_workers | 35 | |
| no_of_style_change | 2 | |
| wip | 420 | |
| smv | 26.1 | |
| incentive | 0 | 1 = bonus day |
| idle_men | 2 | |
| idle_time | 92 | |
| actual_productive_minutes | 425 | |
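A mismatch between the column names used in this walkthrough and the headers in your copy of the CSV would only surface later as a `KeyError`. The sketch below checks up front; `REQUIRED_COLUMNS` and `missing_columns` are illustrative names, and the demo frame holds made-up values.

```python
import pandas as pd

# Column names as used in this walkthrough; your copy of the CSV may differ.
REQUIRED_COLUMNS = [
    "productivity", "actual_productive_minutes", "smv", "wip", "idle_time",
    "no_of_workers", "no_of_style_change", "incentive",
    "department", "team", "quarter",
]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


# Tiny stand-in frame (made-up values) that lacks most of the expected fields.
demo = pd.DataFrame({"productivity": [0.78], "smv": [26.1], "team": [3]})
print(missing_columns(demo))
```

If the list comes back non-empty for the real file, renaming the headers with `df.rename(columns=...)` before continuing is one option.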
The original target is the fraction of the target quota achieved. We'll convert that to an approximate piece count by multiplying by the productive minutes and dividing by the SMV (standard minute value, the minutes allotted per garment):
# approximate daily output in "garment equivalents"
df['daily_output_pcs'] = df['productivity'] * df['actual_productive_minutes'] / df['smv']
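As a quick sanity check, the same conversion can be applied to the sample row from the table. The function name `garment_equivalents` is illustrative, and the guard against a non-positive SMV is an added assumption rather than part of the original recipe.

```python
def garment_equivalents(productivity, productive_minutes, smv):
    """Approximate output as productivity x productive minutes / SMV."""
    if smv <= 0:
        raise ValueError("smv must be positive")
    return productivity * productive_minutes / smv


# Sample row from the table: 0.78 productivity, 425 productive minutes, SMV 26.1
sample_output = garment_equivalents(0.78, 425, 26.1)
print(f"{sample_output:.2f} garment equivalents")  # prints "12.70 garment equivalents"
```

So a row that looks strong in percentage terms corresponds to roughly a dozen garment equivalents, because each garment of that style is allotted about 26 minutes.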
3. Define features and label
Cross-validated Ridge picks, from a list of candidate values, the α that minimises validation error, so no manual tuning or guesswork is needed.
target = 'daily_output_pcs'
num_cols = ['actual_productive_minutes', 'no_of_style_change',
'wip', 'idle_time', 'no_of_workers', 'smv', 'incentive']
cat_cols = ['department', 'team', 'quarter']
X = df[num_cols + cat_cols]
y = df[target]
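The α selection can be pictured as a plain loop: fit Ridge once per candidate, score each with 5-fold cross-validation, and keep the best. The sketch below does this by hand on synthetic data. It only approximates what `RidgeCV` does internally (the scoring metric here is an assumption chosen for readability).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in data: 120 rows, 4 features, known linear signal plus noise.
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=120)

candidate_alphas = [0.1, 1.0, 10.0, 50.0, 100.0]
cv_mse = {}
for alpha in candidate_alphas:
    # scikit-learn maximises scores, so MSE is reported as a negative number.
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo,
                             cv=5, scoring="neg_mean_squared_error")
    cv_mse[alpha] = -scores.mean()

best_alpha = min(cv_mse, key=cv_mse.get)
print("mean CV MSE per alpha:", cv_mse)
print("selected alpha:", best_alpha)
```

If the selected value sits at either end of the candidate list, that is usually a hint to widen the grid in that direction.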
4. Pre‑processing + Ridge pipeline
Numeric features sit on very different scales (hundreds of minutes versus a handful of style changes). Scaling ensures Ridge's L2 penalty treats them evenly; one-hot encoding turns categories into binary flags with no assumed order.
preproc = ColumnTransformer([
('num', StandardScaler(), num_cols),
('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])
alphas = [0.1, 1.0, 10.0, 50.0, 100.0] # candidate regularisation strengths
ridge = RidgeCV(alphas=alphas, cv=5)
pipe = Pipeline([
('prep', preproc),
('model', ridge)
])
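Ridge cannot fit rows with missing numeric values, so if your copy of the data has blank cells (an assumption worth checking with `df.isna().sum()`), the numeric branch needs an imputation step before scaling. A minimal sketch on made-up rows, using median imputation as one possible choice:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; the NaN in 'wip' stands in for a blank cell in the CSV.
toy = pd.DataFrame({
    "wip": [420.0, np.nan, 610.0, 380.0, 505.0, 450.0],
    "smv": [26.1, 3.9, 22.5, 11.4, 30.1, 15.3],
    "department": ["sewing", "finishing", "sewing", "finishing", "sewing", "sewing"],
})
toy_y = np.array([12.7, 30.2, 14.1, 22.8, 10.9, 18.5])

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill gaps with the column median
    ("scale", StandardScaler()),
])
prep = ColumnTransformer([
    ("num", numeric, ["wip", "smv"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["department"]),
])
toy_pipe = Pipeline([
    ("prep", prep),
    ("model", RidgeCV(alphas=[0.1, 1.0, 10.0])),  # default leave-one-out CV
])

toy_pipe.fit(toy, toy_y)
toy_pred = toy_pipe.predict(toy)
print(toy_pred.round(1))
```

Keeping the imputer inside the pipeline means the medians are learned from the training rows only and reapplied unchanged at prediction time.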
5. Train‑test split & fitting
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
6. Performance check
pred = pipe.predict(X_test)
print(f"α chosen by CV : {pipe.named_steps['model'].alpha_}")
print(f"R² on hold‑out : {r2_score(y_test, pred):.3f}")
print(f"MAE on hold‑out : {mean_absolute_error(y_test, pred):.2f} pieces")
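An MAE figure is easier to judge next to a naive reference. The sketch below compares Ridge with a predictor that always returns the training mean, using synthetic data with made-up coefficients. The same comparison could be run against the real hold-out set.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic stand-in data with a genuine linear signal.
X_demo = rng.normal(size=(300, 3))
y_demo = 10 + X_demo @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline that ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_model = mean_absolute_error(y_te, model.predict(X_te))
print(f"baseline MAE: {mae_baseline:.2f}")
print(f"ridge MAE   : {mae_model:.2f}")
```

If the fitted model barely beats the mean predictor on real data, the features carry little usable signal for that target.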
7. What really drives output?
onehot = pipe.named_steps['prep'].named_transformers_['cat']
ohe_names = onehot.get_feature_names_out(cat_cols)
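# ColumnTransformer emits the numeric block first, then the one-hot block,
# so the names must follow the same order to line up with coef_
feature_names = np.concatenate([num_cols, ohe_names])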
coefs = pd.Series(pipe.named_steps['model'].coef_,
index=feature_names).sort_values()
print("\nTop positive drivers of output:")
print(coefs.tail(7))
print("\nTop negative drivers of output:")
print(coefs.head(7))
8. Persist for tomorrow’s scheduling tool
joblib.dump(pipe, "ridge_worker_output.pkl")
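On the consuming side, the scheduling tool would load the file and call `predict`. The round trip is sketched below with a stand-in Ridge model on synthetic data and a hypothetical temporary file name. In practice the object being saved would be the fitted pipeline.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Stand-in model on synthetic data; in practice this would be the fitted pipeline.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=50)
model = Ridge(alpha=1.0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored object should reproduce the original predictions.
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("round trip preserved predictions:", same)
```

Pickle-based files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.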
Summary
By piping simple preprocessing into Ridge regression, we’ve produced a transparent, production‑ready forecaster for next‑day worker output:
- Practical payoff: planners can see tomorrow’s likely shortfall or surplus before the shift schedule is finalised.
- Explainability: each scaled numeric driver shows its pieces-per-σ impact, and each category flag shows its offset in pieces; no black-box mysteries.
- Strong baseline: tree ensembles or time‑series nets must beat this MAE while still telling production managers a coherent story.