Worker Output Prediction with Ridge Regression in ML

Factory supervisors live and die by the daily output target stuck on the whiteboard. If a sewing line or assembly cell misses quota, late orders stack up, overtime balloons, and morale sinks; if it over‑produces, excess WIP ties up cash.

We want a pragmatic way to estimate tomorrow’s worker‑level output before the shift bell rings, using information that is already keyed into the MES at the end of each day:

minutes spent on the production task (actual_productive_minutes)
number of style changes handled (style_change_count)
average WIP on the line
day of week (some lines work half‑day on Fridays)
team identifier (different group dynamics)
quarter‑hour absenteeism rate
operator skill band assigned by HR (A / B / C)
whether today was an incentive‑bonus day

Because many of these predictors are correlated—lines with frequent style changes also suffer higher WIP—we’ll use Ridge regression, a linear model with L2 regularisation that keeps coefficients stable and readable.
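To make the stability claim concrete, here is a minimal sketch on synthetic data (not the factory dataset). Two nearly identical predictors stand in for style changes and WIP; the variable names and noise levels are illustrative assumptions. Plain least squares is free to split the shared effect into large offsetting coefficients, while Ridge keeps the coefficient vector small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two almost perfectly correlated predictors (synthetic stand-ins)
style_changes = rng.normal(size=n)
wip = style_changes + rng.normal(scale=0.01, size=n)
X_demo = np.column_stack([style_changes, wip])

# Only the shared signal drives the response
y_demo = 3.0 * style_changes + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Size of each coefficient vector
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print("OLS coefficients  :", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Coefficient norms :", ols_norm, ridge_norm)
```

The Ridge coefficient vector is never longer than the least-squares one, which is what keeps the drivers readable when predictors overlap.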

Libraries Required

pandas # reading & wrangling CSV
numpy # numeric helpers
matplotlib.pyplot # quick plots (optional)
scikit‑learn # preprocessing, RidgeCV, metrics
joblib # model persistence

Dataset Link

Productivity Prediction of Garment Employees

Step by Step Code Implementation

1. Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
import joblib

2. Load the Kaggle dataset

Factory metrics are often collinear (for example, lines running styles with a longer SMV also tend to carry higher WIP). Under collinearity, plain OLS can produce large, unstable coefficients; Ridge shrinks them, giving a robust model that is still linear.

df = pd.read_csv("garments_worker_productivity.csv")   # file after unzip
print(df.head())

Columns you should see

field	sample
productivity	0.78 (label – fraction of quota hit)
observation_date	2015‑01‑04
quarter	Q1
department	sewing
team	3
no_of_workers	35
no_of_style_change	2
wip	420
smv	26.1
incentive	0 (1 = bonus day)
idle_men	2
idle_time	92
actual_productive_minutes	425
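Column names can differ between published copies of a dataset, so it may be worth checking the header before going further. The sketch below assumes the field names from the table above; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical helpers written for this illustration, and the demo frame is a stand-in for the real `df`.

```python
import pandas as pd

# Field names as listed in the table above (an assumption about your CSV)
EXPECTED_COLUMNS = [
    "productivity", "observation_date", "quarter", "department", "team",
    "no_of_workers", "no_of_style_change", "wip", "smv", "incentive",
    "idle_men", "idle_time", "actual_productive_minutes",
]

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the frame."""
    return [col for col in expected if col not in frame.columns]

# Stand-in frame with only three of the expected columns
demo = pd.DataFrame(columns=["productivity", "team", "smv"])
missing = missing_columns(demo)
print("Missing columns:", missing)
```

If the list is not empty for your file, rename the columns (or adjust the names used in the later steps) before continuing.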

The original label is the fraction of the target quota achieved. We’ll convert it to an approximate piece count by multiplying by the productive minutes and dividing by the SMV (standard minute value, the minutes allotted per garment):

# approximate daily output in “garment equivalents”
df['daily_output_pcs'] = df['productivity'] * df['actual_productive_minutes'] / df['smv']
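As a quick sanity check of the arithmetic, the same formula can be applied to a few hand-made rows. The values below are invented for illustration; the guard that turns a non-positive SMV into a missing value is an added assumption, not part of the original step.

```python
import pandas as pd

sample = pd.DataFrame({
    "productivity": [0.78, 0.50, 0.90],
    "actual_productive_minutes": [425, 480, 400],
    "smv": [26.1, 12.0, 0.0],   # last row has an invalid SMV on purpose
})

# Replace non-positive SMV values with NaN so the division cannot yield inf
safe_smv = sample["smv"].where(sample["smv"] > 0)

sample["daily_output_pcs"] = (
    sample["productivity"] * sample["actual_productive_minutes"] / safe_smv
)
print(sample.round(2))
```

The first row works out to 0.78 × 425 / 26.1 ≈ 12.70 garment equivalents, the second to exactly 20, and the invalid row stays missing so it can be dropped or inspected.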

3. Define features and label

Cross‑validated Ridge automatically picks, from a list of candidate values, the α with the lowest validation error, so there is no manual tuning or guesswork.
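A minimal sketch of that selection on synthetic data (the data and the name `candidate_alphas` are illustrative assumptions): `RidgeCV` scores each candidate with 5‑fold cross‑validation and stores the winner in `alpha_`.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)

# Synthetic regression problem: four features, one of them irrelevant
X_demo = rng.normal(size=(120, 4))
true_coefs = np.array([2.0, -1.0, 0.5, 0.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=1.0, size=120)

candidate_alphas = [0.1, 1.0, 10.0, 50.0, 100.0]
search = RidgeCV(alphas=candidate_alphas, cv=5).fit(X_demo, y_demo)

print("Chosen alpha:", search.alpha_)
```

The search only considers the values supplied. If the chosen α sits at either end of the list, it may be worth widening the grid and refitting.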

target   = 'daily_output_pcs'

num_cols = ['actual_productive_minutes', 'no_of_style_change',
            'wip', 'idle_time', 'no_of_workers', 'smv', 'incentive']

cat_cols = ['department', 'team', 'quarter']

X = df[num_cols + cat_cols]
y = df[target]

4. Pre‑processing + Ridge pipeline

Numeric features vary wildly (minutes vs style changes). Scaling ensures Ridge’s L2 penalty treats them evenly; categories turn into binary flags without assumed order.

preproc = ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

alphas = [0.1, 1.0, 10.0, 50.0, 100.0]        # candidate regularisation strengths
ridge  = RidgeCV(alphas=alphas, cv=5)

pipe = Pipeline([
        ('prep',  preproc),
        ('model', ridge)
])

5. Train‑test split & fitting

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
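A shuffled split mixes earlier and later days in both sets. When the goal is to forecast future days, a chronological hold-out is a possible alternative: train on the earlier dates and test on the most recent ones. The sketch below is one way to do that; `chronological_split` is a hypothetical helper, and the demo frame borrows the `observation_date` column name from the table in step 2.

```python
import pandas as pd

def chronological_split(frame, date_col, test_fraction=0.2):
    """Sort by date and hold out the most recent rows as the test set."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    n_test = int(round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Invented rows, deliberately out of date order
demo = pd.DataFrame({
    "observation_date": pd.to_datetime(
        ["2015-01-05", "2015-01-01", "2015-01-03", "2015-01-02", "2015-01-04"]
    ),
    "daily_output_pcs": [14.0, 11.0, 12.5, 13.0, 12.0],
})

train_part, test_part = chronological_split(demo, "observation_date")
print(train_part)
print(test_part)
```

Comparing the error from both kinds of split gives a feel for how much the shuffled estimate benefits from seeing neighbouring days.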

6. Performance check

pred = pipe.predict(X_test)

print(f"α chosen by CV  : {pipe.named_steps['model'].alpha_}")
print(f"R² on hold‑out  : {r2_score(y_test, pred):.3f}")
print(f"MAE on hold‑out : {mean_absolute_error(y_test, pred):.2f} pieces")
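An MAE is easier to judge next to a naive reference. The sketch below, on synthetic data with invented coefficients, compares a Ridge model with scikit-learn's `DummyRegressor`, which always predicts the training mean. The same comparison could be run on the real `X_train` / `X_test` split.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data: output depends on two of three features plus noise
X_demo = rng.normal(size=(300, 3))
y_demo = 12 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_model = mean_absolute_error(y_te, model.predict(X_te))

print(f"MAE, predict-the-mean baseline: {mae_baseline:.2f}")
print(f"MAE, Ridge model              : {mae_model:.2f}")
```

If the fitted pipeline does not clearly beat the mean baseline on the hold-out set, the features carry little usable signal and the coefficients should be read with caution.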

7. What really drives output?

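# ColumnTransformer emits columns in the order its transformers are listed:
# scaled numeric columns first, then the one-hot columns
onehot = pipe.named_steps['prep'].named_transformers_['cat']
ohe_names = onehot.get_feature_names_out(cat_cols)
feature_names = np.concatenate([num_cols, ohe_names])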

coefs = pd.Series(pipe.named_steps['model'].coef_,
                  index=feature_names).sort_values()

print("\nTop positive drivers of output:")
print(coefs.tail(7))
print("\nTop negative drivers of output:")
print(coefs.head(7))

8. Persist for tomorrow’s scheduling tool

joblib.dump(pipe, "ridge_worker_output.pkl")
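The saved file can be read back with `joblib.load` and used for prediction without refitting. The sketch below shows the round trip on a tiny stand-in pipeline; the training values, the file name `demo_ridge.pkl`, and the `tomorrow` row are all invented for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the fitted pipeline from the tutorial
train = pd.DataFrame({
    "actual_productive_minutes": [400.0, 425.0, 450.0, 480.0],
    "smv": [26.1, 22.0, 15.0, 12.0],
})
output = np.array([11.0, 14.0, 22.0, 30.0])
demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
]).fit(train, output)

# Save to a temporary folder, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_ridge.pkl")
    joblib.dump(demo_pipe, path)
    reloaded = joblib.load(path)

# New rows must carry the same column names the pipeline was fitted on
tomorrow = pd.DataFrame({"actual_productive_minutes": [440.0], "smv": [18.0]})
same = np.allclose(demo_pipe.predict(tomorrow), reloaded.predict(tomorrow))
print("Reloaded model gives identical predictions:", same)
```

Pickled models should be loaded with the same scikit-learn version that saved them, and only from sources you trust, since unpickling can execute arbitrary code.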

Summary

By piping simple preprocessing into Ridge regression, we’ve produced a transparent, production‑ready forecaster for next‑day worker output:

Practical payoff: planners can see tomorrow’s likely shortfall or surplus before the shift schedule is finalised.
Explainability: each scaled numeric driver shows its pieces‑per‑σ impact and each category flag shows its shift in pieces; no black‑box mysteries.
Strong baseline: tree ensembles or time‑series nets must beat this MAE while still telling production managers a coherent story.
