Assembly Line Efficiency Prediction using Polynomial Regression in ML
Manufacturing engineers and operations managers need to forecast the efficiency of an assembly line, measured here as the share of defect-free units produced, from early indicators such as machine downtime, throughput rate, number of operators, and maintenance hours, before full-shift data are available. Real-world observations show that efficiency responds nonlinearly to downtime (small reductions yield significant gains up to a point), to operator count (diminishing returns beyond optimal staffing), and to maintenance hours (too little or too much both hurt). A simple linear model underfits these curves; a high-degree polynomial without regularisation overfits to noise. By applying Polynomial Regression with Ridge (ℓ²) regularisation to a set of engineered numeric features, we can capture smooth efficiency trends and deliver stable, interpretable predictions for proactive resource planning.
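To make the underfitting argument concrete, here is a minimal sketch on synthetic data (not the Bosch data). It assumes a made-up relationship in which efficiency peaks at a moderate number of maintenance hours; the variable names and the numbers in the formula are illustrative assumptions, not measurements.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, illustrative data only: efficiency peaks at about 1.5
# maintenance hours and falls off on either side, plus a little noise.
rng = np.random.default_rng(0)
maintenance_hours = rng.uniform(0, 3, size=200)
efficiency = (
    0.95
    - 0.05 * (maintenance_hours - 1.5) ** 2
    + rng.normal(0, 0.01, size=200)
)
X_demo = maintenance_hours.reshape(-1, 1)

models = {
    # A straight line cannot follow an inverted-U shape.
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    # Degree-2 expansion with a mild Ridge penalty can.
    "poly2_ridge": make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=2, include_bias=False),
        Ridge(alpha=1.0),
    ),
}

# Mean 5-fold cross-validated R^2 for each model.
cv_r2 = {
    name: cross_val_score(model, X_demo, efficiency, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in cv_r2.items():
    print(f"{name}: mean CV R^2 = {score:.3f}")
```

On data shaped like this, the linear model should score near zero while the regularised quadratic model explains most of the variance, which is the gap the rest of this walkthrough aims to close.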
Libraries Required
import pandas as pd # data loading & manipulation
import numpy as np # numerical operations
import matplotlib.pyplot as plt # plotting
import seaborn as sns # enhanced visualization
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
Dataset
Bosch Production Line Performance
Step-by-Step Code Implementation
Load Data & Libraries
We load part-level records from train_numeric.csv, which holds both the numeric measurements and the pass/fail label (Response, where 1 marks a failed part). train_date.csv contains only station timestamps, so it is not needed for the label. We then group by a mock LineID to compute session-level efficiency (fraction passed) and throughput (parts processed).
import pandas as pd
import numpy as np
# Load only the Id and Response columns; train_numeric.csv is very wide (adjust path)
df = pd.read_csv("data/train_numeric.csv", usecols=['Id', 'Response'], nrows=500000)
# Sample down for speed
df = df.sample(100000, random_state=42)
# Flag each part as passed (Response == 0) or failed (Response == 1)
df['Passed'] = (df['Response'] == 0).astype(int)
Feature Engineering & Aggregation
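In practice, you would extract actual downtime, operator counts, and maintenance logs from plant systems. Here these three features are simulated as random draws for demonstration purposes, so they are statistically independent of the target; expect the fitted model to show little real predictive power until they are replaced with real measurements.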
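# For simplicity, aggregate at a mock line-session level by bucketing Ids
# Mock assumption: each block of 1,000 consecutive Ids stands in for one line session
# (dividing by 1,000,000 leaves only a handful of groups, too few to fit or cross-validate a model)
df['LineID'] = (df['Id'] // 1000).astype(int)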
# Group by LineID to get features:
# - efficiency = fraction of parts passed per session
# - throughput = number of parts per session
# - downtime, operator count, maintenance hours (mock features, random draws)
agg = df.groupby('LineID').agg({
'Passed': ['mean','count']
})
agg.columns = ['Efficiency','Throughput']
# Mock additional features
np.random.seed(42)
agg['Downtime_Hours'] = np.random.uniform(0, 2, size=len(agg))
agg['Operator_Count'] = np.random.randint(5, 15, size=len(agg))
agg['Maintenance_Hours'] = np.random.uniform(0, 3, size=len(agg))
agg = agg.reset_index()
Define Features & Target
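The four numeric inputs and the target are defined here. In the pipeline that follows, PolynomialFeatures expands these inputs into squared and interaction terms (e.g., Throughput², Throughput×Downtime_Hours), capturing nonlinear returns and trade-offs.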
X = agg[['Throughput','Downtime_Hours','Operator_Count','Maintenance_Hours']]
y = agg['Efficiency'] # fraction passed per session
Build Polynomial Regression Pipeline
- Standard Scaler: Z-scores each feature so Ridge's ℓ² penalty treats them equally, regardless of original scale.
- Ridge Regression: Applies an ℓ² penalty (controlled by alpha) to shrink high‑order coefficients and prevent overfitting.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipe = Pipeline([
('scale', StandardScaler()),
('poly', PolynomialFeatures(include_bias=False)),
('ridge', Ridge(random_state=42))
])
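To see the shrinkage effect of alpha in isolation, the following sketch fits the same kind of pipeline on synthetic data at three penalty strengths and reports the size of the coefficient vector. The data-generating formula and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: the target depends on the square of the first column only.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 2, size=(150, 2))
y_demo = 0.9 - 0.1 * X_demo[:, 0] ** 2 + rng.normal(0, 0.02, size=150)

coef_norms = {}
for alpha in [0.01, 1.0, 100.0]:
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=3, include_bias=False),
        Ridge(alpha=alpha),
    )
    model.fit(X_demo, y_demo)
    # model[-1] is the fitted Ridge step; take the L2 norm of its coefficients.
    coef_norms[alpha] = float(np.linalg.norm(model[-1].coef_))
    print(f"alpha={alpha:>6}: ||coef||_2 = {coef_norms[alpha]:.4f}")
```

The norm shrinks as alpha grows: a larger penalty pulls all polynomial coefficients toward zero, trading some fit on the training data for a smoother curve.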
Train/Test Split & Hyperparameter Search
- GridSearchCV: Tunes polynomial degree (1–3) and alpha (10⁻³…10³) via 5‑fold CV, optimising for lowest RMSE on held‑out folds.
from sklearn.model_selection import train_test_split, GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
'poly__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
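Beyond the single best setting, it can help to look at the whole grid. The sketch below runs the same search on a synthetic stand-in dataset (the columns and target formula are assumptions) and pivots cv_results_ into a table of cross-validated RMSE by alpha and degree.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with a curved dependence on downtime.
rng = np.random.default_rng(2)
X_demo = pd.DataFrame({
    "Downtime_Hours": rng.uniform(0, 2, 200),
    "Operator_Count": rng.integers(5, 15, 200),
})
y_demo = 0.95 - 0.04 * X_demo["Downtime_Hours"] ** 2 + rng.normal(0, 0.01, 200)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])
demo_gs = GridSearchCV(
    demo_pipe,
    {"poly__degree": [1, 2, 3], "ridge__alpha": np.logspace(-3, 3, 7)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_gs.fit(X_demo, y_demo)

# One row per grid point; flip the sign so lower RMSE is better.
results = pd.DataFrame(demo_gs.cv_results_)
results["alpha"] = results["param_ridge__alpha"].astype(float)
results["degree"] = results["param_poly__degree"].astype(int)
results["cv_rmse"] = -results["mean_test_score"]

table = results.pivot(index="alpha", columns="degree", values="cv_rmse")
print(table.round(4))
```

A flat region in this table suggests the result is not sensitive to the exact alpha; a sharp minimum at the edge of the grid suggests widening the search range.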
Evaluate Model
y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred)) # RMSE; the squared=False argument is deprecated in recent scikit-learn
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.3f} (efficiency fraction)")
print(f"Test R² : {r2:.3f}")
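RMSE and R² are easier to judge next to a naive baseline. As a sketch on synthetic data (the formula is an assumption), the block below scores a DummyRegressor that always predicts the training mean; a useful model should beat these numbers on the same test split.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the real X and y in practice.
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 2, size=(300, 1))
y_demo = 0.95 - 0.04 * X_demo[:, 0] ** 2 + rng.normal(0, 0.01, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: ignore the features and always predict the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
base_pred = baseline.predict(X_te)

baseline_rmse = float(np.sqrt(mean_squared_error(y_te, base_pred)))
baseline_r2 = r2_score(y_te, base_pred)
print(f"Baseline RMSE: {baseline_rmse:.4f}")
print(f"Baseline R^2 : {baseline_r2:.4f}")
```

A constant prediction cannot score above zero on R², so a test R² close to zero for the tuned pipeline means it is doing no better than the mean.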
Inspect Key Polynomial Coefficients
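This step ranks the polynomial and interaction terms by absolute coefficient size to show which ones most strongly influence predicted efficiency, pointing to candidate levers for process improvement (e.g., a large Downtime_Hours^2 term suggests downtime has a curved effect). Because the ranking uses absolute values, check the sign of each coefficient separately before acting on it.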
# Retrieve feature names after polynomial expansion
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=X.columns)
coefs = gs.best_estimator_.named_steps['ridge'].coef_
import pandas as pd
important = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False).head(10)
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
important.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Assembly Efficiency")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
Summary
By integrating polynomial feature engineering with Ridge regularisation in a streamlined pipeline, this approach provides:
1. Flexible modelling of assembly line efficiency that can capture nonlinear effects of throughput, downtime, staffing, and maintenance planning once real measurements replace the mock features.
2. Controlled complexity, avoiding overfitting to idiosyncratic noise by tuning α and the polynomial degree.
3. Interpretable insights, highlighting the most influential polynomial terms to guide targeted interventions that maximize defect-free output.