Manufacturing Defect Cost Prediction with Ridge Regression in ML
Manufacturing firms keep painstaking records of every unit that leaves the production line. When a part is rejected—because it is scratched, out of tolerance, or fails the final test—the company incurs direct defect-handling costs: extra labour, scrap material, re-inspection, rework, and sometimes expedited shipping for replacements.
The finance and continuous‑improvement teams want a forward‑looking estimate of that defect cost per batch so that they can:
- Flag high‑risk lots for inspection early,
- Budget the correct level of quality‐control resources, and
- Measure the ROI of planned process upgrades.
We will build a Ridge Regression model (linear regression with L2 regularisation) that predicts the total defect-handling cost in USD for a given production batch using routinely logged process metrics (temperature, pressure, line speed, operator skill, supplier batch codes, etc.). Ridge keeps the model linear and interpretable while damping down unstable coefficients that often appear when many closely related process variables are fed in.
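For reference, the objective that Ridge minimises can be written as below. The symbols are notation chosen here for illustration: $X$ is the matrix of (preprocessed) process metrics, $y$ the vector of batch defect costs, $\beta$ the coefficient vector, and $\alpha \ge 0$ the regularisation strength.

```latex
\hat{\beta} \;=\; \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 \;+\; \alpha \,\lVert \beta \rVert_2^2
```

With $\alpha = 0$ this reduces to ordinary least squares; larger values of $\alpha$ pull the coefficients towards zero. In scikit-learn the intercept is fitted separately and is not penalised.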
Libraries Required
- pandas # data preparation
- numpy # numerical helpers
- matplotlib.pyplot # quick diagnostic plots (optional)
- scikit‑learn # preprocessing, Ridge regression, metrics
- joblib # model persistence
Dataset Link
Predicting Manufacturing Defects Dataset
Step by Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
import joblib
2. Load the dataset
Download Predicting Manufacturing Defects from Kaggle and unzip it in your working directory:
df = pd.read_csv("predicting_manufacturing_defects.csv")
print(df.shape)
print(df.head())
3. Initial clean‑up
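Drop any rows where the target is missing, then split the remaining columns into numeric and categorical lists. The numeric columns will later be centred and scaled to unit variance so that Ridge’s L2 penalty treats kWh, °C, and line speed evenly.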
# keep only rows where the target is present
df = df.dropna(subset=['DefectCost']).copy()
# list all numeric and categorical columns
num_cols = [c for c in df.columns if df[c].dtype != 'object' and c != 'DefectCost']
cat_cols = [c for c in df.columns if df[c].dtype == 'object']
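As a quick sanity check, the same clean-up can be tried on a tiny hand-made frame. This is only a sketch: every column name except DefectCost is hypothetical, and `select_dtypes` is used here as an alternative way to split columns by type (it also picks up string columns that are not stored with the `object` dtype).

```python
import numpy as np
import pandas as pd

# Hypothetical toy batches; only DefectCost matches the tutorial's column name.
toy = pd.DataFrame({
    "Temperature": [88.0, 92.5, 90.1, 95.3],
    "LineSpeed": [120, 135, 128, 140],
    "SupplierCode": ["A", "B", "A", "C"],
    "DefectCost": [310.0, np.nan, 275.5, 640.0],
})

# Keep only rows where the target is present.
toy = toy.dropna(subset=["DefectCost"]).copy()

# Split feature columns by type, leaving the target out of the numeric list.
num_cols = toy.select_dtypes(include="number").columns.drop("DefectCost").tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(len(toy))   # 3 rows survive the dropna
print(num_cols)   # ['Temperature', 'LineSpeed']
print(cat_cols)   # ['SupplierCode']
```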
4. Build the preprocessing + Ridge pipeline
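The ColumnTransformer standardises the numeric columns and converts supplier codes, shift IDs, and machine IDs into binary flags without imposing a fake numeric order. Setting handle_unknown='ignore' means a category that was never seen during training is encoded as all zeros instead of raising an error.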
preproc = ColumnTransformer([
('num', StandardScaler(), num_cols),
('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])
# RidgeCV will search a handful of α values and pick the best with 5‑fold CV
alphas = [0.1, 1.0, 10.0, 50.0, 100.0]
ridge = RidgeCV(alphas=alphas, cv=5)
model = Pipeline(steps=[
('prep', preproc),
('ridge', ridge)
])
5. Train–test split and model fitting
X = df[num_cols + cat_cols]
y = df['DefectCost']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True)
model.fit(X_train, y_train)
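To see what the L2 penalty does during fitting, the sketch below compares ordinary least squares with Ridge on synthetic data in which two sensors report almost the same reading. Everything here is an assumption made for illustration: the data are random, and alpha=10.0 is an arbitrary choice, not the value RidgeCV would select on the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two nearly identical "temperature" sensors (synthetic, for illustration).
temp = rng.normal(size=n)
temp_dup = temp + rng.normal(scale=0.01, size=n)
X = np.column_stack([temp, temp_dup])

# The cost depends on the underlying temperature plus noise.
y = 100.0 * temp + rng.normal(scale=5.0, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coefficients  :", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("OLS norm  :", np.linalg.norm(ols.coef_))
print("Ridge norm:", np.linalg.norm(ridge.coef_))
```

With near-duplicate columns, least squares is free to assign a large positive weight to one sensor and a large negative weight to the other. Ridge typically splits the effect between the two and always yields a coefficient vector with a smaller norm.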
6. Evaluation
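Ridge is a classic linear model with an L2 penalty. The cross‑validation built into RidgeCV picks the α with the best average validation score across the five folds, giving you a reasonable bias‑variance trade-off without manual tuning. The hold-out metrics below show how well that choice generalises to batches the model has not seen.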
pred = model.predict(X_test)
print(f"Optimal α chosen by CV : {model.named_steps['ridge'].alpha_:.2f}")
print(f"R² on hold‑out set : {r2_score(y_test, pred):.3f}")
print(f"MAE on hold‑out set : ${mean_absolute_error(y_test, pred):,.0f}")
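For readers who want to see what the two metrics measure, the following sketch recomputes them by hand on four made-up cost values (hypothetical numbers, in USD) and checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted defect costs in USD.
y_true = np.array([200.0, 400.0, 600.0, 800.0])
y_pred = np.array([250.0, 350.0, 650.0, 750.0])

# MAE: average absolute miss, in the same units as the target.
mae = np.mean(np.abs(y_true - y_pred))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE: ${mae:,.0f}")   # MAE: $50
print(f"R² : {r2:.3f}")      # R² : 0.950
```

MAE answers "how many dollars off is a typical prediction", while R² answers "how much of the batch-to-batch variation does the model explain".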
7. Inspecting the coefficients
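Because everything numeric was z‑scored, each numeric coefficient reads as “dollar change in defect cost for a one‑standard‑deviation increase in that metric,” holding the other features fixed. Each one‑hot coefficient is the dollar shift applied when that flag is set. A large positive weight on a feature such as SupplierCode_XYZ or Temp_Above_90C indicates to engineering where to focus a Six-Sigma project. Keep in mind that Ridge shrinks coefficients towards zero, so treat them as indicators of direction and relative size, not as exact cost figures.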
# Reconstruct the full feature list after one-hot encoding.
# ColumnTransformer emits columns in transformer order: 'num' first, then 'cat'.
ohe = model.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([num_cols, ohe_names])
coefs = pd.Series(model.named_steps['ridge'].coef_, index=feature_names)
print(coefs.sort_values(ascending=False).head(10))  # strongest cost adders
print(coefs.sort_values().head(10))  # strongest cost reducers
8. Persist the pipeline for re‑use
The .pkl file stores preprocessing and coefficients together, so tomorrow’s MES (Manufacturing Execution System) can load it and score a new batch in milliseconds—no coding required on the production line.
joblib.dump(model, "ridge_defect_cost_model.pkl")
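The round trip of saving, loading, and scoring might look like the sketch below. It uses a small stand-in pipeline on hypothetical data so that it runs on its own; in the tutorial the saved object would be `model`, and the new batch would need the same columns the model was trained on.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline trained on hypothetical batches.
train = pd.DataFrame({
    "Temperature": [88.0, 92.0, 90.0, 94.0],
    "LineSpeed": [120.0, 135.0, 128.0, 140.0],
})
cost = pd.Series([300.0, 420.0, 330.0, 610.0])
stand_in = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(train, cost)

# Save to a temporary directory, then load it back as a consumer would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ridge_defect_cost_model.pkl")
    joblib.dump(stand_in, path)
    loaded = joblib.load(path)

# Score a new batch with the same column names as the training data.
new_batch = pd.DataFrame({"Temperature": [91.0], "LineSpeed": [130.0]})
predicted = loaded.predict(new_batch)
print(f"Predicted defect cost: ${predicted[0]:,.0f}")
```

Two practical notes: load pickle files only from sources you trust, and load the model with the same scikit-learn version that saved it, since compatibility across versions is not guaranteed.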
Summary
By pairing Ridge regression with a tidy preprocessing pipeline, we produced an interpretable, production‑ready predictor of manufacturing defect cost:
- Real‑world benefit: quality engineers can pre‑price the cost of poor quality before parts even leave the line, helping them justify preventive action.
- Transparency: every coefficient is a dollar number—no black‑box gloom—while Ridge’s L2 penalty tames multicollinearity.
- Future‑proof: tree models or neural nets may beat the MAE, but they must justify their extra complexity against this sturdy, explain‑it‑to‑your‑boss baseline.