Manufacturing Defect Cost Prediction with Ridge Regression in ML


Manufacturing firms keep painstaking records of every unit that leaves the production line. When a part is rejected—because it is scratched, out of tolerance, or fails the final test—the company incurs direct defect-handling costs: extra labour, scrap material, re-inspection, rework, and sometimes expedited shipping for replacements.

The finance and continuous‑improvement teams want a forward‑looking estimate of that defect cost per batch so that they can:

Flag high‑risk lots for inspection early,
Budget the correct level of quality‐control resources, and
Measure the ROI of planned process upgrades.

We will build a Ridge Regression model (linear regression with L2 regularisation) that predicts the total defect-handling cost in USD for a given production batch using routinely logged process metrics (temperature, pressure, line speed, operator skill, supplier batch codes, etc.). Ridge keeps the model linear and interpretable while damping down unstable coefficients that often appear when many closely related process variables are fed in.
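As a quick illustration of that damping effect, the sketch below fits ordinary least squares and Ridge to two nearly identical synthetic sensor readings. The data, the variable names, and alpha=1.0 are assumptions made for this demo; none of them come from the project dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two synthetic, almost identical process readings (hypothetical sensors)
temp_a = rng.normal(size=n)
temp_b = temp_a + rng.normal(scale=0.01, size=n)
X_demo = np.column_stack([temp_a, temp_b])

# The cost depends only on the shared signal, plus noise
y_demo = 3.0 * temp_a + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

print("OLS coefficients  :", ols.coef_)
print("Ridge coefficients:", ridge_demo.coef_)
```

With near-duplicate columns, least squares can split the effect between them in an unstable, partly offsetting way. The L2 penalty tends to pull the pair towards similar, smaller weights whose sum still reflects the shared effect.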

Libraries Required

pandas # data preparation
numpy # numerical helpers
matplotlib.pyplot # quick diagnostic plots (optional)
scikit‑learn # preprocessing, Ridge regression, metrics
joblib # model persistence

Dataset Link

Predicting Manufacturing Defects Dataset

Step by Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
import joblib

2. Load the dataset

Download Predicting Manufacturing Defects from Kaggle and unzip it in your working directory. Adjust the file name below if your copy is named differently:

df = pd.read_csv("predicting_manufacturing_defects.csv")
print(df.shape)
print(df.head())

3. Initial clean‑up

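Drop rows with a missing target, then split the remaining columns into numeric and categorical groups. The numeric group is standardised later in the pipeline: StandardScaler centres each feature and scales it to unit variance so that Ridge’s L2 penalty treats kWh, °C, and line speed evenly.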

# keep only rows where the target is present
df = df.dropna(subset=['DefectCost']).copy()

# list all numeric and categorical columns
num_cols = [c for c in df.columns if df[c].dtype != 'object' and c != 'DefectCost']
cat_cols = [c for c in df.columns if df[c].dtype == 'object']
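To see what standardisation does to columns on very different scales, here is a minimal sketch on a made-up frame. The column names temperature_c and line_speed and their values are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical process metrics on very different scales
toy = pd.DataFrame({
    "temperature_c": [70.0, 80.0, 90.0, 100.0],
    "line_speed":    [1.0, 1.5, 2.0, 2.5],
})

scaled = StandardScaler().fit_transform(toy)

print(scaled.mean(axis=0))  # each column is centred at (numerically) zero
print(scaled.std(axis=0))   # each column has unit population standard deviation
```

After scaling, a penalty on coefficient size no longer depends on the unit in which each metric happened to be logged.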

4. Build the preprocessing + Ridge pipeline

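The ColumnTransformer standardises the numeric columns and one‑hot encodes the categorical ones. OneHotEncoder converts supplier codes, shift IDs, and machine IDs into binary flags without imposing a fake numeric order, and handle_unknown='ignore' keeps scoring from failing when a category appears that was not seen during training.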

preproc = ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

# RidgeCV will search a handful of α values and pick the best with 5‑fold CV
alphas = [0.1, 1.0, 10.0, 50.0, 100.0]
ridge  = RidgeCV(alphas=alphas, cv=5)

model = Pipeline(steps=[
        ('prep',  preproc),
        ('ridge', ridge)
])

5. Train–test split and model fitting

X = df[num_cols + cat_cols]
y = df['DefectCost']

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

model.fit(X_train, y_train)
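One check worth doing after fitting is whether the chosen α sits at either end of the candidate list, which would suggest widening the grid. The sketch below shows the idea on synthetic data from make_regression; the log-spaced grid and the data are assumptions for illustration, not a recommendation for the project dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in for the training data
X_syn, y_syn = make_regression(n_samples=200, n_features=10,
                               noise=10.0, random_state=0)

# Log-spaced candidates cover several orders of magnitude
alpha_grid = np.logspace(-3, 3, 13)
search = RidgeCV(alphas=alpha_grid, cv=5).fit(X_syn, y_syn)

at_edge = bool(np.isclose(search.alpha_, [alpha_grid[0], alpha_grid[-1]]).any())
print(f"chosen alpha = {search.alpha_:.4g}, at edge of grid: {at_edge}")
```

If at_edge is True, extend the grid in that direction and refit before trusting the selected penalty.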

6. Evaluation

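RidgeCV fits a classic linear model with an L2 penalty. Its built‑in cross‑validation picks, from the candidate list, the α with the best average validation score, which gives a reasonable bias‑variance trade‑off without manual tuning. The hold‑out metrics below then estimate how well that choice generalises to unseen batches.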

pred = model.predict(X_test)

print(f"Optimal α chosen by CV : {model.named_steps['ridge'].alpha_:.2f}")
print(f"R² on hold‑out set     : {r2_score(y_test, pred):.3f}")
print(f"MAE on hold‑out set    : ${mean_absolute_error(y_test, pred):,.0f}")
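For readers who want to see what the two metrics measure, the sketch below computes MAE and R² by hand on four made-up cost values and compares them with scikit-learn's functions. The numbers are invented for the arithmetic only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted defect costs in USD
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 380.0])

# MAE: average absolute miss, in dollars
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))  # 12.5 in both cases
print(r2_manual, r2_score(y_true, y_hat))              # about 0.986 in both cases
```

MAE stays in the target's own unit, which makes it easy to discuss with finance, while R² expresses how much of the batch-to-batch variation the model explains.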

7. Inspecting the coefficients

Because every numeric column was z‑scored, each numeric coefficient reads as the dollar change in defect cost for a one‑standard‑deviation increase in that metric, and each one‑hot coefficient reads as the dollar shift associated with that category. A large positive weight on a feature such as SupplierCode_XYZ or Temp_Above_90C (illustrative names) tells engineering where to focus a Six-Sigma project.

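# Reconstruct the full feature list after one‑hot encoding
ohe = model.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)

# ColumnTransformer outputs the 'num' block first, then 'cat',
# so the names must follow that order to line up with coef_
feature_names = np.concatenate([num_cols, ohe_names])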

coefs = pd.Series(model.named_steps['ridge'].coef_, index=feature_names)
print(coefs.sort_values(ascending=False).head(10))   # strongest cost adders
print(coefs.sort_values().head(10))                  # strongest cost reducers

8. Persist the pipeline for re‑use

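The .pkl file stores the fitted preprocessing steps and the Ridge coefficients together, so tomorrow’s MES (Manufacturing Execution System) can load it with joblib.load and score a new batch in milliseconds without re‑implementing any of the preprocessing. Load the file with the same scikit‑learn version that saved it, because pickled estimators are not guaranteed to work across versions.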

joblib.dump(model, "ridge_defect_cost_model.pkl")
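The loading side can be sketched as a round trip. The example below trains a tiny stand-in pipeline on made-up numbers, saves it to a temporary directory, reloads it, and scores one new batch. The file name demo_model.pkl and all values are assumptions for the demo.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the real pipeline, fitted on made-up numbers
X_small = np.array([[70.0, 1.0], [80.0, 1.5], [90.0, 2.0], [100.0, 2.5]])
y_small = np.array([120.0, 180.0, 200.0, 260.0])
small_model = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])
small_model.fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")
    joblib.dump(small_model, path)   # save preprocessing + coefficients
    restored = joblib.load(path)     # what the scoring service would do

new_batch = np.array([[85.0, 1.8]])  # one new batch, same column order as training
print(restored.predict(new_batch))
```

The restored object applies the same scaling and coefficients as the original, so the caller only needs to supply the raw columns in the training order.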

Summary

By pairing Ridge regression with a tidy preprocessing pipeline, we produced an interpretable, production‑ready predictor of manufacturing defect cost:

Real‑world benefit: quality engineers can pre‑price the cost of poor quality before parts even leave the line, helping them justify preventive action.
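Transparency: every coefficient is expressed in dollars (per standard deviation for numeric features, per category for one‑hot flags), so nothing is hidden in a black box, while Ridge’s L2 penalty tames multicollinearity.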
Future‑proof: tree models or neural nets may beat the MAE, but they must justify their extra complexity against this sturdy, explain‑it‑to‑your‑boss baseline.
