Production Cost Prediction using Linear Regression in ML


A manufacturing plant wants to forecast the direct production cost it will incur tomorrow so that planners can lock in raw‑material purchases, schedule overtime, and quote prices with confidence.

Using the public “Predicting Manufacturing Defects” dataset, which logs daily shop-floor metrics such as production volume, supplier quality, downtime, maintenance hours, workforce productivity, and energy usage, we will train a linear-regression baseline that predicts Production Cost (USD/day) from the other operational features. While mature plants may later adopt ensemble or causal models, a transparent linear fit reveals exactly how strongly each lever (e.g., downtime or energy efficiency) influences the daily cost and supplies a rigorous benchmark for future work.

Libraries Required

pandas # data wrangling
numpy # numeric helpers
matplotlib.pyplot # quick sanity‑check plots
scikit‑learn # preprocessing, model, metrics
joblib # save the trained pipeline

Dataset Link

Predicting Manufacturing Defects Dataset

Step-by-Step Code Implementation

Why linear regression first?
Production cost in most factories is an additive budget:
material + labour + energy + overhead. Each driver contributes roughly linearly, so an OLS fit is both sensible and instantly explainable.
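
Written out, the additive assumption corresponds to the standard linear model below. The symbols are notation introduced here for illustration and do not come from the dataset: x_j is the j-th operational feature, β_j its coefficient, p the number of features, and n the number of days in the training data.

```latex
\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j,
\qquad
(\hat\beta_0, \dots, \hat\beta_p)
  = \arg\min_{\beta} \sum_{i=1}^{n}
    \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
```

Each β_j is the change in predicted cost for a one-unit change in x_j with the other features held fixed, which is what makes the fit easy to explain.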

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the data

df = pd.read_csv("production_cost_dataset.csv")   # rename after download
print(df.head())

Expected numeric columns (all float/int):

Column	Typical range
ProductionVolume	100 – 1000 units
SupplierQuality	80 – 100 (%)
DeliveryDelay	0 – 5 days
MaintenanceHours	0 – 24 h / week
DowntimePercentage	0 – 5 (%)
InventoryTurnover	2 – 10
StockoutRate	0 – 10 (%)
WorkerProductivity	80 – 100 (%)
SafetyIncidents	0 – 10 / month
EnergyConsumption	1 000 – 5 000 kWh
EnergyEfficiency	0.10 – 0.50
AdditiveProcessTime	1 – 10 h
AdditiveMaterialCost	100 – 500 USD
ProductionCost	5 000 – 20 000 USD (target)
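
Before going further, it can help to confirm that the loaded frame actually has the expected columns and that they parsed as numbers. The sketch below is one way to do that; `check_schema` is a hypothetical helper written for this article, and the two-row `demo` frame is made-up data standing in for the real CSV.

```python
import pandas as pd

def check_schema(df, expected):
    """Return (missing columns, expected columns that are not numeric)."""
    missing = [c for c in expected if c not in df.columns]
    non_numeric = [
        c for c in expected
        if c in df.columns and not pd.api.types.is_numeric_dtype(df[c])
    ]
    return missing, non_numeric

# Hypothetical stand-in for the real file: one column was read as text
demo = pd.DataFrame({
    "ProductionVolume": [500, 620],
    "ProductionCost": ["12000", "9800"],   # strings, not numbers
})

missing, non_numeric = check_schema(
    demo, ["ProductionVolume", "ProductionCost", "DowntimePercentage"]
)
print("Missing columns    :", missing)
print("Non-numeric columns:", non_numeric)
```

On the real data you would pass `df` and the full list of column names from the table above.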

3. Basic cleaning

# every column is either a predictor or the target,
# so drop any row that has a missing value anywhere
df = df.dropna()
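
Dropping rows silently can hide a data-quality problem, so it may be worth counting what is lost. A minimal sketch on a made-up three-row frame (the values are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single gap, standing in for the real data
demo = pd.DataFrame({
    "DowntimePercentage": [1.2, np.nan, 3.4],
    "ProductionCost": [9000.0, 12000.0, 15000.0],
})

missing_per_col = demo.isna().sum()     # NaN count for each column
cleaned = demo.dropna()                 # remove rows with any NaN

print(missing_per_col[missing_per_col > 0])
print(f"Rows dropped: {len(demo) - len(cleaned)} of {len(demo)}")
```

If a large share of rows disappears, imputing the gaps may be a better choice than dropping them.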

4. Define predictors & label

All predictors are numeric but sit on very different scales (kWh vs. %). Scaling does not change the predictions of an ordinary-least-squares fit, but it puts every coefficient in the same unit, “$ per one‑σ shift,” so the drivers can be ranked against each other.

num_cols = [c for c in df.columns if c != 'ProductionCost']  # all numeric
target   = 'ProductionCost'

X = df[num_cols]
y = df[target]
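
One way to see what z-scoring does and does not change for plain OLS is to fit the same data with and without a scaler. The sketch below uses synthetic data (one kWh-sized column, one percent-sized column) generated only for this illustration; the numbers are not from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: column 0 on a kWh scale, column 1 on a percent scale
X_demo = np.column_stack([
    rng.uniform(1000, 5000, 200),
    rng.uniform(0, 5, 200),
])
y_demo = 5000 + 1.5 * X_demo[:, 0] + 400 * X_demo[:, 1] + rng.normal(0, 100, 200)

raw = LinearRegression().fit(X_demo, y_demo)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

# Predictions agree; only the units of the coefficients differ
same_preds = np.allclose(raw.predict(X_demo), scaled.predict(X_demo))
print("Identical predictions:", same_preds)
print("Raw coefficients     :", raw.coef_)          # $ per raw unit
print("Scaled coefficients  :", scaled[-1].coef_)   # $ per one-σ shift
```

Each scaled coefficient equals the raw coefficient multiplied by that column's standard deviation, which is why scaled coefficients can be compared across features.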

5. Pre‑processing & model pipeline

preproc = ColumnTransformer([
        ('scale', StandardScaler(), num_cols)   # only numeric features
])

linreg = LinearRegression()

pipe = Pipeline([
        ('prep',  preproc),
        ('model', linreg)
])

6. Train‑test split & training

This split treats each row as an independent day, so shuffling the rows leaks no future information into training.
(If your rows are ordered in time and you want a true next‑day forecast, use a time‑based split instead.)

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

pipe.fit(X_train, y_train)
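
For the time-based alternative, a minimal sketch is to switch off shuffling so that the most recent rows become the test set. This assumes the rows are already sorted oldest to newest, which the source does not state; the ten-row `demo` frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame, assumed to be in chronological order
demo = pd.DataFrame({
    "DowntimePercentage": np.linspace(0, 5, 10),
    "ProductionCost": np.linspace(5000, 20000, 10),
})

X_demo = demo.drop(columns="ProductionCost")
y_demo = demo["ProductionCost"]

# shuffle=False keeps row order: first 80 % train, last 20 % test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False)

print("Train rows:", list(X_tr.index))
print("Test rows :", list(X_te.index))
```

The model is then always evaluated on days that come after everything it was trained on.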

7. Evaluation

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f}")

A mean‑absolute error of, say, $1 350 is the typical size of a miss on a single day, which gives planners a starting point for the buffer to keep when quoting.
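
An MAE is easier to judge next to a naive reference. The sketch below compares a linear fit with a "predict the training mean" baseline using scikit-learn's `DummyRegressor`; the data is synthetic and generated only for this illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: one feature (think downtime %) driving daily cost
X_demo = rng.uniform(0, 5, size=(300, 1))
y_demo = 8000 + 900 * X_demo[:, 0] + rng.normal(0, 500, 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)   # ignores X
model = LinearRegression().fit(X_tr, y_tr)

mae_base = mean_absolute_error(y_te, baseline.predict(X_te))
mae_model = mean_absolute_error(y_te, model.predict(X_te))

print(f"Baseline MAE: ${mae_base:,.0f}")
print(f"Linear MAE  : ${mae_model:,.0f}")
```

If the linear model barely beats the mean baseline on the real data, the features carry little signal about cost and the coefficients should be read with caution.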

8. Inspect cost drivers

Because inputs are z‑scored, the largest positive coefficient points to the most expensive lever (often DowntimePercentage or AdditiveMaterialCost), while the most negative coefficient points to the biggest potential saving (often EnergyEfficiency or SupplierQuality).

coef_series = pd.Series(
        pipe.named_steps['model'].coef_,
        index=num_cols).sort_values()

print("\nCost‑reducing levers (negative coefficients):")
print(coef_series.head(5))

print("\nCost‑increasing levers (positive coefficients):")
print(coef_series.tail(5))

Example insight:
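A one‑standard‑deviation rise in downtime might add ≈ $430 per day, whereas a one‑standard‑deviation improvement in energy efficiency could trim ≈ $900. To state an effect per raw unit (for example, per percentage point of downtime), divide the coefficient by that feature's standard deviation.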
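
Because the coefficients of a z-scored model are in dollars per standard deviation, a per-unit figure needs one extra step: divide by the standard deviation the scaler learned. A minimal sketch on synthetic data (the column names mirror the dataset, but the values and the "true" effects of 430 and 6000 are invented for the demo):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in data with known per-unit effects
demo = pd.DataFrame({
    "DowntimePercentage": rng.uniform(0, 5, 500),
    "EnergyEfficiency": rng.uniform(0.1, 0.5, 500),
})
cost = (10000
        + 430 * demo["DowntimePercentage"]
        - 6000 * demo["EnergyEfficiency"]
        + rng.normal(0, 200, 500))

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
]).fit(demo, cost)

scaler = demo_pipe.named_steps["scale"]
per_sigma = pd.Series(demo_pipe.named_steps["model"].coef_, index=demo.columns)
per_unit = per_sigma / scaler.scale_    # undo the z-scoring: $ per raw unit

print(pd.DataFrame({"usd_per_sigma": per_sigma, "usd_per_unit": per_unit}))
```

In the pipeline from step 5 the scaler sits inside the ColumnTransformer, so it would be reached with `pipe.named_steps['prep'].named_transformers_['scale']` instead.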

9. Persist the trained pipeline

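Saving the fitted pipeline with joblib stores the scaler and the regression together, so new data is preprocessed exactly as the training data was.
Management can already answer:
“If we cut downtime by one standard deviation, the model estimates a saving of about $430/day.”
Before switching to gradient‑boosted trees or causal‑impact models, you’ll know exactly how much extra accuracy the complexity must deliver.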

joblib.dump(pipe, "production_cost_linreg.pkl")
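
Reloading the file and scoring a new day could look like the sketch below. To stay self-contained it fits a tiny stand-in pipeline on made-up, noise-free data and writes it to a temporary directory; the two feature names and the `tomorrow` row are illustrative assumptions, and a real call would need every column the pipeline was trained on.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Tiny stand-in for the trained pipeline (synthetic, noise-free data)
demo = pd.DataFrame({
    "DowntimePercentage": rng.uniform(0, 5, 50),
    "EnergyConsumption": rng.uniform(1000, 5000, 50),
})
cost = 6000 + 400 * demo["DowntimePercentage"] + 2 * demo["EnergyConsumption"]
demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
]).fit(demo, cost)

# Save to a temporary location, then load it back as a scheduler job would
path = os.path.join(tempfile.mkdtemp(), "production_cost_linreg.pkl")
joblib.dump(demo_pipe, path)
loaded = joblib.load(path)

# One row of planned values for the next day, same columns as in training
tomorrow = pd.DataFrame([{"DowntimePercentage": 2.0, "EnergyConsumption": 3000.0}])
forecast = loaded.predict(tomorrow)[0]
print(f"Forecast cost: ${forecast:,.0f}")
```

Load the file with the same scikit-learn version that saved it, since pickled estimators are not guaranteed to work across versions.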

Summary

In fewer than 100 lines of code, we turned raw shop‑floor logs into an explainable production‑cost forecaster:

Immediate dollar-level predictions support budgeting, vendor negotiation, and proactive maintenance planning.
Clear coefficients quantify which levers (downtime, energy, material cost, supplier quality) have the largest impact on daily spend.

Keep this linear model as your yardstick: every fancier algorithm you try next must beat its MAE and still tell a credible story to the plant manager.
