Production Cost Prediction using Linear Regression in ML
A manufacturing plant wants to forecast the direct production cost it will incur tomorrow so that planners can lock in raw‑material purchases, schedule overtime, and quote prices with confidence.
Using the public “Predicting Manufacturing Defects” dataset, which logs daily shop-floor metrics such as production volume, supplier quality, downtime, maintenance hours, workforce productivity, and energy usage, we will train a linear-regression baseline that predicts Production Cost (USD/day) from the other operational features. While mature plants may later adopt ensemble or causal models, a transparent linear fit reveals exactly how strongly each lever (e.g., downtime or energy efficiency) influences the daily cost and supplies a rigorous benchmark for future work.
Libraries Required
- pandas # data wrangling
- numpy # numeric helpers
- matplotlib.pyplot # quick sanity‑check plots
- scikit‑learn # preprocessing, model, metrics
- joblib # save the trained pipeline
Dataset Link
Predicting Manufacturing Defects Dataset
Step-by-Step Code Implementation
Why linear regression first?
Production cost in most factories is an additive budget:
material + labour + energy + overhead. Each driver contributes roughly linearly, so an OLS fit is both sensible and instantly explainable.
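Written out, the additive assumption corresponds to the standard multiple linear regression model fitted by ordinary least squares. The symbols below are notation introduced here for illustration: $x_j$ stands for the $j$-th operational feature, $y$ for the daily production cost, $p$ for the number of features, and $n$ for the number of rows.

```latex
\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j,
\qquad
(\hat{\beta}_0, \dots, \hat{\beta}_p)
  = \arg\min_{\beta_0, \dots, \beta_p}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
```

Each $\beta_j$ is the expected change in cost for a one-unit change in $x_j$ with the other features held fixed, which is what makes the fit easy to explain.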
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
df = pd.read_csv("production_cost_dataset.csv") # rename after download
print(df.head())
Expected numeric columns (all float/int):
| Column | Typical range |
| ProductionVolume | 100 – 1000 units |
| SupplierQuality | 80 – 100 (%) |
| DeliveryDelay | 0 – 5 days |
| MaintenanceHours | 0 – 24 h / week |
| DowntimePercentage | 0 – 5 (%) |
| InventoryTurnover | 2 – 10 |
| StockoutRate | 0 – 10 (%) |
| WorkerProductivity | 80 – 100 (%) |
| SafetyIncidents | 0 – 10 / month |
| EnergyConsumption | 1 000 – 5 000 kWh |
| EnergyEfficiency | 0.10 – 0.50 |
| AdditiveProcessTime | 1 – 10 h |
| AdditiveMaterialCost | 100 – 500 USD |
| ProductionCost | 5 000 – 20 000 USD / day (target) |
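Before modelling, it can help to confirm that the file you downloaded actually matches the table above. The sketch below is one possible check, not part of the original workflow: `check_schema` is a hypothetical helper, the column list is copied from the table, and the small `demo` frame is made up so the example runs without the CSV.

```python
import pandas as pd

# Column names copied from the table in this article; adjust if your CSV differs.
EXPECTED_COLUMNS = [
    "ProductionVolume", "SupplierQuality", "DeliveryDelay", "MaintenanceHours",
    "DowntimePercentage", "InventoryTurnover", "StockoutRate",
    "WorkerProductivity", "SafetyIncidents", "EnergyConsumption",
    "EnergyEfficiency", "AdditiveProcessTime", "AdditiveMaterialCost",
    "ProductionCost",
]

def check_schema(frame, expected=EXPECTED_COLUMNS):
    """Report missing, unexpected, and non-numeric columns (hypothetical helper)."""
    missing = [c for c in expected if c not in frame.columns]
    extra = [c for c in frame.columns if c not in expected]
    non_numeric = [
        c for c in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[c])
    ]
    return {"missing": missing, "extra": extra, "non_numeric": non_numeric}

# Tiny made-up frame standing in for the real CSV.
demo = pd.DataFrame({
    "ProductionVolume": [500, 620],
    "DowntimePercentage": [1.2, 3.4],
    "ProductionCost": [12000.0, 15500.0],
    "Shift": ["day", "night"],  # deliberately unexpected and non-numeric
})
report = check_schema(demo)
print(report)
```

On the real data you would call `check_schema(df)`. Any entry under `extra` or `non_numeric` deserves a look, because the later steps treat every column other than `ProductionCost` as a numeric predictor.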
3. Basic cleaning
core = [c for c in df.columns if c != 'ProductionCost'] + ['ProductionCost']
df = df.dropna(subset=core)  # drop incomplete rows
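Dropping incomplete rows is simple, but it is worth knowing how much data it costs. The following is a minimal sketch on a made-up frame (the values are invented for illustration) that counts missing values per column and the rows lost to `dropna`.

```python
import numpy as np
import pandas as pd

# Made-up rows with a few gaps, standing in for the real dataset.
demo = pd.DataFrame({
    "ProductionVolume": [500, np.nan, 700, 650],
    "EnergyConsumption": [2000, 2500, np.nan, 3100],
    "ProductionCost": [12000, 13000, 14000, 15000],
})

missing_per_col = demo.isna().sum()        # missing cells per column
cleaned = demo.dropna()                    # same effect as dropna over all columns
rows_dropped = len(demo) - len(cleaned)
share_dropped = rows_dropped / len(demo)

print(missing_per_col)
print(f"Dropped {rows_dropped} of {len(demo)} rows ({share_dropped:.0%})")
```

If the dropped share is large on your data, imputation may be a better choice than deletion; that decision is outside the scope of this baseline.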
4. Define predictors & label
All predictors are numeric but on wildly different scales (kWh vs %). Plain OLS predictions do not depend on feature scale, but standardizing puts every coefficient in the same unit, so each one reads as “$ per one‑σ shift” and the levers can be compared directly.
num_cols = [c for c in df.columns if c != 'ProductionCost']  # all numeric
target = 'ProductionCost'
X = df[num_cols]
y = df[target]
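One way to see what standardizing does and does not change for ordinary least squares is to fit the same synthetic data with and without a scaler. Everything in this sketch is made up for illustration (the two features, the generating coefficients, the noise level); it is not output from the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins: one large-magnitude feature (kWh) and one small one (%).
energy_kwh = rng.uniform(1000, 5000, n)
downtime_pct = rng.uniform(0, 5, n)
X_demo = np.column_stack([energy_kwh, downtime_pct])
y_demo = 5000 + 1.5 * energy_kwh + 400 * downtime_pct + rng.normal(0, 100, n)

raw = LinearRegression().fit(X_demo, y_demo)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

# Predictions agree to numerical precision ...
same_predictions = np.allclose(raw.predict(X_demo), scaled.predict(X_demo))

# ... but the coefficients are expressed in different units.
raw_coefs = raw.coef_                                       # USD per kWh, USD per point
std_coefs = scaled.named_steps["linearregression"].coef_    # USD per one std. dev.

print("Same predictions:", same_predictions)
print("Raw coefficients:", raw_coefs)
print("Standardized coefficients:", std_coefs)
```

The standardized coefficients equal the raw ones multiplied by each feature's standard deviation, which is why they are comparable across features while the raw ones are not.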
5. Pre‑processing & model pipeline
preproc = ColumnTransformer([
('scale', StandardScaler(), num_cols) # only numeric features
])
linreg = LinearRegression()
pipe = Pipeline([
('prep', preproc),
('model', linreg)
])
6. Train‑test split & training
Rows represent independent days; there’s no future leakage in random splitting.
(For sequential forecasting, you’d use a time‑based split.)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)
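If your own logs do carry a timestamp, a chronological split could look like the sketch below. Note the assumptions: the `Date` column is hypothetical (the table of expected columns lists none), `chronological_split` is an illustrative helper, and the ten rows are invented.

```python
import pandas as pd

def chronological_split(frame, date_col, test_fraction=0.2):
    """Hold out the most recent rows as the test set (hypothetical helper)."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Ten made-up days, shuffled to mimic rows that arrive out of order.
demo = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "ProductionVolume": [500, 520, 610, 580, 640, 700, 690, 720, 750, 730],
    "ProductionCost": [9800, 10100, 11500, 11000, 12000,
                       12900, 12700, 13200, 13600, 13300],
}).sample(frac=1, random_state=0)

train_df, test_df = chronological_split(demo, "Date")
print("Train ends:", train_df["Date"].max())
print("Test starts:", test_df["Date"].min())
```

The model is then trained only on days that precede every test day, which mirrors how a forecast for tomorrow would actually be used.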
7. Evaluation
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f}")
A mean‑absolute error of, say, $1 350 tells planners how much buffer to keep when quoting.
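An MAE is easier to judge next to a naive reference. The sketch below compares a linear fit with scikit-learn's `DummyRegressor`, which always predicts the training mean. The data are synthetic and the generating coefficients are made up, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Three synthetic, already-standardized features and a noisy linear cost.
X_demo = rng.normal(size=(300, 3))
y_demo = 12000 + X_demo @ np.array([900.0, -400.0, 150.0]) + rng.normal(0, 300, 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)   # predicts the mean cost
model = LinearRegression().fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_model = mean_absolute_error(y_te, model.predict(X_te))

print(f"MAE, mean baseline : ${mae_baseline:,.0f}")
print(f"MAE, linear model  : ${mae_model:,.0f}")
```

With the real pipeline you would fit the dummy model on `X_train, y_train` and score it on `X_test`. If the linear model barely beats the mean, the features carry little information about cost.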
8. Inspect cost drivers
Because inputs are z‑scored, the largest positive coefficient pinpoints the most expensive lever (often DowntimePercentage or AdditiveMaterialCost), while the most negative coefficient reveals the most cost-effective savings opportunity (often EnergyEfficiency or SupplierQuality).
coef_series = pd.Series(
pipe.named_steps['model'].coef_,
index=num_cols).sort_values()
print("\nCost‑reducing levers (negative coefficients):")
print(coef_series.head(5))
print("\nCost‑increasing levers (positive coefficients):")
print(coef_series.tail(5))
Example insight:
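Every one‑standard‑deviation increase in downtime might add ≈ $430, whereas a one‑standard‑deviation improvement in energy efficiency could trim ≈ $900. To express an effect per raw unit (for example, per percentage point of downtime), divide the coefficient by that feature's standard deviation.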
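Because the pipeline standardizes its inputs, each fitted coefficient is in dollars per standard deviation. The sketch below shows one way to convert to dollars per raw unit by dividing by the scaler's `scale_` attribute. It runs on synthetic data: the two features and the generating values 430 and -6000 are invented for the demo.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "DowntimePercentage": rng.uniform(0, 5, 250),
    "EnergyEfficiency": rng.uniform(0.1, 0.5, 250),
})
# Made-up cost process: +430 USD per downtime point, -6000 USD per efficiency unit.
demo["ProductionCost"] = (10000
                          + 430 * demo["DowntimePercentage"]
                          - 6000 * demo["EnergyEfficiency"]
                          + rng.normal(0, 200, 250))

cols = ["DowntimePercentage", "EnergyEfficiency"]
demo_pipe = Pipeline([
    ("prep", ColumnTransformer([("scale", StandardScaler(), cols)])),
    ("model", LinearRegression()),
]).fit(demo[cols], demo["ProductionCost"])

# Fitted scaler inside the ColumnTransformer; scale_ holds each feature's std. dev.
scaler = demo_pipe.named_steps["prep"].named_transformers_["scale"]
per_sigma = pd.Series(demo_pipe.named_steps["model"].coef_, index=cols)
per_unit = per_sigma / scaler.scale_

print("USD per one standard deviation:\n", per_sigma.round(1))
print("USD per raw unit:\n", per_unit.round(1))
```

With the article's pipeline the same two lookups (`named_steps['prep']`, then `named_transformers_['scale']`) should apply, with `num_cols` as the index.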
9. Persist the trained pipeline
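Management can already answer:
“If we cut downtime by one standard deviation, we expect daily cost to be about $430 lower, all else equal.”
Before switching to gradient‑boosted trees or causal‑impact models, you’ll know exactly how much extra accuracy the complexity must deliver.
Saving the fitted pipeline stores the scaler and the regression together, so the same preprocessing is applied whenever the model is reloaded.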
joblib.dump(pipe, "production_cost_linreg.pkl")
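Reloading works the same way in reverse. The sketch below round-trips a small stand-in pipeline through `joblib` and scores one new row. The training data, the file name `demo_linreg.pkl`, and the `tomorrow` row are all invented for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline trained on made-up data.
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    "ProductionVolume": rng.uniform(100, 1000, 100),
    "DowntimePercentage": rng.uniform(0, 5, 100),
})
cost = (6000 + 10 * demo["ProductionVolume"]
        + 400 * demo["DowntimePercentage"]
        + rng.normal(0, 150, 100))
demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
]).fit(demo, cost)

# Save and reload in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_linreg.pkl")
    joblib.dump(demo_pipe, path)
    restored = joblib.load(path)

# One hypothetical day, with the same columns in the same order as in training.
tomorrow = pd.DataFrame([{"ProductionVolume": 640, "DowntimePercentage": 2.0}])
pred_original = demo_pipe.predict(tomorrow)[0]
pred_restored = restored.predict(tomorrow)[0]
print(f"Predicted cost: ${pred_restored:,.0f}")
```

Only load pickle files from sources you trust, and keep the scikit-learn version consistent between saving and loading.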
Summary
In fewer than 100 lines of code, we turned raw shop‑floor logs into an explainable production‑cost forecaster:
- Immediate dollar-level predictions facilitate budgeting, vendor negotiation, and proactive maintenance planning.
- Crystal-clear coefficients quantify which levers (downtime, energy, material cost, supplier quality) have the largest impact on daily spend.
Keep this linear model as your yardstick—every fancier algorithm you try next must beat its MAE and still tell a credible story to the plant manager.