Industrial Production Cost Prediction using ElasticNet Algorithm in ML
Manufacturing planners must quote accurate, part‑level production costs long before the first chip is cut. Total cost depends on a mix of continuous factors—batch size, machining time, material weight—and categorical choices such as material grade or machine group. These predictors are often highly collinear (e.g., batch size ↔ setup labour).
Ordinary least squares produces unstable, high-variance coefficients under multicollinearity, while pure Lasso (ℓ₁) tends to keep one variable from a correlated group and drop the rest, even when they are genuinely helpful. Elastic Net combines Ridge's stability (ℓ₂) with Lasso's sparsity to yield a robust, interpretable model that forecasts manufacturing cost (USD) for a new job, helping estimators bid competitively without bleeding profit.
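For reference, the blend of the two penalties can be written out. The form below follows the objective documented for scikit-learn's ElasticNet, with n samples, overall strength α (the alpha argument) and mixing weight ρ (the l1_ratio argument). The symbols w and ρ are notation chosen here for illustration.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

With ρ = 1 only the ℓ₁ term remains (Lasso); with ρ = 0 only the ℓ₂ term remains (a Ridge-style penalty).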
Libraries Required
| Purpose | Python Library |
| --- | --- |
| Data wrangling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| ML pipeline | scikit-learn: ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Metrics | scikit-learn: mean_squared_error, r2_score |
Dataset Link
Step-by-Step Code Implementation
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
Download and load the dataset
The Kaggle Manufacturing Cost file lists simulated jobs with material, labour, overhead, machine group, and batch size, making it ideal for a cost‑prediction tutorial.
# One‑time shell (requires Kaggle API token in ~/.kaggle/kaggle.json)
# kaggle datasets download -d vinicius150987/manufacturing-cost -p data --unzip
df = pd.read_csv("data/manufacturing_cost_dataset.csv") # adjust filename if needed
Dataset snapshot (columns): ['Units', 'Material_Type', 'Machine_Group', 'Setup_Hours', 'Run_Time_Hours', 'Labour_Rate_USD', 'Material_Cost_USD', 'Overhead_USD', 'Total_Cost_USD']
Initial inspection
print(df.head())
sns.histplot(df['Total_Cost_USD'], kde=True)
plt.title('Distribution of Manufacturing Cost'); plt.show()
print(df.isna().mean()) # check missing values
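Since the introduction warns that predictors are often collinear, it can help to look for strongly correlated pairs during inspection. The following is a minimal sketch on synthetic stand-in data, not the Kaggle file: the column names mirror the dataset snapshot, while the generating rule and the 0.8 cutoff are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the Kaggle file): setup hours grow with batch size.
rng = np.random.default_rng(0)
n = 500
units = rng.integers(10, 1000, size=n).astype(float)
demo = pd.DataFrame({
    "Units": units,
    "Setup_Hours": 0.01 * units + rng.normal(0, 0.5, n),
    "Run_Time_Hours": rng.uniform(1, 40, n),
})

corr = demo.corr()  # Pearson correlation between the numeric columns
cols = list(corr.columns)

# Visit each unordered pair once and keep those above a hypothetical 0.8 cutoff.
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > 0.8
]
print(high_pairs)
```

On the real data, the same loop could be run on df[num_cols].corr(); pairs that show up here are the ones where Elastic Net's ℓ₂ component is expected to stabilise the coefficients.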
Define target & features
y = df['Total_Cost_USD']
X = df.drop(columns=['Total_Cost_USD']) # predictors only
cat_cols = ['Material_Type', 'Machine_Group']
num_cols = [c for c in X.columns if c not in cat_cols]
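If the column list changes between extracts, the categorical and numeric groups could also be derived from dtypes rather than typed by hand. This is a sketch on a toy frame with made-up values; the names toy, num_cols_auto and cat_cols_auto are hypothetical.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the dataset snapshot.
toy = pd.DataFrame({
    "Units": [100, 250, 40],
    "Material_Type": ["Carbon Steel", "Alloy Steel", "Aluminium"],
    "Run_Time_Hours": [3.5, 8.0, 1.2],
    "Total_Cost_USD": [1200.0, 3100.0, 560.0],
})

X_toy = toy.drop(columns=["Total_Cost_USD"])

# Numeric dtypes go to the scaler, everything else to the one-hot encoder.
num_cols_auto = X_toy.select_dtypes(include="number").columns.tolist()
cat_cols_auto = X_toy.select_dtypes(exclude="number").columns.tolist()
print(num_cols_auto, cat_cols_auto)
```

One caveat: a category stored as integer codes (for example a numeric machine-group ID) would land in the numeric list, so an explicit list like the one above remains the safer choice when such columns exist.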
Pre‑processing & Elastic Net pipeline
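All transformations live inside a single Pipeline object, so during cross-validation the encoder and scaler are fit on the training folds only, which prevents preprocessing leakage. The same preprocessing runs automatically during .predict().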
preprocess = ColumnTransformer([
('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols), # scikit-learn >= 1.2; older releases use sparse=False
('num', StandardScaler(), num_cols)
])
pipe = Pipeline([
('prep', preprocess),
('model', ElasticNet(max_iter=20000, random_state=42))
])
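To see the "same preprocessing at predict time" behaviour in isolation, here is a small self-contained sketch. The data and the cost rule ($100 base, $50 per hour, $300 alloy surcharge) are invented for illustration, and toy_pipe is a hypothetical name; the encoder is left at its default output type so the snippet does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(1)
n = 200
train = pd.DataFrame({
    "Material_Type": rng.choice(["Carbon Steel", "Alloy Steel"], size=n),
    "Run_Time_Hours": rng.uniform(1, 40, n),
})
# Invented cost rule: $100 base + $50 per hour + $300 surcharge for alloy steel.
cost = 100 + 50 * train["Run_Time_Hours"] + 300 * (train["Material_Type"] == "Alloy Steel")

toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), ["Material_Type"]),
        ("num", StandardScaler(), ["Run_Time_Hours"]),
    ])),
    ("model", ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=20000)),
])
toy_pipe.fit(train, cost)

# Raw, unscaled rows go straight in; encoding and scaling happen inside the pipeline.
new_jobs = pd.DataFrame({
    "Material_Type": ["Alloy Steel", "Carbon Steel"],
    "Run_Time_Hours": [10.0, 10.0],
})
preds = toy_pipe.predict(new_jobs)
print(preds.round(0))
```

Under the invented rule the two jobs cost $900 and $600, and the predictions should land close to those values because the penalty is small.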
Train/test split + hyper‑parameter search
- α (alpha) adjusts overall shrinkage;
- l1_ratio (0→Ridge, 1→Lasso) balances stability vs sparsity.
Cross-validating 18 α values × 9 ratios gives 162 candidate models (810 fits with 5-fold CV); the search keeps the combination with the lowest cross-validated RMSE.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
param_grid = {
'model__alpha' : np.logspace(-3, 1, 18), # 0.001 → 10
'model__l1_ratio': np.linspace(0.1, 0.9, 9) # Ridge‑heavy → Lasso‑heavy
}
gs = GridSearchCV(pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)
print("Best α:", gs.best_params_['model__alpha'])
print("Best l1_ratio:", gs.best_params_['model__l1_ratio'])
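Because the scorer is neg_root_mean_squared_error, GridSearchCV reports negated errors, and the sign has to be flipped to read the cross-validated RMSE. The sketch below illustrates this on synthetic data with a deliberately tiny grid; X_demo, y_demo, small_grid and search are hypothetical names, and the grid values are not the ones tuned above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem: two informative features, two pure-noise features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=120)

small_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.8]}
search = GridSearchCV(
    ElasticNet(max_iter=20000),
    small_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])  # 3 alphas x 2 ratios
cv_rmse = -search.best_score_                     # flip the sign back to a positive RMSE
print(n_candidates, search.best_params_, round(cv_rmse, 3))
```

Applied to the search above, -gs.best_score_ would give the cross-validated RMSE in USD, which is a useful number to compare against the hold-out RMSE.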
Evaluate on the hold‑out set
y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred)) # squared=False is deprecated in recent scikit-learn releases
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
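As a sanity check on what the two metrics measure, they can be worked by hand on three made-up jobs. The numbers below are purely illustrative and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0])  # made-up actual costs
y_hat = np.array([110.0, 190.0, 320.0])   # made-up predictions

errors = y_true - y_hat                          # [-10, 10, -20]
rmse_manual = np.sqrt(np.mean(errors ** 2))      # sqrt(600 / 3) = sqrt(200)
ss_res = np.sum(errors ** 2)                     # 600
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 20000
r2_manual = 1 - ss_res / ss_tot                  # 1 - 0.03

# Cross-check against scikit-learn's implementations.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print(round(rmse_manual, 2), round(r2_manual, 2))  # 14.14 0.97
```

RMSE is in the same unit as the target (USD here), so it reads as a typical quoting error per job, while R² is unitless and describes the share of cost variance the model explains.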
Interpret coefficients
The coefficient plot immediately shows, for example, that each extra run‑time hour adds $58 on average, while choosing Alloy Steel boosts cost by $320 over the carbon‑steel baseline. Zeroed features signal metrics that, given the others, contribute negligible incremental cost.
# Recover feature names
ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
# Un‑scale numeric coeffs
scales = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
coef = gs.best_estimator_.named_steps['model'].coef_.copy() # copy so the fitted model's coef_ is not modified in place
coef[-len(num_cols):] = coef[-len(num_cols):] / scales
imp = (pd.Series(coef, index=feature_names)
.sort_values(key=abs, ascending=False))
plt.figure(figsize=(9,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Manufacturing Cost (Elastic Net)')
plt.xlabel('Δ Cost (USD)'); plt.tight_layout(); plt.show()
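The un-scaling step may be easier to trust with a one-feature example. The sketch below swaps in plain LinearRegression instead of Elastic Net so that the slope is recovered exactly; the $58-per-hour rule simply reuses the illustrative figure from the text, and the data are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

hours = np.array([[2.0], [5.0], [9.0], [14.0], [20.0]])
cost = 100 + 58 * hours.ravel()  # invented rule: $58 per run-time hour

scaler = StandardScaler().fit(hours)
model = LinearRegression().fit(scaler.transform(hours), cost)

# The fitted coefficient is in USD per standard deviation of run time.
coef_std = model.coef_.copy()  # copy, so the fitted model stays untouched
# Dividing by the scaler's standard deviation converts it to USD per hour.
coef_usd_per_hour = coef_std / scaler.scale_

print(round(float(coef_std[0]), 2), round(float(coef_usd_per_hour[0]), 2))
```

With Elastic Net the recovered value would be slightly shrunk toward zero by the penalty, but the unit conversion works the same way. One-hot coefficients need no conversion because those columns are not scaled.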
Summary
By coupling Elastic Net regression with a tidy Pipeline, we created a transparent estimator that accepts a small amount of bias in exchange for lower variance and that:
- Predicts per‑job manufacturing cost within a small error band (low RMSE).
- Handles multicollinearity among production metrics while shrinking uninformative features to zero.
- Explains itself via dollar‑impact coefficients—providing engineers with instant levers for cost reduction.
Updating the model is straightforward: load next quarter's ERP extract into the notebook, rebuild the train/test split, and call gs.fit(X_train, y_train) again. Cost estimation just moved from back-of-envelope to reproducible data science.