Factory Maintenance Cost Prediction using Stepwise Regression in ML
Every manufacturing firm pays for equipment maintenance, but blanket preventive schedules can waste resources when maintenance is done too early or invite breakdowns when it is overdue. In this project, we predict the annual maintenance cost of factories from operational and financial indicators (such as production volume, equipment age, workforce size, and R&D spend) by fitting a linear model with stepwise feature selection. The resulting parsimonious regression highlights the most influential cost drivers, which supports better budgeting and targeted upkeep strategies.
Libraries Required
import pandas as pd                                          # Data handling
import numpy as np                                           # Numerical operations
import statsmodels.api as sm                                 # OLS regression
from sklearn.model_selection import train_test_split         # Data splitting
from sklearn.metrics import r2_score, mean_squared_error     # Evaluation metrics
import matplotlib.pyplot as plt                              # Plotting
Dataset
The dataset is "Maintenance costs, ML and big data", hosted on Kaggle.
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load a panel dataset covering 82 factories from 2019 to 2023, containing cost and operational metrics. Initial inspection (.info(), .describe()) reveals numeric features (production volume, equipment age, etc.) and a categorical industry sector.
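# Block 1: Load dataset
# Panel data from 82 industrial organizations over 2019–2023
# Source: https://www.kaggle.com/datasets/mariojesenia/maintenance-costs-ml-and-big-data/download
# Kaggle downloads require a login, so pd.read_csv cannot read this URL directly.
# Download the CSV manually and point csv_path at it (the file name below is a placeholder).
csv_path = "maintenance_costs.csv"
df = pd.read_csv(csv_path)
# Inspect structure
print(df.head())
df.info()  # prints its report directly and returns None
print(df.describe())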
Dataset includes columns such as Factory_ID, Year, Industry_Sector, Production_Volume, Equipment_Age, Num_Employees, R&D_Spend, and Maintenance_Cost.
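Before preprocessing, it can help to confirm that the expected columns are present and to count missing values per column. The sketch below is illustrative only: the helper check_schema and the two-row toy frame are hypothetical, and the column names are taken from the list above, so they may need adjusting to match the downloaded file.

```python
import pandas as pd

# Column names as listed in the article; adjust if the downloaded file differs.
EXPECTED_COLUMNS = [
    "Factory_ID", "Year", "Industry_Sector", "Production_Volume",
    "Equipment_Age", "Num_Employees", "R&D_Spend", "Maintenance_Cost",
]

def check_schema(frame, expected=EXPECTED_COLUMNS):
    """Return (missing column names, per-column null counts) for a DataFrame."""
    missing = [col for col in expected if col not in frame.columns]
    null_counts = frame.isna().sum()
    return missing, null_counts

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Factory_ID": [1, 2],
    "Year": [2019, 2019],
    "Industry_Sector": ["Textiles", "Chemicals"],
    "Production_Volume": [1200.0, None],
    "Maintenance_Cost": [35.0, 48.5],
})

missing_cols, nulls = check_schema(toy)
print("Missing columns:", missing_cols)
print("Columns with nulls:")
print(nulls[nulls > 0])
```

On the real data, the same call would be check_schema(df) right after loading.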
Data Preprocessing
We one‑hot encode Industry_Sector, drop rows with missing values, and remove the identifier columns (Factory_ID, Year). The remaining predictors (X) and the target (y) are split into training and test sets for unbiased evaluation.
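# Block 2: Encode categoricals and clean data
# dtype=float keeps the dummy columns numeric; recent pandas versions default to bool,
# which statsmodels OLS can reject when mixed with float columns
df_enc = pd.get_dummies(df, columns=["Industry_Sector"], drop_first=True, dtype=float)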
# Drop rows with missing values (if any)
df_enc = df_enc.dropna()
# Define predictors and target
X = df_enc.drop(["Factory_ID", "Year", "Maintenance_Cost"], axis=1)
y = df_enc["Maintenance_Cost"]
# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
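Because each factory appears once per year, a purely random row split places observations from the same factory in both sets. If the goal is to test on factories the model has never seen, one option is a group-aware split. The sketch below uses scikit-learn's GroupShuffleSplit on a small synthetic panel; the data are made up, and only the Factory_ID and Year column names come from the dataset description.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic panel: 10 factories x 5 years (illustrative values only).
rng = np.random.default_rng(0)
panel = pd.DataFrame({
    "Factory_ID": np.repeat(np.arange(10), 5),
    "Year": np.tile(np.arange(2019, 2024), 10),
    "Equipment_Age": rng.uniform(1, 20, size=50),
    "Maintenance_Cost": rng.uniform(10, 100, size=50),
})

X_demo = panel[["Equipment_Age"]]
y_demo = panel["Maintenance_Cost"]
groups = panel["Factory_ID"]

# Hold out roughly 20% of the factories (not 20% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=groups))

train_factories = set(groups.iloc[train_idx])
test_factories = set(groups.iloc[test_idx])
print("Factories in both sets:", train_factories & test_factories)
```

With this kind of split, every row of a held-out factory lands in the test set, so the test metrics describe performance on new factories rather than on new years of known factories.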
Stepwise Regression Function
The stepwise_selection function alternates between a forward step, which adds the excluded variable with the lowest p‑value if that value is below 0.01, and a backward step, which drops the included variable with the highest p‑value if that value is above 0.05. It stops when a full pass neither adds nor drops a variable. The result is a greedy selection, not a guaranteed optimal subset.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=[],
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    included = list(initial_list)
    while True:
        changed = False
        # Forward step
        excluded = list(set(X.columns) - set(included))
        pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best_pval = pvals.min()
        if best_pval < threshold_in:
            best_var = pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.6f}")
        # Backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals_in = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals_in.max()
        if worst_pval > threshold_out:
            worst_var = pvals_in.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.6f}")
        if not changed:
            break
    return included
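Each step of the procedure compares coefficient p-values against the two thresholds. As a rough illustration of where those numbers come from, the sketch below computes the two-sided t-test p-values of an OLS fit with NumPy and SciPy on synthetic data. The helper ols_pvalues and the variables signal and noise are hypothetical; in the article's code, statsmodels supplies the same quantities through model.pvalues.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Two-sided t-test p-values for OLS coefficients (intercept first)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])        # prepend intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = n - design.shape[1]                        # residual degrees of freedom
    sigma2 = resid @ resid / dof                     # unbiased error variance
    cov = sigma2 * np.linalg.inv(design.T @ design)  # coefficient covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)

rng = np.random.default_rng(1)
signal = rng.normal(size=200)    # truly related to the target
noise = rng.normal(size=200)     # unrelated to the target
target = 5.0 + 3.0 * signal + rng.normal(size=200)

pvals = ols_pvalues(np.column_stack([signal, noise]), target)
for name, p in zip(["const", "signal", "noise"], pvals):
    print(f"{name:>6}: p = {p:.4g}")
```

Here the p-value for signal falls far below the 0.01 entry threshold, so a forward step would admit it. Whether an unrelated variable such as noise slips in depends on the sample, which is one reason stepwise p-values are best read as a screening device.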
Model Building & Evaluation
- Model Fitting: We fit an Ordinary Least Squares regression on the selected features. The .summary() displays coefficients, p‑values, adjusted R², and diagnostic statistics, offering insight into feature significance.
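- Evaluation: Predictions on the test set yield R² (variance explained) and RMSE (prediction error), quantifying how well the model generalises to held‑out factory‑year observations. Because the split is random over rows, the same factory can appear in both the training and test sets, so these metrics do not measure performance on entirely unseen factories.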
# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)
# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
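# Predict on test set
# has_constant="add" forces the intercept column even if a test column happens to be constant
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")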
y_pred = model.predict(X_test_sel)
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
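For reference, both metrics can be computed directly from their definitions. The sketch below uses a hypothetical helper r2_and_rmse and four made-up values so that the arithmetic is easy to follow: the squared residuals sum to 1.5 and the total sum of squares around the mean is 20.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / y_true.size)
    return r2, rmse

r2, rmse = r2_and_rmse([3, 5, 7, 9], [2.5, 5.5, 7, 10])
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")  # R2 = 0.925, RMSE = 0.612
```

A model that always predicts the mean of the test targets scores R² = 0, so a negative test R² means the fitted model does worse than that baseline. RMSE is in the same units as Maintenance_Cost.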
Residual Diagnostics
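A scatter plot of residuals against predicted costs helps reveal heteroscedasticity (a residual spread that changes with the prediction) or systematic bias (residuals that drift away from zero). Either pattern signals that an OLS assumption is violated.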
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Maintenance Cost")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
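The plot is a visual check; a numeric complement is a heteroscedasticity test. The sketch below hand-rolls the studentized Breusch-Pagan (Koenker) statistic, n times the R² from regressing squared residuals on the predictors, using NumPy and SciPy on synthetic data whose noise grows with x by construction. The helper koenker_bp_test and all data are hypothetical. statsmodels also ships het_breuschpagan in statsmodels.stats.diagnostic, which may be the more convenient choice alongside the article's code.

```python
import numpy as np
from scipy import stats

def koenker_bp_test(resid, exog):
    """Studentized Breusch-Pagan (Koenker) test.

    Regresses squared residuals on the predictors (plus an intercept) and
    returns (LM statistic, p-value), where LM = n * R^2 of that auxiliary fit.
    """
    resid = np.asarray(resid, dtype=float)
    exog = np.asarray(exog, dtype=float).reshape(len(resid), -1)
    n, k = exog.shape
    design = np.column_stack([np.ones(n), exog])
    sq = resid ** 2
    beta, *_ = np.linalg.lstsq(design, sq, rcond=None)
    fitted = design @ beta
    r2_aux = 1.0 - np.sum((sq - fitted) ** 2) / np.sum((sq - sq.mean()) ** 2)
    lm = n * r2_aux
    return lm, stats.chi2.sf(lm, df=k)  # chi-square with k degrees of freedom

# Synthetic example: the noise spread grows with x (heteroscedastic by design).
rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=500)
y = 2.0 + 3.0 * x + rng.normal(size=500) * x
main_design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(main_design, y, rcond=None)
resid_demo = y - main_design @ coef

lm_stat, p_value = koenker_bp_test(resid_demo, x)
print(f"LM = {lm_stat:.1f}, p-value = {p_value:.2e}")
```

A small p-value suggests the residual variance depends on the predictors. In that case the coefficient estimates remain unbiased, but the standard errors and p-values used by the stepwise procedure become less trustworthy.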
Summary
Stepwise regression on the maintenance‑cost panel data narrows the predictors to those that pass the significance thresholds, which may include production volume, equipment age, workforce size, and specific industry sectors, while pruning redundant factors. The final linear model favors interpretability (few, significant predictors); its predictive performance should be judged by the test R² and RMSE printed in Block 4 and by the residual plot. Where that performance holds up, factory managers can use the selected drivers for budgeting, targeted maintenance scheduling, and more cost‑effective operations.