Factory Maintenance Cost Prediction using Stepwise Regression in ML

Manufacturing firms incur equipment maintenance costs, but blanket preventive schedules can waste resources or invite breakdowns when maintenance is overdue. In this project, we’ll predict annual maintenance cost for factories using operational and financial indicators—such as production volume, equipment age, workforce size, and R&D spend—by fitting a linear model with stepwise feature selection. The resulting parsimonious regression will highlight the most influential cost drivers, enabling better budgeting and targeted upkeep strategies.

Libraries Required

import pandas as pd               # Data handling  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Data splitting  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Plotting

Dataset

Maintenance costs, ML and big data

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load a panel dataset covering 82 factories from 2019 to 2023, containing cost and operational metrics. Initial inspection (.info(), .describe()) reveals numeric features (production volume, equipment age, etc.) and a categorical industry sector.

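# Block 1: Load dataset
# Panel data from 82 industrial organizations over 2019–2023
# Source: https://www.kaggle.com/datasets/mariojesenia/maintenance-costs-ml-and-big-data/download
# Kaggle downloads require a logged-in session, so download the file manually first
# and read the local copy (the filename below is a placeholder).
csv_path = "maintenance_costs.csv"
df = pd.read_csv(csv_path)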

# Inspect structure
print(df.head())
df.info()  # info() prints its report itself and returns None
print(df.describe())

Dataset includes columns such as Factory_ID, Year, Industry_Sector, Production_Volume, Equipment_Age, Num_Employees, R&D_Spend, and Maintenance_Cost.

Data Preprocessing

We one‑hot encode Industry_Sector, drop rows with missing values, and remove the identifier columns (Factory_ID, Year). The remaining predictors (X) and the target (y) are split into training and test sets for unbiased evaluation.

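# Block 2: Encode categoricals and clean data
# dtype=float keeps the dummy columns numeric; recent pandas versions default to bool,
# which statsmodels cannot combine with float columns
df_enc = pd.get_dummies(df, columns=["Industry_Sector"], drop_first=True, dtype=float)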

# Drop rows with missing values (if any)
df_enc = df_enc.dropna()

# Define predictors and target
X = df_enc.drop(["Factory_ID", "Year", "Maintenance_Cost"], axis=1)
y = df_enc["Maintenance_Cost"]

# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
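
Because the data is a panel, a random row-level split puts different years of the same factory in both the training and the test set. If the goal is to estimate performance on factories the model has never seen, one option is to split by factory instead. The following is a minimal sketch using scikit-learn's GroupShuffleSplit on a small synthetic stand-in for the panel; the demo frame and its values are invented for illustration and are not the Kaggle data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in for the panel: 10 hypothetical factories x 5 years
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Factory_ID": np.repeat(np.arange(10), 5),
    "Year": np.tile(np.arange(2019, 2024), 10),
    "Production_Volume": rng.normal(1000, 100, 50),
})
demo["Maintenance_Cost"] = 0.5 * demo["Production_Volume"] + rng.normal(0, 10, 50)

# Hold out 20% of the factories (not 20% of the rows)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(demo, groups=demo["Factory_ID"]))

train_ids = set(demo.iloc[train_idx]["Factory_ID"])
test_ids = set(demo.iloc[test_idx]["Factory_ID"])
print("Factories appearing in both sets:", train_ids & test_ids)
```

With the real data, the same indices could be applied to X and y before Factory_ID is dropped from the predictors.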

Stepwise Regression Function

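The stepwise_selection function alternates between two steps until neither one changes the model. The forward step adds the excluded variable with the lowest p‑value, provided it is below 0.01. The backward step drops the included variable with the highest p‑value, provided it is above 0.05. Keeping the entry threshold below the removal threshold prevents a variable from being added and dropped in an endless cycle. The result is a locally stable subset, which is not guaranteed to be the best subset overall.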

# Block 3: Forward–backward stepwise selection
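def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting features; None avoids a mutable default argument
    included = list(initial_list) if initial_list is not None else []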
    while True:
        changed = False

        # Forward step
        excluded = list(set(X.columns) - set(included))
        pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best_pval = pvals.min()
        if best_pval < threshold_in:
            best_var = pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.6f}")

        # Backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals_in = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals_in.max()
        if worst_pval > threshold_out:
            worst_var = pvals_in.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.6f}")

        if not changed:
            break

    return included

Model Building & Evaluation

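Model Fitting: We fit an Ordinary Least Squares regression on the selected features. The .summary() output displays coefficients, p‑values, adjusted R², and diagnostic statistics. Because the features were chosen by p‑value on the same training data, the reported p‑values tend to be optimistic and are best read as descriptive rather than as formal tests.
Evaluation: Predictions on the test set yield R² (share of variance explained) and RMSE (typical prediction error, in the units of Maintenance_Cost). The random row-level split places different years of the same factory in both sets, so these metrics describe performance on held-out factory-year observations rather than on entirely unseen factories.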

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

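# Predict on test set
# has_constant="add" forces the intercept column even if a test column happens to be constant
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")
y_pred = model.predict(X_test_sel)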

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
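
For readers who want to see what the two metrics measure, here is a small sketch that computes R² and RMSE by hand and compares them with the scikit-learn results. The four cost values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical actual and predicted maintenance costs
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE = square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print("Manual R²:", r2_manual, "| sklearn R²:", r2_score(y_true, y_hat))
print("Manual RMSE:", rmse_manual,
      "| sklearn RMSE:", np.sqrt(mean_squared_error(y_true, y_hat)))
```

Here every prediction is off by 10, so the RMSE is 10 in the same units as the target, while R² compares that error with the spread of the actual values.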

Residual Diagnostics

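A scatter plot of residuals against predicted costs helps reveal heteroscedasticity (a funnel-shaped spread) or systematic bias (a curved or offset pattern). A structureless band around zero is consistent with the OLS assumptions of linearity and constant variance, though it does not prove them.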

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Maintenance Cost")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()

Summary

Stepwise regression on the maintenance‑cost panel data narrows the candidate predictors (production volume, equipment age, workforce size, R&D spend, and industry sector) to those that remain statistically significant, pruning redundant factors. The final linear model is easy to interpret because it keeps only a few predictors, and its test‑set R² and RMSE indicate how well it predicts held‑out data. Factory managers can use the retained cost drivers for budgeting, targeted maintenance scheduling, and more cost‑effective operations.
