Facility Maintenance Cost Trend Prediction with Polynomial Regression in ML
Facility managers and corporate real-estate teams need to forecast annual maintenance costs (USD) for a portfolio of industrial and commercial buildings before budgeting cycles close. The forecast draws on early indicators such as facility age, square footage, number of critical systems (HVAC, elevators), past preventive-maintenance spend, and regional labour-rate indices.
Maintenance costs grow nonlinearly with facility age (ageing systems need disproportionately more upkeep), interact with facility size (larger footprints amplify unit costs), and are tempered by past preventive investments. A simple linear model cannot represent this curvature; an unrestricted high-degree polynomial overfits noise. By applying Polynomial Regression to thoughtfully engineered features with Ridge (ℓ²) regularisation, we can capture smooth cost-trend dynamics and deliver reliable, interpretable forecasts for proactive budgeting.
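To make the underfitting argument concrete, here is a minimal sketch on synthetic data. The cost curve, noise level, and the names `age`, `cost`, and `cv_r2` are assumptions made up for illustration; they do not come from the tutorial's dataset. It compares the cross-validated R² of a degree-1 and a degree-2 Ridge model when the true relationship is convex in age.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
age = rng.uniform(0, 40, size=200)                        # hypothetical facility ages (years)
cost = 20_000 + 60 * age**2 + rng.normal(0, 5_000, 200)   # made-up convex cost curve plus noise
X_demo = age.reshape(-1, 1)

def cv_r2(degree):
    """Mean 5-fold R^2 of a scaled polynomial Ridge model of the given degree."""
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree, include_bias=False),
        Ridge(alpha=1.0),
    )
    return cross_val_score(model, X_demo, cost, cv=5, scoring="r2").mean()

scores = {d: cv_r2(d) for d in (1, 2)}
for d, s in scores.items():
    print(f"degree {d}: mean CV R^2 = {s:.3f}")
```

On this toy curve the quadratic model should score higher because the straight line cannot follow the bend; real portfolios will behave differently, which is why the degree is tuned by cross-validation later on.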
Libraries Required
import pandas as pd                # data loading & handling
import numpy as np                 # numerical operations
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # enhanced visualization
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
Dataset
Predictive Maintenance Dataset
Step-by-Step Code Implementation
Load Data & Inspect
import pandas as pd
# Load and preview
df = pd.read_csv("data/predictive_maintenance.csv")
df = df.rename(columns={
    'age': 'Facility_Age',
    'usage': 'Annual_Usage_Hours',
    'cost': 'Last_Maint_Cost_USD'
})
df.head()[[
    'Facility_Age', 'Square_Footage', 'Num_Critical_Systems',
    'Annual_Usage_Hours', 'Last_Maint_Cost_USD', 'Labor_Rate_Index'
]]
Feature Engineering & Target
Feature normalisation: StandardScaler zero‑means and unit‑scales all inputs so the ℓ² penalty treats them uniformly.
# Target: next year's maintenance cost (USD)
# Simulate next-year cost as a placeholder for supervised learning
df['Next_Year_Cost_USD'] = df['Last_Maint_Cost_USD'] * (
    1 + 0.02 * df['Facility_Age'] / 10  # cost grows ~2% per decade of age
    + 0.0005 * df['Square_Footage']     # size effect: +0.05% per square foot
)
# Define features and target
X = df[[
    'Facility_Age',
    'Square_Footage',
    'Num_Critical_Systems',
    'Annual_Usage_Hours',
    'Last_Maint_Cost_USD',
    'Labor_Rate_Index'
]]
y = df['Next_Year_Cost_USD']
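The normalisation note above can be checked on a tiny example. The sketch below uses a hypothetical two-column frame (`toy`, with invented values) to show that StandardScaler brings columns on very different scales to zero mean and unit variance, which is what lets a single α penalise them evenly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values, chosen only to show the scale mismatch between columns
toy = pd.DataFrame({
    "Facility_Age": [5.0, 12.0, 30.0, 45.0],
    "Square_Footage": [20_000.0, 85_000.0, 150_000.0, 400_000.0],
})

scaled = StandardScaler().fit_transform(toy)

# Each column now has mean ~0 and (population) standard deviation ~1
print("means:", scaled.mean(axis=0).round(6))
print("stds :", scaled.std(axis=0).round(6))
```

Placing the scaler inside the pipeline, as the next section does, means it is refitted on the training portion of each cross-validation fold, so no statistics leak from the validation data.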
Build a Polynomial Regression Pipeline
Polynomial expansion: PolynomialFeatures adds squares and interaction terms (e.g., Facility_Age², Last_Maint_Cost_USD×Annual_Usage_Hours) to capture nonlinear ageing and usage effects on cost.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge(random_state=42))
])
Train/Test Split & Hyperparameter Search
- Hyperparameter tuning: grid‑searches polynomial degree (1–3) and α (10⁻³…10³) via 5‑fold CV, optimising for lowest RMSE on held‑out data.
- Ridge regression: applies an ℓ² penalty (alpha) to shrink noisy high‑order coefficients, mitigating overfitting in the expanded feature space.
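The shrinkage effect of the ℓ² penalty can be seen in isolation. The sketch below fits Ridge on synthetic data (the design matrix, `true_w`, and noise level are invented for illustration) across the same α range and reports the length of the coefficient vector at each setting.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 5))                 # synthetic inputs, already on unit scale
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])     # made-up ground-truth weights
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=60)

alphas = np.logspace(-3, 3, 7)                    # same range as the grid search
norms = []
for a in alphas:
    w = Ridge(alpha=a).fit(X_demo, y_demo).coef_
    norms.append(np.linalg.norm(w))
    print(f"alpha={a:8.3f}  ||w||_2={norms[-1]:.3f}")
```

Small α leaves the coefficients close to the least-squares fit, while large α pulls them towards zero; cross-validation picks the point between these extremes that predicts best.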
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best polynomial degree:", gs.best_params_['poly__degree'])
print("Best Ridge α :", gs.best_params_['ridge__alpha'])
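Beyond the single best setting, it can help to look at how every degree and α combination scored. The sketch below runs the same kind of search on a small synthetic problem (`X_toy`, `y_toy`, `toy_pipe`, and `toy_gs` are hypothetical stand-ins for the real data and search object) and tabulates `cv_results_`. Because the scorer is negated RMSE, the sign is flipped to recover RMSE.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with a quadratic effect in the first feature
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 40, size=(150, 2))
y_toy = 500 + 3 * X_toy[:, 0] ** 2 + 10 * X_toy[:, 1] + rng.normal(0, 50, 150)

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])
toy_gs = GridSearchCV(
    toy_pipe,
    {"poly__degree": [1, 2, 3], "ridge__alpha": np.logspace(-3, 3, 7)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
toy_gs.fit(X_toy, y_toy)

# One row per grid point; scores are negative RMSE, so negate them
results = pd.DataFrame(toy_gs.cv_results_)
results["cv_rmse"] = -results["mean_test_score"]
top = results[["param_poly__degree", "param_ridge__alpha", "cv_rmse"]].sort_values("cv_rmse")
print(top.head())
```

With the real search object, replacing `toy_gs` by `gs` gives the same table for the maintenance data.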
Evaluate Model
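from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
y_pred = gs.predict(X_test)
# Take the square root of the MSE directly; the squared=False argument
# has been removed from recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))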
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.0f}")
print(f"Test R² : {r2:.3f}")
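For reference, both metrics can be reproduced by hand. The sketch below uses four invented cost values (`y_true`, `y_hat`) and checks the manual formulas, RMSE = sqrt(mean(residual²)) and R² = 1 − SS_res / SS_tot, against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted annual costs (USD)
y_true = np.array([100_000.0, 150_000.0, 200_000.0, 250_000.0])
y_hat = np.array([110_000.0, 140_000.0, 195_000.0, 265_000.0])

residuals = y_true - y_hat

# RMSE: typical size of an error, in the same units as the target (USD)
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: share of the variance around the mean that the model accounts for
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"manual RMSE: ${rmse_manual:,.0f}")
print(f"manual R^2 : {r2_manual:.3f}")
```

RMSE is the easier of the two to communicate in a budgeting context, since it reads directly as a dollar error per facility.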
Inspect Key Polynomial Coefficients
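Interpretability: inspecting the largest absolute coefficients shows which nonlinear or interaction terms (e.g., square footage × labour index) carry the most weight in the fitted model, which helps direct attention to where preventive upkeep may matter most. Because the inputs are standardised before expansion, the coefficients are expressed in standardised units rather than dollars per raw unit.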
# Retrieve expanded feature names
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=X.columns)
coefs = gs.best_estimator_.named_steps['ridge'].coef_
import pandas as pd
imp = (
    pd.Series(coefs, index=feat_names)
    .abs().sort_values(ascending=False).head(10)
)
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
imp.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Maintenance Cost")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
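Taking absolute values hides whether a term pushes the predicted cost up or down. One possible variation, sketched here with made-up coefficient values (`coef_demo` is hypothetical, not output from the fitted model), ranks terms by magnitude while keeping the sign.

```python
import pandas as pd

# Hypothetical standardised coefficients, for illustration only
coef_demo = pd.Series({
    "Facility_Age^2": 0.8,
    "Last_Maint_Cost_USD": 2.1,
    "Facility_Age Square_Footage": -1.3,
    "Labor_Rate_Index": 0.2,
})

# Order by absolute size but keep the signed values
ranked = coef_demo.sort_values(key=lambda s: s.abs(), ascending=False)

for term, value in ranked.items():
    direction = "raises" if value > 0 else "lowers"
    print(f"{term:30s} {value:+.2f}  ({direction} predicted cost)")
```

With the real model, `pd.Series(coefs, index=feat_names)` can take the place of `coef_demo`.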
Summary
By integrating polynomial feature engineering with Ridge regularisation in a concise pipeline, this workflow:
1. Captures nonlinear cost growth due to facility ageing, scale, and usage intensity.
2. Balances model complexity by tuning the polynomial degree and α through cross-validation, which limits overfitting to outlier facilities.
3. Provides clear, actionable insights: key polynomial features highlight where preventive maintenance or capital reinvestment will most reduce future cost spikes, enabling data‑driven budgeting and asset‑management decisions.