Maintenance Fee Prediction using Linear Regression in ML
Facility‑management teams must forecast the expected cost of upcoming maintenance work orders so they can set realistic budgets, negotiate vendor contracts, and prioritise preventive jobs. Using historical work‑order records from twelve North American university campuses, we will build a linear‑regression baseline that predicts a work order’s total maintenance fee (USD) from information known at the time the ticket is raised: labour hours, job type (planned vs unplanned), system, subsystem, building age, and current local weather.
A transparent linear model exposes first-order cost drivers—labour time, HVAC versus electrical systems, and emergency requests—and supplies a factual benchmark before experimenting with regularised or tree-based regressors.
Libraries Required
- pandas # tabular wrangling
- numpy # numerical helpers
- matplotlib.pyplot # quick EDA plots
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained pipeline
Dataset Link
Facility Management Unified Classification Database (FMUCD)
Step-by-Step Code Implementation
Why linear regression? For small-scope maintenance jobs, the cost scales nearly linearly with labour hours, plus additive surcharges for emergency or specialised systems. A straight-line model quantifies each contributor in clear dollar terms.
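As a rough sketch of that additive structure (the symbols below are illustrative notation chosen here, not taken from the dataset documentation), the model can be written as:

```latex
\widehat{\text{cost}} \;=\; \beta_0 \;+\; \beta_h\, h \;+\; \sum_{j} \beta_j\, x_j
```

Here $h$ is the labour hours, each $x_j$ is either a 0/1 indicator for a category level (request type, system, subsystem) or one of the remaining numeric predictors, and $\beta_0$ is the intercept. Each $\beta$ is a dollar amount, which is what makes the fitted model easy to read.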
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load & glimpse the data
(Download the CSV pack, then point the path below to the work_order_cost.csv file.)
df = pd.read_csv("work_order_cost.csv")
print(df.head())
Typical columns
| column | description (example) |
| --- | --- |
| wo_cost_usd | label – total fee of the job |
| labour_hours | technician hours booked |
| request_type | Planned / Unplanned |
| system | HVAC, Electrical, Plumbing … |
| subsystem | Chiller, Lighting, Pumps … |
| building_age | years since construction |
| outside_temp_c | °C at start time |
| humidity_pct | % RH |
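Before going further, it can help to confirm that the loaded file actually contains these columns. The helper below is a minimal sketch: `missing_columns` and `EXPECTED_COLS` are hypothetical names introduced here, and the demo frame is made-up data used only to show the check.

```python
import pandas as pd

# Column names follow the "Typical columns" table; adjust if your export differs.
EXPECTED_COLS = ['wo_cost_usd', 'labour_hours', 'request_type', 'system',
                 'subsystem', 'building_age', 'outside_temp_c', 'humidity_pct']

def missing_columns(frame, expected=EXPECTED_COLS):
    """Return the expected column names that are absent from the frame."""
    return [c for c in expected if c not in frame.columns]

# Tiny made-up frame, for demonstration only
demo = pd.DataFrame({'wo_cost_usd': [120.0],
                     'labour_hours': [1.5],
                     'system': ['HVAC']})

absent = missing_columns(demo)
print("Missing columns:", absent)
```

On the real data you would call `missing_columns(df)` and stop if the returned list is not empty.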
3. Minimal cleaning
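Drop rows that are missing the label or any predictor, because LinearRegression cannot fit on NaN values. The numeric columns are standardised later, inside the pipeline: scaling puts hours, age, and weather on comparable variance, so their coefficients read as $ per 1 σ change, which is handy for management reports.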
core = ['wo_cost_usd', 'labour_hours', 'request_type',
'system', 'subsystem', 'building_age',
'outside_temp_c', 'humidity_pct']
df = df.dropna(subset=core).copy()
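To see what `dropna(subset=...)` does and does not remove, here is a small sketch on made-up rows. The `notes` column is a hypothetical extra field, included only to show that gaps outside the subset are tolerated.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'wo_cost_usd':  [250.0, np.nan, 410.0],
    'labour_hours': [2.0, 3.5, np.nan],
    'notes':        [None, 'leak', 'filter swap'],  # not in the subset
})

core_demo = ['wo_cost_usd', 'labour_hours']

# A row is dropped only if one of the subset columns is missing
clean = toy.dropna(subset=core_demo).copy()
rows_dropped = len(toy) - len(clean)

print(clean)
print("Rows dropped:", rows_dropped)
```

Reporting how many rows were dropped is a cheap sanity check: if a large share of the data disappears, imputation may be a better choice than deletion.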
4. Define predictors & label
num_cols = ['labour_hours', 'building_age',
'outside_temp_c', 'humidity_pct']
cat_cols = ['request_type', 'system', 'subsystem']
target = 'wo_cost_usd'
X = df[num_cols + cat_cols]
y = df[target]
5. Pre‑processing & model pipeline
One-hot encoding treats categorical flags as pure categories, avoiding a false numeric order between, for example, HVAC and Electrical jobs.
preproc = ColumnTransformer([
('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline(steps=[
('prep', preproc),
('model', linreg)
])
6. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
7. Evaluation
Performance metrics – R² captures the share of cost variance explained; MAE tells planners the typical absolute dollar error—e.g., “±$180 per work order.”
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f}")
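To make the two metrics concrete, the sketch below recomputes them by hand on four made-up values and checks the result against scikit-learn. The numbers are illustrative and have nothing to do with the real dataset.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Made-up actual and predicted fees (USD), for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

# MAE: average absolute dollar error
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual: {mae_manual:.2f}  sklearn: {mean_absolute_error(y_true, y_hat):.2f}")
print(f"R²  manual: {r2_manual:.3f}  sklearn: {r2_score(y_true, y_hat):.3f}")
```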
8. Inspect cost drivers
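The coefficient table highlights cost levers. Categorical coefficients are relative dollar offsets between category levels, while numeric coefficients are dollars per one-standard-deviation change because of the scaling step. If Unplanned jobs carry the heaviest positive weight, shifting work toward preventive maintenance is a plausible place to look for savings.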
# recover encoded feature names
ohe_feats = pipe.named_steps['prep']\
.named_transformers_['cat']\
.get_feature_names_out(cat_cols)
all_feats = list(ohe_feats) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
.sort_values())
print("\nCost‑reducing factors (negative coefficients):")
print(coef.head(8)) # most negative
print("\nCost‑increasing factors (positive coefficients):")
print(coef.tail(8)) # most positive
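Because the numeric columns pass through `StandardScaler`, their coefficients are in dollars per standard deviation. Dividing by the scaler's `scale_` converts them back to dollars per original unit. The sketch below shows this on made-up data where the true rate is set to $80 per hour. `toy_pipe` and its step names are hypothetical and simpler than the pipeline in this article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: cost is exactly $50 plus $80 per labour hour
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_X = pd.DataFrame({'labour_hours': hours})
toy_y = 50.0 + 80.0 * hours

toy_pipe = Pipeline([('scale', StandardScaler()),
                     ('model', LinearRegression())])
toy_pipe.fit(toy_X, toy_y)

coef_per_sigma = toy_pipe.named_steps['model'].coef_[0]  # $ per 1 standard deviation
sigma = toy_pipe.named_steps['scale'].scale_[0]          # std of labour_hours
coef_per_hour = coef_per_sigma / sigma                   # $ per hour

print(f"$ per sigma: {coef_per_sigma:.2f}")
print(f"$ per hour : {coef_per_hour:.2f}")
```

In the article's pipeline the fitted scaler should be reachable as `pipe.named_steps['prep'].named_transformers_['num']`, so the same division can be applied to the entries of `coef` that belong to `num_cols`.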
9. Persist the trained pipeline
Joblib persistence packages the preprocessing steps and the fitted coefficients into a single file, so tomorrow's CMMS (computerised maintenance-management system) can load the .pkl file, feed it new work-order tickets, and output a cost estimate in milliseconds.
joblib.dump(pipe, "maintenance_fee_linreg.pkl")
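The loading side might look like the sketch below. It is self-contained, so it fits a tiny stand-in pipeline on made-up rows and writes it to a temporary file. The names `toy_pipe` and `ticket` and the file name `toy_linreg.pkl` are hypothetical. In practice you would load `maintenance_fee_linreg.pkl` and pass a one-row DataFrame that has all the columns used in training.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows, for illustration only
toy = pd.DataFrame({
    'labour_hours': [1.0, 2.0, 4.0, 3.0],
    'request_type': ['Planned', 'Unplanned', 'Unplanned', 'Planned'],
    'wo_cost_usd':  [130.0, 260.0, 420.0, 290.0],
})

prep = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['request_type']),
    ('num', StandardScaler(), ['labour_hours']),
])
toy_pipe = Pipeline([('prep', prep), ('model', LinearRegression())])
toy_pipe.fit(toy[['labour_hours', 'request_type']], toy['wo_cost_usd'])

# Save to a temporary location, then load it back as a consumer would
path = os.path.join(tempfile.mkdtemp(), 'toy_linreg.pkl')
joblib.dump(toy_pipe, path)
loaded = joblib.load(path)

# A new ticket must use the same column names as the training frame
ticket = pd.DataFrame([{'labour_hours': 2.5, 'request_type': 'Unplanned'}])
estimate = loaded.predict(ticket)[0]
print(f"Estimated fee: ${estimate:,.0f}")
```

Pickle-based files should only be loaded from trusted sources, and the loading environment should use a scikit-learn version compatible with the one that saved the file.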
Summary
In fewer than 90 lines of Python code, we transformed raw work-order logs into an explainable maintenance-fee predictor. The baseline model:
- Provides instant cost estimates for budgeting and approval workflows.
- Reveals transparent marginal effects: the extra dollar impact of additional labour time, an unplanned request, or an HVAC task.
Keep this interpretable baseline as your benchmark; when you introduce regularised regressors or ensemble models, you’ll know exactly how much added accuracy justifies the extra complexity.