Car Repair Cost Prediction using Bayesian Regression in ML
Fleet managers and independent repair shops need to forecast a vehicle's repair cost before mechanics start work, using early diagnostic indicators such as vehicle age, odometer reading, engine hours, and the cost of the most recent maintenance. Repair costs tend to grow nonlinearly with usage and age (e.g., accelerated wear-out) and vary with part-price volatility and labour rates. A model that returns only a single point estimate hides this uncertainty and risks under- or overestimating the bill. Bayesian Regression yields both a best estimate of repair cost and a credible interval that quantifies the uncertainty, which supports data-driven quoting, parts procurement, and scheduling.
Libraries Required
import pandas as pd                  # data loading & handling
import numpy as np                   # numerical operations
import matplotlib.pyplot as plt      # plotting
import seaborn as sns                # visualization
import pymc3 as pm                   # Bayesian modeling
import arviz as az                   # posterior analysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
Dataset
Motor Vehicle Repair & Towing Dataset
Step-by-Step Code Implementation
Import Libraries & Load Data
import pandas as pd
# Load the repair dataset
df = pd.read_csv("data/motor-vehicle-repair-and-towing-dataset.csv")
# Preview key columns
df[['Vehicle_Age','Mileage','Engine_Hours','Prev_Repair_Cost','Repair_Cost']].head()
Preprocessing & Train/Test Split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Select predictors and target
X = df[['Vehicle_Age','Mileage','Engine_Hours','Prev_Repair_Cost']].values
y = df['Repair_Cost'].values # USD
# Split data (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Standardize features for stable MCMC sampling
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
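Because the model is fitted on standardised features, each slope is expressed in USD per one standard deviation of its predictor. The following is a minimal NumPy sketch, not part of the original workflow, of how the training-set mean and standard deviation define the scaling and how coefficients can be mapped back to raw units. The toy data, the coefficient values, and the helper names `fit_standardizer` and `to_original_units` are hypothetical.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-column mean and standard deviation of the training data."""
    mean = X_train.mean(axis=0)
    scale = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
    return mean, scale

def to_original_units(alpha_std, beta_std, mean, scale):
    """Convert an intercept and slopes fitted on standardised features
    into an intercept and slopes on the raw feature scale."""
    beta_raw = beta_std / scale
    alpha_raw = alpha_std - np.sum(beta_std * mean / scale)
    return alpha_raw, beta_raw

# Hypothetical toy data: columns are age (years) and mileage (miles)
X_toy = np.array([[2.0, 20_000.0], [5.0, 60_000.0],
                  [9.0, 120_000.0], [12.0, 150_000.0]])
mean, scale = fit_standardizer(X_toy)
X_toy_s = (X_toy - mean) / scale

# Hypothetical coefficients on the standardised scale (USD per 1 SD)
alpha_std, beta_std = 800.0, np.array([120.0, 300.0])
alpha_raw, beta_raw = to_original_units(alpha_std, beta_std, mean, scale)

# Both parameterisations give the same predictions
pred_std = alpha_std + X_toy_s @ beta_std
pred_raw = alpha_raw + X_toy @ beta_raw
```

With the fitted `scaler` from this section, `scaler.mean_` and `scaler.scale_` play the roles of `mean` and `scale`, so a posterior slope for mileage can be read as USD per mile after dividing by the corresponding entry of `scaler.scale_`.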
Define & Fit Bayesian Regression Model
Priors:
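- α ∼ Normal(0, 100): a prior for the intercept, which with standardised predictors is the expected cost of an average vehicle. The target is in unscaled USD, so this prior is broad only if typical costs are within a few hundred dollars.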
- β ∼ Normal(0, 50): moderate uncertainty for each standardised predictor’s effect.
- σ ∼ HalfNormal(50): positive noise scale.
Model:
- The linear predictor μ = α + β·X_standardized captures how features drive repair cost.
- Observations y_train are modelled as Normal(μ, σ).
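One way to judge whether these priors suit the data is to simulate repair costs from the priors alone, before any fitting. The sketch below does this in plain NumPy under the priors listed above; the design matrix `X_demo` is synthetic and the function name `prior_predictive` is hypothetical.

```python
import numpy as np

def prior_predictive(X_std, n_draws=1000, seed=0):
    """Simulate repair costs implied by the priors alone (no data used)."""
    rng = np.random.default_rng(seed)
    n_features = X_std.shape[1]
    alpha = rng.normal(0.0, 100.0, size=n_draws)              # α ~ Normal(0, 100)
    beta = rng.normal(0.0, 50.0, size=(n_draws, n_features))  # β ~ Normal(0, 50)
    sigma = np.abs(rng.normal(0.0, 50.0, size=n_draws))       # σ ~ HalfNormal(50)
    mu = alpha[:, None] + beta @ X_std.T                      # shape (n_draws, n_rows)
    return rng.normal(mu, sigma[:, None])

# Synthetic standardised design matrix: 200 vehicles, 4 predictors
X_demo = np.random.default_rng(1).standard_normal((200, 4))
y_prior = prior_predictive(X_demo)

lo, hi = np.percentile(y_prior, [3, 97])
print(f"94% of prior-simulated costs fall between {lo:.0f} and {hi:.0f} USD")
```

The simulated costs are centred on zero and include negative values, which real repair bills cannot have. If typical bills in the dataset are much larger than the simulated range, widening the priors or centring the intercept prior on the training mean is worth considering.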
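Sampling: We draw 2,000 posterior samples per chain after 1,000 tuning (burn-in) steps, which are discarded. Setting target_accept=0.9 makes the NUTS sampler take smaller steps, which reduces divergent transitions; it does not guarantee convergence, so diagnostics such as R-hat should still be checked.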
import pymc3 as pm
with pm.Model() as repair_model:
    # Priors for intercept and weights
    α = pm.Normal("α", mu=0, sigma=100)
    β = pm.Normal("β", mu=0, sigma=50, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=50)
    # Linear predictor
    μ = α + pm.math.dot(X_train_s, β)
    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)
    # Sample posterior
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
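Convergence is usually judged by comparing chains with each other. As an illustration, here is a minimal NumPy sketch of the classic Gelman-Rubin R-hat for a single parameter. It is a simplified stand-in for the rank-normalised split R-hat that ArviZ reports in the `r_hat` column of `az.summary`, and the chains below are synthetic.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic R-hat for one parameter; `chains` has shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    within = chains.var(axis=1, ddof=1).mean()           # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)  # B: scaled variance of chain means
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(var_hat / within)

rng = np.random.default_rng(0)
mixed = rng.normal(0.0, 1.0, size=(4, 2000))            # four chains exploring the same target
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])  # two chains offset by 3 units

rhat_mixed = gelman_rubin(mixed)   # close to 1: chains agree
rhat_stuck = gelman_rubin(stuck)   # well above 1: chains disagree
```

On the fitted model, the same function could be applied to `trace.posterior["α"].values`, which has shape (chain, draw). Values close to 1 suggest the chains agree; values noticeably above 1 suggest running longer or reparameterising.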
Posterior Analysis & Point Predictions
Predictions & Evaluation: Posterior means of α and β yield point forecasts; MAE on held‑out data measures average error.
import arviz as az
# Summarize the posterior distributions
az.summary(trace, round_to=2)
# Posterior predictive sampling
with repair_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Extract posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
# Point predictions on the test set
y_pred = α_post + X_test_s.dot(β_post)
# Evaluate mean absolute error
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
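Plugging posterior means into the linear predictor works here because the predictor is linear in α and β: it gives the same result as predicting once per posterior draw and averaging. The sketch below checks this with synthetic draws standing in for `trace.posterior`; all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_features = 500, 4

# Synthetic posterior draws standing in for trace.posterior["α"] and ["β"]
alpha_draws = rng.normal(900.0, 20.0, size=n_draws)
beta_draws = rng.normal([150.0, 300.0, 80.0, 200.0], 15.0,
                        size=(n_draws, n_features))

# Synthetic standardised test rows
X_new = rng.standard_normal((5, n_features))

# Route 1: plug posterior means into the linear predictor
pred_from_means = alpha_draws.mean() + X_new @ beta_draws.mean(axis=0)

# Route 2: predict once per draw, then average over draws
per_draw = alpha_draws[:, None] + beta_draws @ X_new.T  # shape (n_draws, n_rows)
pred_from_draws = per_draw.mean(axis=0)

# Per-draw predictions also expose the spread that a point forecast hides
pred_sd = per_draw.std(axis=0)

print(np.allclose(pred_from_means, pred_from_draws))
```

The equivalence would not hold for a nonlinear predictor (for example, a log link), where averaging per-draw predictions is the safer route.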
Visualise Predictions with Credible Intervals
Sweeping one feature (mileage) while holding the others fixed at their medians, we plot the posterior predictive mean and a 94% credible band, showing both the expected trend and the uncertainty around it.
import numpy as np
import matplotlib.pyplot as plt
# Vary Mileage; hold other features at median
mileage_grid = np.linspace(X_train_s[:,1].min(), X_train_s[:,1].max(), 100)
grid = np.tile(np.median(X_train_s, axis=0), (100,1))
grid[:,1] = mileage_grid
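# Flatten posterior draws across chains: α (S,), β (S, n_features), σ (S,)
α_s = trace.posterior["α"].values.reshape(-1)
β_s = trace.posterior["β"].values.reshape(-1, grid.shape[1])
σ_s = trace.posterior["σ"].values.reshape(-1)
# Posterior predictive draws on the grid, computed directly because the
# model defines no pm.Data container that pm.set_data could update
rng = np.random.default_rng(42)
mu_grid = α_s[:, None] + β_s @ grid.T           # shape (S, 100)
preds = rng.normal(mu_grid, σ_s[:, None])
mean_pred = preds.mean(axis=0)
# Add a leading chain axis so ArviZ reads the array as (chain, draw, point)
hpd = az.hdi(preds[np.newaxis], hdi_prob=0.94)  # shape (100, 2)
# Convert mileage back to original scale
mileage_orig = scaler.inverse_transform(grid)[:, 1]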
plt.figure(figsize=(8,5))
plt.plot(mileage_orig, mean_pred, label="Posterior mean")
plt.fill_between(mileage_orig, hpd[:, 0], hpd[:, 1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(X_test[:, 1], y_test,
            color="k", alpha=0.5, label="Test data")
plt.xlabel("Mileage")
plt.ylabel("Repair Cost (USD)")
plt.title("Bayesian Regression: Repair Cost vs. Mileage")
plt.legend()
plt.tight_layout()
plt.show()
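A plot shows the band, but not whether it is calibrated. A simple check is the share of held-out observations that fall inside the 94% predictive interval, which should be close to 0.94. The sketch below uses synthetic data and an equal-tailed percentile interval as a simple stand-in for the HDI; the function name `interval_coverage` is hypothetical.

```python
import numpy as np

def interval_coverage(y_true, pred_draws, prob=0.94):
    """Fraction of observations inside the central `prob` predictive interval.

    pred_draws has shape (n_draws, n_obs); an equal-tailed percentile
    interval is used here as a simple stand-in for the HDI.
    """
    tail = (1.0 - prob) / 2.0 * 100.0
    lower, upper = np.percentile(pred_draws, [tail, 100.0 - tail], axis=0)
    inside = (y_true >= lower) & (y_true <= upper)
    return inside.mean()

# Synthetic well-calibrated case: truth and predictive draws share one distribution
rng = np.random.default_rng(0)
mu = rng.uniform(500.0, 2000.0, size=400)
y_obs = rng.normal(mu, 150.0)
draws = rng.normal(mu, 150.0, size=(4000, 400))
coverage = interval_coverage(y_obs, draws)

# Synthetic overconfident case: predictive noise is half the true noise
narrow_draws = rng.normal(mu, 75.0, size=(4000, 400))
narrow_coverage = interval_coverage(y_obs, narrow_draws)

print(f"calibrated: {coverage:.2f}, overconfident: {narrow_coverage:.2f}")
```

Applied to this tutorial, `pred_draws` would be posterior predictive draws for the test rows, with shape (n_draws, len(y_test)). Coverage far below 0.94 would suggest the Normal likelihood or the linear predictor understates the real variation in repair costs.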
Summary
This Bayesian Regression workflow for car repair‑cost forecasting provides:
1. Point estimates of expected repair cost from early vehicle indicators.
2. Credible intervals that quantify forecasting uncertainty, enabling more reliable quoting and budgeting.
3. Actionable insights: knowing both the expected cost and its uncertainty bounds lets repair shops and fleet managers plan parts procurement and scheduling proactively.