Car Repair Cost Prediction using Bayesian Regression in ML


Fleet managers and independent repair shops need to forecast a vehicle's repair cost before mechanics start work, using early diagnostic indicators such as vehicle age, odometer reading, engine hours, and the cost of the most recent maintenance. Repair costs tend to grow nonlinearly with usage and age (for example, accelerated wear-out) and vary with part-price volatility and labour rates. A model that returns only a single point estimate hides this uncertainty and risks under- or overestimating the bill. Bayesian regression yields both a best estimate of repair cost and a credible interval that quantifies the uncertainty around it, which supports data-driven quoting, parts procurement, and scheduling. The model built here is linear in the standardised predictors, so it approximates any nonlinear growth with a straight-line trend.

Libraries Required

import pandas as pd                              # data loading & handling  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # visualization  

import pymc3 as pm                               # Bayesian modeling  
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error

Dataset

Motor Vehicle Repair & Towing Dataset

Step-by-Step Code Implementation

Import Libraries & Load Data

import pandas as pd

# Load the repair dataset
df = pd.read_csv("data/motor-vehicle-repair-and-towing-dataset.csv")

# Preview key columns
df[['Vehicle_Age','Mileage','Engine_Hours','Prev_Repair_Cost','Repair_Cost']].head()

Preprocessing & Train/Test Split

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Select predictors and target
X = df[['Vehicle_Age','Mileage','Engine_Hours','Prev_Repair_Cost']].values
y = df['Repair_Cost'].values  # USD

# Split data (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features for stable MCMC sampling
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)
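
To make the standardisation step concrete, here is a minimal NumPy sketch of what fitting the scaler on the training split and reusing it on the test split amounts to. The two feature columns and their ranges are synthetic stand-ins, not values from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two predictors (vehicle age in years, mileage);
# these numbers are made up for illustration only.
X_train = np.column_stack([rng.uniform(1, 15, 200), rng.uniform(5_000, 250_000, 200)])
X_test = np.column_stack([rng.uniform(1, 15, 50), rng.uniform(5_000, 250_000, 50)])

# Statistics come from the training split only, so no test information leaks in.
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)        # population std (ddof=0), as StandardScaler uses

X_train_s = (X_train - mu) / sd
X_test_s = (X_test - mu) / sd   # reuse the training mean and std

print(X_train_s.mean(axis=0).round(6))  # approximately 0 per column
print(X_train_s.std(axis=0).round(6))   # approximately 1 per column
print(X_test_s.mean(axis=0).round(3))   # close to 0, but not exactly
```

The test columns are only approximately centred because they are scaled with the training statistics, which is the intended behaviour.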

Define & Fit Bayesian Regression Model

Priors:

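α ∼ Normal(0, 100): prior for the intercept, with mean 0 and standard deviation 100. Because the predictors are standardised, α is the expected cost of a vehicle with average feature values, so this prior is only broad if typical repair costs stay within a few hundred dollars.
β ∼ Normal(0, 50): moderate uncertainty about each standardised predictor's effect, in USD per one standard deviation of the feature.
σ ∼ HalfNormal(50): positive noise scale, in USD.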

Model:

The linear predictor μ = α + β·X_standardized captures how features drive repair cost.
Observations y_train are modelled as Normal(μ, σ).

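Sampling: We draw 2,000 posterior samples per chain after 1,000 tuning (burn-in) iterations, which are discarded. Setting target_accept=0.9, above the default of 0.8, makes the NUTS sampler take smaller steps, which tends to reduce divergences at the cost of slower sampling.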
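
One way to sanity-check the priors listed above before fitting is a prior predictive simulation. The sketch below uses NumPy only and is not part of the PyMC3 model. It draws parameters from the stated priors and simulates the repair cost they imply for a hypothetical vehicle that sits one standard deviation above average on every feature.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_features = 5_000, 4

# Draw parameters from the priors listed above.
alpha = rng.normal(0.0, 100.0, size=n_sims)
beta = rng.normal(0.0, 50.0, size=(n_sims, n_features))
sigma = np.abs(rng.normal(0.0, 50.0, size=n_sims))  # HalfNormal(50) as |Normal(0, 50)|

# Hypothetical vehicle: one standard deviation above average on every feature.
x = np.ones(n_features)

mu = alpha + beta @ x
cost = rng.normal(mu, sigma)  # repair cost implied by the priors alone

lo, hi = np.percentile(cost, [3, 97])
print(f"94% of prior-implied costs fall between {lo:.0f} and {hi:.0f} USD")
print(f"Share of simulated costs below zero: {(cost < 0).mean():.2f}")
```

Because every prior is centred on zero, roughly half of the simulated costs are negative. If real repair bills sit far outside the simulated range, recentring or widening the intercept prior may be worth considering.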

import pymc3 as pm

with pm.Model() as repair_model:
    # Named data container ("X") so feature values can be replaced
    # later via pm.set_data when predicting on new inputs
    X_data = pm.Data("X", X_train_s)

    # Priors for intercept and weights
    α = pm.Normal("α", mu=0, sigma=100)
    β = pm.Normal("β", mu=0, sigma=50, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=50)

    # Linear predictor
    μ = α + pm.math.dot(X_data, β)

    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # Sample posterior: 2,000 draws per chain after 1,000 tuning steps
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
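
Before relying on the posterior, it helps to check that the chains agree with each other. `az.summary` reports this as the `r_hat` column. As a rough illustration of the idea, the following sketch implements the classic Gelman-Rubin statistic in NumPy on synthetic chains. ArviZ's default is a rank-normalised split variant, so its numbers will differ somewhat from this simplified version.

```python
import numpy as np

def gelman_rubin(chains: np.ndarray) -> float:
    """Classic (non-split) R-hat for an array shaped (n_chains, n_draws)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(2)

# Four synthetic chains that all sample the same distribution: R-hat near 1.
good = rng.normal(0.0, 1.0, size=(4, 2000))

# Same chains, but one is stuck in a different region: R-hat well above 1.
bad = good.copy()
bad[0] += 3.0

print(f"mixed chains: {gelman_rubin(good):.3f}")
print(f"stuck chain:  {gelman_rubin(bad):.3f}")
```

Applied to the fitted model, the input would be an array such as `trace.posterior["α"].values`, which has shape (chain, draw). Values close to 1 suggest the chains have mixed.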

Posterior Analysis & Point Predictions

Predictions & Evaluation: Posterior means of α and β yield point forecasts; MAE on held‑out data measures average error.

import arviz as az

# Summarize the posterior distributions
az.summary(trace, round_to=2)

# Posterior predictive sampling
with repair_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Extract posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values

# Point predictions on the test set
y_pred = α_post + X_test_s.dot(β_post)

# Evaluate mean absolute error
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
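
MAE only scores the point forecast. To check whether the uncertainty is calibrated as well, one option is to measure how often the held-out costs fall inside the predictive interval. The sketch below is an assumption-laden illustration: the posterior draws and test data are synthetic, `predictive_interval` is a hypothetical helper, and it uses an equal-tailed percentile interval rather than the HDI.

```python
import numpy as np

def predictive_interval(alpha, beta, sigma, X_s, prob=0.94, seed=0):
    """Equal-tailed posterior predictive interval for each row of X_s.

    alpha: (S,), beta: (S, p), sigma: (S,) posterior draws; X_s: (n, p).
    """
    rng = np.random.default_rng(seed)
    mu = alpha[:, None] + beta @ X_s.T          # (S, n) mean per draw
    y_sim = rng.normal(mu, sigma[:, None])      # add observation noise
    tail = (1.0 - prob) / 2.0 * 100.0
    lower, upper = np.percentile(y_sim, [tail, 100.0 - tail], axis=0)
    return lower, upper

# Synthetic stand-ins for posterior draws and test data (illustration only).
rng = np.random.default_rng(3)
S, n, p = 4000, 500, 4
true_beta = np.array([40.0, 60.0, 25.0, 30.0])
X_test_s = rng.normal(size=(n, p))
y_test = 500.0 + X_test_s @ true_beta + rng.normal(0.0, 80.0, size=n)

alpha = rng.normal(500.0, 5.0, size=S)
beta = true_beta + rng.normal(0.0, 5.0, size=(S, p))
sigma = np.abs(rng.normal(80.0, 3.0, size=S))

lower, upper = predictive_interval(alpha, beta, sigma, X_test_s)
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"Empirical coverage of the 94% interval: {coverage:.2%}")
```

With the fitted model, the draws could be taken from the trace by flattening the chain and draw dimensions, for example `trace.posterior["β"].values.reshape(-1, 4)`. Coverage far below the nominal 94% would suggest the intervals are too narrow.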

Visualise Predictions with Credible Intervals

Sweeping one feature (mileage) across its observed range while holding the others at their training medians, we plot the posterior predictive mean and a 94% highest-density interval, showing both the expected trend and the uncertainty around it.

import numpy as np
import matplotlib.pyplot as plt

# Vary Mileage; hold other features at median
mileage_grid = np.linspace(X_train_s[:,1].min(), X_train_s[:,1].max(), 100)
grid = np.tile(np.median(X_train_s, axis=0), (100,1))
grid[:,1] = mileage_grid

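# Flatten posterior draws across chains: α, σ -> (S,), β -> (S, 4)
α_s = trace.posterior["α"].values.reshape(-1)
β_s = trace.posterior["β"].values.reshape(-1, grid.shape[1])
σ_s = trace.posterior["σ"].values.reshape(-1)

# Posterior predictive draws on the grid: linear predictor plus observation noise
rng = np.random.default_rng(42)
preds = rng.normal(α_s[:, None] + β_s @ grid.T, σ_s[:, None])  # shape (S, 100)

mean_pred = preds.mean(axis=0)
# Add a leading chain axis so az.hdi reads the array as (chain, draw, point)
hpd = az.hdi(preds[np.newaxis, ...], hdi_prob=0.94)            # shape (100, 2)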

# Convert mileage back to original scale
mileage_orig = scaler.inverse_transform(grid)[:,1]

plt.figure(figsize=(8,5))
plt.plot(mileage_orig, mean_pred, label="Posterior mean")
plt.fill_between(mileage_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(X_test[:,1], y_test,
            color="k", alpha=0.5, label="Test data")
plt.xlabel("Mileage")
plt.ylabel("Repair Cost (USD)")
plt.title("Bayesian Regression: Repair Cost vs. Mileage")
plt.legend()
plt.tight_layout()
plt.show()
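
The slope of the posterior-mean line in this plot is the mileage coefficient expressed in original units. Since the model is fitted on standardised features, each β is the cost change per one standard deviation, and dividing by the scaler's standard deviation gives the change per raw unit. The sketch below shows the conversion with hypothetical numbers. In the tutorial the inputs would be `α_post`, `β_post`, `scaler.mean_` and `scaler.scale_`. The helper name `to_original_units` is an illustration, not a library function.

```python
import numpy as np

def to_original_units(alpha, beta, mean, scale):
    """Map intercept and slopes from standardised features to raw feature units."""
    slopes = beta / scale
    intercept = alpha - np.sum(beta * mean / scale)
    return intercept, slopes

# Hypothetical values for illustration only.
alpha = 850.0
beta = np.array([120.0, 200.0, 60.0, 90.0])        # USD per standard deviation
mean = np.array([7.0, 95_000.0, 3_200.0, 400.0])   # training means
scale = np.array([3.5, 50_000.0, 1_500.0, 250.0])  # training standard deviations

intercept, slopes = to_original_units(alpha, beta, mean, scale)

# Check: both parameterisations give the same prediction for one raw vehicle.
x_raw = np.array([10.0, 140_000.0, 4_000.0, 650.0])
pred_std = alpha + beta @ ((x_raw - mean) / scale)
pred_raw = intercept + slopes @ x_raw

print(f"Cost per extra mile: {slopes[1]:.4f} USD")
print(np.isclose(pred_std, pred_raw))
```

Reporting slopes in raw units, such as dollars per mile or per engine hour, is usually easier to communicate to a shop or fleet manager than per-standard-deviation effects.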

Summary

This Bayesian Regression workflow for car repair‑cost forecasting provides:

1. Point estimates of expected repair cost from early vehicle indicators.

2. Credible intervals that quantify forecasting uncertainty—enabling more reliable quoting and budgeting.

3. Actionable insights: repair shops and fleet managers gain both the expected cost and its uncertainty bounds, supporting proactive parts procurement and scheduling decisions.
