Car Maintenance Cost Prediction using Linear Regression in ML

Repair bills can surprise car owners and fleet managers alike. Knowing a vehicle’s expected upkeep bill for the coming year helps in setting aside cash, pricing warranties, and even deciding when to sell.

In this mini‑project, we create a linear‑regression baseline that predicts a car’s annual maintenance cost (USD) from everyday attributes—age, mileage, engine size, make/model, fuel type, and transmission. The transparent coefficients serve as a first‑order guide before you move on to richer, nonlinear models.

Libraries Required

pandas # tabular wrangling
numpy # numerical helpers
matplotlib.pyplot # sanity‑check visuals
seaborn # optional correlation heatmaps
scikit‑learn # preprocessing, model, metrics
joblib # persist trained pipeline

Dataset Link

Logistics Vehicle Maintenance History Dataset

Step-by-Step Implementation

Why linear regression? Within normal operating ranges, annual upkeep often grows roughly linearly with mileage and age; treating make/model and power‑train as additive categorical shifts produces an interpretable first cut.
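
Written out, the additive structure described above looks roughly like the following. The symbols are notation introduced here for illustration and do not come from the dataset; each categorical field (make, model, fuel type, transmission) contributes one offset per level.

```latex
\widehat{\text{cost}}
  = \beta_0
  + \beta_{\text{age}} \cdot \text{age}
  + \beta_{\text{km}} \cdot \text{mileage}
  + \beta_{\text{eng}} \cdot \text{engine size}
  + \sum_{c} \gamma_c \, \mathbb{1}[\text{vehicle belongs to category } c]
```

Each numeric coefficient is a cost change per unit of that feature (per year, per km, per litre), while each \gamma_c is a flat offset in dollars for that category.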

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns                           # optional

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the data

df = pd.read_csv("vehicle_maintenance_history.csv")
print(df.head())
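
Before going further, it may be worth confirming that the file actually contains the columns the later steps refer to. The sketch below runs on a tiny in-memory CSV with made-up values; the `expected` list simply mirrors the column names used in this tutorial, and `sample` and `missing_cols` are illustrative names.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV, deliberately missing some columns
csv_text = io.StringIO(
    "Age_Years,Mileage_km,Make,Maintenance_Cost\n"
    "3,40000,A,300\n"
)
sample = pd.read_csv(csv_text)

# Column names the rest of the tutorial relies on
expected = ["Age_Years", "Mileage_km", "Engine_Size_L", "Make", "Model",
            "Fuel_Type", "Transmission", "Maintenance_Cost"]

# Report anything absent so a naming mismatch is caught early
missing_cols = [c for c in expected if c not in sample.columns]
print("missing columns:", missing_cols)
```

If the list is non-empty for your copy of the dataset, rename or map the columns before continuing.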

3. Quick cleaning

# keep only rows with a recorded maintenance cost
df = df.dropna(subset=['Maintenance_Cost'])

# simple removal of rows missing any core predictor
core_cols = ['Age_Years', 'Mileage_km', 'Engine_Size_L',
             'Make', 'Model', 'Fuel_Type', 'Transmission']
df = df.dropna(subset=core_cols)
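
Dropping rows is the simplest option, but it helps to know how much data it costs. A minimal sketch, on a toy frame with made-up values that reuses a few of the tutorial's column names, counts missing entries per column and the rows that survive:

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented purely for illustration
toy = pd.DataFrame({
    "Age_Years": [3, 7, np.nan, 5],
    "Mileage_km": [40_000, 120_000, 80_000, np.nan],
    "Make": ["A", "B", None, "A"],
    "Maintenance_Cost": [300.0, 900.0, 650.0, np.nan],
})

# Missing values per column
missing = toy.isna().sum()
print(missing)

# Rows left after dropping any row with a missing value
kept = toy.dropna()
print(f"kept {len(kept)} of {len(toy)} rows")
```

If a large share of rows disappears on the real data, imputation may be a better choice than deletion.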

4. Feature lists & label

num_cols = ['Age_Years', 'Mileage_km', 'Engine_Size_L']
cat_cols = ['Make', 'Model', 'Fuel_Type', 'Transmission']
target   = 'Maintenance_Cost'

X = df[num_cols + cat_cols]
y = df[target]

5. Pre‑processing & model pipeline

One-hot encoding avoids imposing a false ordering on brands or fuel types while letting the model learn a clean per-category cost offset.
ColumnTransformer + Pipeline keeps preprocessing and coefficient fitting joined at the hip—critical when you serve the model in production.

ohe = OneHotEncoder(handle_unknown='ignore')

preproc = ColumnTransformer([
        ('cat', ohe, cat_cols)
    ], remainder='passthrough')           # numeric features pass through

linreg = LinearRegression(n_jobs=-1)      # n_jobs mainly helps multi-target fits; harmless here

pipe = Pipeline(steps=[
        ('prep', preproc),
        ('model', linreg)
])

6. Train/test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
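
A single 80/20 split can be lucky or unlucky. As an optional extra, k-fold cross-validation gives a spread of scores. The sketch below uses synthetic data (the cost formula and all values are assumptions made up for the demo) and a reduced version of the pipeline, named `demo_pipe` to keep it separate from `pipe`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Age_Years": rng.integers(1, 15, n),
    "Mileage_km": rng.integers(5_000, 250_000, n),
    "Fuel_Type": rng.choice(["Petrol", "Diesel"], n),
})
# Synthetic target: linear in age and mileage, a diesel offset, plus noise
cost = (100 + 40 * demo["Age_Years"] + 0.004 * demo["Mileage_km"]
        + 150 * (demo["Fuel_Type"] == "Diesel") + rng.normal(0, 50, n))

demo_pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["Fuel_Type"])],
        remainder="passthrough")),
    ("model", LinearRegression()),
])

# scikit-learn maximises scores, so MAE is reported negated; flip the sign back
scores = cross_val_score(demo_pipe, demo, cost, cv=5,
                         scoring="neg_mean_absolute_error")
mae_per_fold = -scores
print("MAE per fold:", np.round(mae_per_fold, 1))
print(f"mean MAE: {mae_per_fold.mean():.1f}")
```

Because preprocessing lives inside the pipeline, the encoder is refit on each training fold, which avoids leaking information from the held-out fold.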

7. Evaluation

Evaluation metrics: R² is the share of test-set variance in cost that the model explains, while MAE (in dollars) speaks directly to wallet impact.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f} per vehicle‑year")
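
For readers who want to see what the two metrics compute, here is a hand-rolled version on four made-up cost values. The numbers are invented for illustration; `r2_score` and `mean_absolute_error` follow the same definitions.

```python
import numpy as np

y_true = np.array([200.0, 400.0, 600.0, 800.0])  # made-up actual costs
y_hat = np.array([250.0, 350.0, 650.0, 750.0])   # made-up predictions

# MAE: average absolute miss, in the same units as the target
mae = np.mean(np.abs(y_true - y_hat))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE = {mae:.1f}, R^2 = {r2:.3f}")
```

Every prediction here misses by exactly 50, so the MAE is 50, while R² stays high because those misses are small relative to the spread of the actual costs.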

8. Inspect top cost drivers

The coefficient table quickly surfaces which makes, models, or power-trains rack up the heaviest bills, which is valuable insight for warranty pricing or resale valuation.

# rebuild feature names after one‑hot encoding
ohe_feats = pipe.named_steps['prep']\
                .named_transformers_['cat']\
                .get_feature_names_out(cat_cols)

feature_names = list(ohe_feats) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_, index=feature_names)

# 10 strongest positive and negative effects
print("\nMost expensive factors:")
print(coefs.sort_values(ascending=False).head(10))
print("\nCost‑reducing factors:")
print(coefs.sort_values().head(10))

9. Persist the trained pipeline

Model persistence with joblib enables batch scoring of fresh fleet logs without retraining, ensuring consistent predictions over time.

joblib.dump(pipe, "car_maintenance_cost_linreg.pkl")
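
Reloading works the same way in reverse. The following sketch trains a tiny stand-in pipeline on made-up rows, saves it to a temporary directory, loads it back, and scores a new vehicle. The file name, the data, and the "Electric" category are all assumptions for the demo.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training rows
train = pd.DataFrame({
    "Age_Years": [1, 3, 5, 8, 10, 12],
    "Mileage_km": [10_000, 40_000, 70_000, 110_000, 150_000, 190_000],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Diesel", "Petrol", "Diesel"],
})
cost = [150.0, 320.0, 410.0, 700.0, 790.0, 1050.0]

demo_pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["Fuel_Type"])],
        remainder="passthrough")),
    ("model", LinearRegression()),
]).fit(train, cost)

# Round-trip through disk using a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_linreg.pkl")
    joblib.dump(demo_pipe, path)
    loaded = joblib.load(path)

# New data must carry the same column names as the training frame
new_rows = pd.DataFrame({
    "Age_Years": [6],
    "Mileage_km": [90_000],
    "Fuel_Type": ["Electric"],  # unseen category, encoded as all zeros
})
pred_loaded = loaded.predict(new_rows)
pred_original = demo_pipe.predict(new_rows)
print(pred_loaded)
```

Pickled pipelines are generally tied to the scikit-learn version that produced them, so it is prudent to load them with the same version used for training.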

Summary

By pairing everyday vehicle attributes with a transparent linear model, we’ve produced a working maintenance‑cost estimator in fewer than fifty lines of code. The coefficients double as an instant league table of cost drivers, while the pipeline architecture sets the stage for future upgrades—regularised regressors, gradient‑boosted trees, or time‑series features—without throwing away your baseline.
