Car Maintenance Cost Prediction using Linear Regression in ML
Repair bills can surprise car owners and fleet managers alike. Knowing a vehicle’s expected upkeep bill for the coming year helps in setting aside cash, pricing warranties, and even deciding when to sell.
In this mini‑project, we create a linear‑regression baseline that predicts a car’s annual maintenance cost (USD) from everyday attributes—age, mileage, engine size, make/model, fuel type, and transmission. The transparent coefficients serve as a first‑order guide before you move on to richer, nonlinear models.
Libraries Required
- pandas # tabular wrangling
- numpy # numerical helpers
- matplotlib.pyplot # sanity‑check visuals
- seaborn # optional correlation heatmaps
- scikit‑learn # preprocessing, model, metrics
- joblib # persist trained pipeline
Dataset Link
Logistics Vehicle Maintenance History Dataset
Step-by-Step Implementation
Why linear regression? Within normal operating ranges, annual upkeep often grows roughly linearly with mileage and age; treating make/model and power‑train as additive categorical shifts produces an interpretable first cut.
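Written out, the additive structure described above looks roughly like the formula below. The symbols are notation introduced here for illustration and do not come from the dataset: the x terms are the three numeric features, and the sum runs over every one-hot category (each make, model, fuel type, and transmission).

```latex
\hat{y} \;=\; \beta_0
  \;+\; \beta_{\text{age}}\, x_{\text{age}}
  \;+\; \beta_{\text{km}}\, x_{\text{km}}
  \;+\; \beta_{\text{eng}}\, x_{\text{eng}}
  \;+\; \sum_{c \in \mathcal{C}} \gamma_c \,\mathbb{1}\!\left[\text{vehicle is in category } c\right]
```

Each numeric coefficient is the estimated change in annual cost per unit of that feature with everything else held fixed, and each category term shifts the whole prediction up or down by a constant.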
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # optional
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
df = pd.read_csv("vehicle_maintenance_history.csv")
print(df.head())
3. Quick cleaning
# keep only rows with a recorded maintenance cost
df = df.dropna(subset=['Maintenance_Cost'])
# simple removal of rows missing any core predictor
core_cols = ['Age_Years', 'Mileage_km', 'Engine_Size_L',
'Make', 'Model', 'Fuel_Type', 'Transmission']
df = df.dropna(subset=core_cols)
4. Feature lists & label
num_cols = ['Age_Years', 'Mileage_km', 'Engine_Size_L']
cat_cols = ['Make', 'Model', 'Fuel_Type', 'Transmission']
target = 'Maintenance_Cost'
X = df[num_cols + cat_cols]
y = df[target]
5. Pre‑processing & model pipeline
- One-hot encoding avoids imposing a false ordering on brands or fuel types while letting the model learn a clean per-category cost offset.
- ColumnTransformer + Pipeline keeps preprocessing and coefficient fitting joined at the hip—critical when you serve the model in production.
ohe = OneHotEncoder(handle_unknown='ignore')
preproc = ColumnTransformer([
('cat', ohe, cat_cols)
], remainder='passthrough') # numeric features pass through
linreg = LinearRegression(n_jobs=-1)
pipe = Pipeline(steps=[
('prep', preproc),
('model', linreg)
])
6. Train/test split & training
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
7. Evaluation
Evaluation metrics — R² shows variance explained, while MAE (in dollars) speaks directly to wallet impact.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f} per vehicle‑year")
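For intuition about the two metrics, the sketch below recomputes them by hand on four made-up cost values and checks the result against scikit-learn. The numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Hypothetical annual costs in USD, for illustration only
y_true = np.array([400.0, 600.0, 800.0, 1000.0])
y_hat = np.array([450.0, 550.0, 850.0, 950.0])

# MAE: the average absolute miss, in the same unit as the target
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))  # both 50.0
print(r2_manual, r2_score(y_true, y_hat))              # both 0.95
```

Every prediction here misses by $50, so the MAE is $50, yet R² is still 0.95 because those misses are small next to the spread of the true costs.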
8. Inspect top cost drivers
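The coefficient table quickly surfaces which makes, models, or power-trains rack up the heaviest bills, which is valuable insight for warranty pricing or resale valuation. Because every category gets its own one-hot column alongside an intercept, read the category coefficients as offsets relative to the other categories of the same feature rather than as absolute dollar amounts.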
# rebuild feature names after one‑hot encoding
ohe_feats = pipe.named_steps['prep']\
.named_transformers_['cat']\
.get_feature_names_out(cat_cols)
feature_names = list(ohe_feats) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_, index=feature_names)
# 10 strongest positive and negative effects
print("\nMost expensive factors:")
print(coefs.sort_values(ascending=False).head(10))
print("\nCost‑reducing factors:")
print(coefs.sort_values().head(10))
9. Persist the trained pipeline
Model persistence with joblib enables batch scoring of fresh fleet logs without retraining, ensuring consistent predictions over time.
joblib.dump(pipe, "car_maintenance_cost_linreg.pkl")
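A minimal sketch of the reload-and-score round trip is shown below. To stay self-contained it trains a tiny stand-in pipeline on made-up rows and writes it to a temporary directory. The toy data, the file name toy_linreg.pkl, and the Predicted_Cost column are all assumptions for illustration; in practice you would load the file saved above and pass in a frame with the same columns used for training.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature training set standing in for the real data
toy = pd.DataFrame({
    "Age_Years": [1, 3, 5, 8],
    "Mileage_km": [15000, 45000, 80000, 130000],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Diesel"],
    "Maintenance_Cost": [300.0, 520.0, 760.0, 1150.0],
})
features = ["Age_Years", "Mileage_km", "Fuel_Type"]

toy_pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["Fuel_Type"])],
        remainder="passthrough")),
    ("model", LinearRegression()),
])
toy_pipe.fit(toy[features], toy["Maintenance_Cost"])

# Save and reload: the file holds preprocessing and model together
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_linreg.pkl")
    joblib.dump(toy_pipe, path)
    loaded = joblib.load(path)

# Score fresh rows; raw columns go in, no manual encoding needed
new_logs = pd.DataFrame({
    "Age_Years": [2, 6],
    "Mileage_km": [30000, 95000],
    "Fuel_Type": ["Diesel", "Petrol"],
})
preds = loaded.predict(new_logs[features])
scored = new_logs.assign(Predicted_Cost=preds)
print(scored)
```

Since joblib files are pickles, load them only from sources you trust, and with the same scikit-learn version that was used to save them where possible.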
Summary
By pairing everyday vehicle attributes with a transparent linear model, we’ve produced a working maintenance‑cost estimator in fewer than fifty lines of code. The coefficients double as an instant league table of cost drivers, while the pipeline architecture sets the stage for future upgrades—regularised regressors, gradient‑boosted trees, or time‑series features—without throwing away your baseline.
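As one hedged illustration of such an upgrade, the sketch below clones a baseline pipeline and swaps only its final step for a Ridge regressor, leaving the preprocessing untouched. The toy data, the alpha value, and the variable names are assumptions for illustration, not tuned choices.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature dataset, for illustration only
X_toy = pd.DataFrame({
    "Age_Years": [1, 3, 5, 8],
    "Mileage_km": [15000, 45000, 80000, 130000],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Diesel"],
})
y_toy = pd.Series([300.0, 520.0, 760.0, 1150.0])

baseline = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["Fuel_Type"])],
        remainder="passthrough")),
    ("model", LinearRegression()),
])
baseline.fit(X_toy, y_toy)

# clone() gives an unfitted copy; set_params replaces the 'model' step
ridge_pipe = clone(baseline).set_params(model=Ridge(alpha=1.0))
ridge_pipe.fit(X_toy, y_toy)

print(type(baseline.named_steps["model"]).__name__)    # LinearRegression
print(type(ridge_pipe.named_steps["model"]).__name__)  # Ridge
```

The baseline stays intact for side-by-side comparison. Regularised models usually benefit from scaling the numeric columns first, which this sketch omits to keep the swap itself in focus.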