Vehicle Repair Cost Prediction with Ridge Regression in ML

After a crash or major mechanical failure, an insurer (or fleet manager) must decide how much the repair will cost before work even begins. A solid up‑front estimate lets them:

set realistic reserves,
decide whether the car should be written off instead of fixed, and
steer owners to the proper repair facility.

We’ll build a Ridge‑regression model—a linear model with an L2 penalty—that predicts a claim’s expected repair cost in USD from details already captured at first notice of loss:

vehicle age and market price band
odometer group
body style and size class
type and severity of damage (front, side, total, etc.)
driver age band
whether the airbags deployed
loss location (urban / rural)

Ridge keeps the maths linear and easy to explain, while shrinking unstable coefficients that often crop up when dozens of related damage codes are present.
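
The shrinkage effect can be seen on a toy problem. The sketch below is only an illustration on synthetic data (the variables x1 and x2 and the true slope of 3 are made up, not taken from the claims file): two almost identical features make ordinary least squares split the effect unpredictably, while Ridge spreads it more evenly between them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two nearly identical predictors, standing in for overlapping damage codes.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])

# Synthetic target: only the shared signal matters (true slope 3 is an assumption).
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients  :", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

The Ridge coefficient vector is never longer than the least-squares one, which is the sense in which the penalty "shrinks" unstable estimates.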

Libraries Required

pandas # data wrangling
numpy # numerical utilities
matplotlib.pyplot # optional quick plots
scikit‑learn # preprocessing, RidgeCV, metrics
joblib # save the trained pipeline

Dataset Link

Swedish Motor Insurance Claims

Step by Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
import joblib

2. Load the Kaggle file

df = pd.read_csv("Swedish_Motor_Insurance_Claims.csv")    # after un‑zipping
print(df.head())

Key columns in this public file

field	description
claim_amount	target – estimated repair payout (USD)
vehicle_price_band	Low / Medium / High
vehicle_age_years	numeric
odometer_group	0‑10 k / 10‑20 k …
vehicle_body	Sedan / SUV / Hatch …
damage_area	Front / Rear / Side / Total
airbag_deployed	0 / 1
driver_age_band	<25 / 25‑40 / 40‑60 / 60+
loss_region	Urban / Rural
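
Column names in a downloaded file do not always match a tutorial, so it may be worth checking them before going further. The sketch below is one possible check; the helper missing_columns and the in-memory sample CSV are hypothetical stand-ins, and only the column names come from the table above.

```python
import io

import pandas as pd

# Column names taken from the table above.
EXPECTED = ['claim_amount', 'vehicle_price_band', 'vehicle_age_years',
            'odometer_group', 'vehicle_body', 'damage_area',
            'airbag_deployed', 'driver_age_band', 'loss_region']


def missing_columns(frame, expected=EXPECTED):
    """Return the expected column names that are absent from the frame."""
    return [col for col in expected if col not in frame.columns]


# Tiny made-up CSV standing in for the real download.
sample_csv = io.StringIO("claim_amount,vehicle_age_years,damage_area\n1200,3,Front\n")
sample = pd.read_csv(sample_csv)

missing = missing_columns(sample)
print("Missing columns:", missing)
```

On the real data, calling missing_columns(df) right after pd.read_csv would show whether any field needs renaming before the later steps run.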

3. Clean and split features

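Vehicle age ranges from 0 to 20 years, while the one‑hot dummies are 0/1. Ridge penalises the size of each coefficient, and a coefficient's size depends on the units of its feature, so standardising age puts it on a scale comparable to the dummies and keeps the penalty from treating it differently just because of its numeric range.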

core = ['claim_amount','vehicle_price_band','vehicle_age_years',
        'odometer_group','vehicle_body','damage_area',
        'airbag_deployed','driver_age_band','loss_region']

df = df.dropna(subset=core).copy()

num_cols = ['vehicle_age_years']
cat_cols = ['vehicle_price_band','odometer_group','vehicle_body',
            'damage_area','airbag_deployed','driver_age_band',
            'loss_region']

X = df[num_cols + cat_cols]
y = df['claim_amount']
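
To see what standardising does to the age column, here is a small sketch on made-up ages (the five values are illustrative, not from the dataset). After scaling, the column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up vehicle ages in years, spanning the 0 to 20 range.
ages = np.array([[0.0], [2.0], [5.0], [10.0], [20.0]])

scaler = StandardScaler()
ages_z = scaler.fit_transform(ages)

print("mean of raw ages :", scaler.mean_[0])
print("std of raw ages  :", scaler.scale_[0])
print("z-scored ages    :", ages_z.ravel())
```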

4. Pre‑processing + Ridge pipeline

Claim files contain dozens of categorical codes. One‑hotting turns each into a 0/1 flag; Ridge’s L2 penalty keeps the army of flags from blowing up the variance.
RidgeCV tries several regularisation strengths and picks the one with the best 5‑fold cross‑validated score, saving you a manual grid search.

preprocess = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

alphas = [0.1, 1, 10, 50, 100]          # candidate L2 strengths
ridge   = RidgeCV(alphas=alphas, cv=5)

pipe = Pipeline([
        ('prep',  preprocess),
        ('ridge', ridge)
])

5. Train–test split & fit

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
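
If the CSV is not at hand, the same pipeline shape can be smoke-tested on synthetic data. Everything in the sketch below is invented for illustration: the rows are random, the cost rule is made up, and only a subset of the column names from the table is used. It checks that the pieces fit together and says nothing about real claims.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 500

# Entirely synthetic claims; column names copied from the table above.
demo = pd.DataFrame({
    "vehicle_age_years": rng.integers(0, 21, size=n),
    "vehicle_body": rng.choice(["Sedan", "SUV", "Hatch"], size=n),
    "damage_area": rng.choice(["Front", "Rear", "Side", "Total"], size=n),
    "airbag_deployed": rng.integers(0, 2, size=n),
})
# Made-up cost rule plus noise, only so the model has something to learn.
demo["claim_amount"] = (
    2000
    + 900 * demo["airbag_deployed"]
    + 400 * (demo["vehicle_body"] == "SUV")
    - 30 * demo["vehicle_age_years"]
    + rng.normal(scale=300, size=n)
)

num_cols_demo = ["vehicle_age_years"]
cat_cols_demo = ["vehicle_body", "damage_area", "airbag_deployed"]

pipe_demo = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols_demo),
        ("num", StandardScaler(), num_cols_demo),
    ])),
    ("ridge", RidgeCV(alphas=[0.1, 1, 10, 50, 100], cv=5)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[num_cols_demo + cat_cols_demo], demo["claim_amount"],
    test_size=0.2, random_state=42)

pipe_demo.fit(X_tr, y_tr)
print("alpha chosen:", pipe_demo.named_steps["ridge"].alpha_)
print("hold-out R² :", round(pipe_demo.score(X_te, y_te), 3))
```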

6. Performance check

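The mean absolute error (MAE) is the typical dollar gap between predicted and actual cost. An MAE of $530, for example, would tell the adjuster roughly how much financial buffer to keep when approving a repair estimate.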

pred = pipe.predict(X_test)

print(f"α selected by CV : {pipe.named_steps['ridge'].alpha_}")
print(f"R² on hold‑out   : {r2_score(y_test, pred):.3f}")
print(f"MAE on hold‑out  : ${mean_absolute_error(y_test, pred):,.0f}")
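
Both metrics are simple enough to compute by hand, which helps when explaining them to non-technical colleagues. The sketch below uses four made-up claims (not results from the dataset) and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([1200.0, 3400.0, 800.0, 5600.0])   # made-up actual costs
y_pred = np.array([1500.0, 3000.0, 1000.0, 5000.0])  # made-up predictions

# MAE: average of the absolute errors (300, 400, 200, 600) -> 375.0
mae_manual = np.mean(np.abs(y_true - y_pred))

# R²: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MAE manual / sklearn:", mae_manual, mean_absolute_error(y_true, y_pred))
print("R²  manual / sklearn:", r2_manual, r2_score(y_true, y_pred))
```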

7. Which factors cost the most?

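Every coefficient is expressed in dollars. A coefficient of +950 on airbag_deployed_1, for instance, would read as “airbag deployed adds about $950,” and +430 on vehicle_body_SUV as “SUV body style adds about $430” (illustrative figures, not results from this dataset). That directness is valuable to actuaries.

Because vehicle_age_years was z‑scored, its coefficient reads as “$ change for a one‑standard‑deviation increase in age.” The OneHotEncoder here keeps every level (no category is dropped), so each one‑hot coefficient is an offset from the model intercept rather than from a single reference category; to compare two levels of the same field, take the difference between their coefficients.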

# rebuild feature names after one‑hot encoding
ohe = pipe.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])

coefs = pd.Series(pipe.named_steps['ridge'].coef_,
                  index=feature_names).sort_values()

print("\nTop cost *reducers*:")
print(coefs.head(8))

print("\nTop cost *drivers*:")
print(coefs.tail(8))
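
A per-standard-deviation coefficient can be turned back into dollars per year by dividing by the standard deviation the scaler learned. The sketch below shows the idea on synthetic data; the rule of -$40 per year of age is made up for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic ages and costs; the true slope of -40 dollars per year is an assumption.
age = rng.integers(0, 21, size=300).astype(float).reshape(-1, 1)
cost = 3000 - 40 * age.ravel() + rng.normal(scale=100, size=300)

model = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
]).fit(age, cost)

coef_per_sd = model.named_steps["ridge"].coef_[0]   # dollars per 1 SD of age
sd_years = model.named_steps["scale"].scale_[0]     # SD of age in years
coef_per_year = coef_per_sd / sd_years              # dollars per year of age

print(f"per SD  : {coef_per_sd:.1f}")
print(f"per year: {coef_per_year:.1f}")
```

In the pipeline above, the fitted scaler should be reachable as pipe.named_steps['prep'].named_transformers_['num'], and its scale_[0] plays the role of sd_years.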

8. Save the model for production use

joblib.dump(pipe, "ridge_vehicle_repair_cost.pkl")
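
Loading the saved pipeline and scoring a new claim could look like the sketch below. To stay self-contained it trains a tiny stand-in pipeline on four made-up rows and writes it to a temporary file; the file name tiny_pipeline.pkl and the new claim are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Four made-up training rows, just enough to fit a stand-in pipeline.
train = pd.DataFrame({
    "vehicle_age_years": [1, 5, 10, 3],
    "damage_area": ["Front", "Rear", "Side", "Front"],
    "claim_amount": [3200.0, 1800.0, 2100.0, 2900.0],
})
tiny = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["damage_area"]),
        ("num", StandardScaler(), ["vehicle_age_years"]),
    ])),
    ("ridge", Ridge(alpha=1.0)),
]).fit(train[["vehicle_age_years", "damage_area"]], train["claim_amount"])

# Save to a temporary location, then load it back as a serving process would.
path = os.path.join(tempfile.mkdtemp(), "tiny_pipeline.pkl")
joblib.dump(tiny, path)
loaded = joblib.load(path)

# A new claim must arrive as a DataFrame with the same column names used in training.
new_claim = pd.DataFrame([{"vehicle_age_years": 4, "damage_area": "Front"}])
estimate = loaded.predict(new_claim)[0]
print(f"Estimated repair cost: ${estimate:,.0f}")
```

Because the preprocessing lives inside the pipeline, the loaded object accepts raw fields directly. A pickle should generally be loaded with the same scikit-learn version that saved it.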

Summary

A Ridge‑regression pipeline converts raw first‑notice‑of‑loss fields into an instant repair‑cost estimate.
Regularisation tames multicollinearity among overlapping damage codes while keeping the model linear, fast, and interpretable.
The resulting coefficients highlight the main cost drivers (body style, damage area, airbag deployment), so claims teams can see where to focus negotiation with repair shops.
