Vehicle Repair Cost Prediction with Ridge Regression in ML


After a crash or major mechanical failure, an insurer (or fleet manager) must decide how much the repair will cost before work even begins. A solid up‑front estimate lets them:

  • set realistic reserves,
  • decide whether the car should be written off instead of fixed, and
  • steer owners to an appropriate repair facility.

We’ll build a Ridge‑regression model—a linear model with an L2 penalty—that predicts a claim’s expected repair cost in USD from details already captured at first notice of loss:

  • vehicle age and market price band
  • odometer group
  • body style and size class
  • type and severity of damage (front, side, total, etc.)
  • driver age band
  • whether the airbags deployed
  • loss location (urban / rural)

Ridge keeps the maths linear and easy to explain, while shrinking unstable coefficients that often crop up when dozens of related damage codes are present.
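
To make the shrinkage claim concrete, here is a minimal sketch on synthetic data (not the claims file). It assumes two nearly identical features, standing in for overlapping damage codes, and compares plain least squares with Ridge. The variable names and the alpha value of 1.0 are illustrative choices only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two almost identical features (hypothetical stand-ins for related damage codes)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X_demo = np.column_stack([x1, x2])

# The target depends only on the shared signal, plus noise
y_demo = 3.0 * x1 + rng.normal(scale=1.0, size=n)

ols_demo = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# OLS is free to split the effect between x1 and x2 in an unstable way;
# Ridge pulls the pair towards similar, smaller values.
print("OLS coefficients  :", ols_demo.coef_)
print("Ridge coefficients:", ridge_demo.coef_)
```

The two Ridge coefficients should come out close to each other and sum to roughly the true effect, while the least-squares pair can swing widely from one random sample to the next.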

Libraries Required

  • pandas # data wrangling
  • numpy # numerical utilities
  • matplotlib.pyplot # optional quick plots
  • scikit‑learn # preprocessing, RidgeCV, metrics
  • joblib # save the trained pipeline

Dataset Link

Swedish Motor Insurance Claims

Step by Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
import joblib

2. Load the Kaggle file

df = pd.read_csv("Swedish_Motor_Insurance_Claims.csv")    # after un‑zipping
print(df.head())

Key columns in this public file

field                 description
claim_amount          target: estimated repair payout (USD)
vehicle_price_band    Low / Medium / High
vehicle_age_years     numeric
odometer_group        0‑10 k / 10‑20 k …
vehicle_body          Sedan / SUV / Hatch …
damage_area           Front / Rear / Side / Total
airbag_deployed       0 / 1
driver_age_band       <25 / 25‑40 / 40‑60 / 60+
loss_region           Urban / Rural

3. Clean and split features

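We drop rows that are missing any core field, then separate the numeric column from the categorical ones. Vehicle age runs from roughly 0 to 20 years, while the one‑hot dummies are only 0 or 1. Ridge's L2 penalty acts on coefficient size, and coefficient size depends on each feature's units, so standardising age puts it on a scale comparable to the dummies and lets the penalty treat all features evenly.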

core = ['claim_amount','vehicle_price_band','vehicle_age_years',
        'odometer_group','vehicle_body','damage_area',
        'airbag_deployed','driver_age_band','loss_region']

df = df.dropna(subset=core).copy()

num_cols = ['vehicle_age_years']
cat_cols = ['vehicle_price_band','odometer_group','vehicle_body',
            'damage_area','airbag_deployed','driver_age_band',
            'loss_region']

X = df[num_cols + cat_cols]
y = df['claim_amount']
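
As a small illustration of the standardising step applied to vehicle_age_years in the pipeline, the sketch below z‑scores a handful of made‑up ages. The values are hypothetical and only show what StandardScaler does to a single numeric column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical vehicle ages in years (one column, five rows)
ages = np.array([[0.0], [5.0], [10.0], [15.0], [20.0]])

age_scaler = StandardScaler()
ages_z = age_scaler.fit_transform(ages)

# mean_ and scale_ are learned from the data and reused at prediction time
print("mean :", age_scaler.mean_)
print("std  :", age_scaler.scale_)
print("z    :", ages_z.ravel().round(3))
```

After scaling, the column has mean 0 and standard deviation 1, so a coefficient on it is read per standard deviation of age rather than per year.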

4. Pre‑processing + Ridge pipeline

  • Claim files contain dozens of categorical codes. One‑hot encoding turns each level into a 0/1 flag; Ridge’s L2 penalty keeps the army of flags from blowing up the variance.
  • RidgeCV tries each candidate regularisation strength with 5‑fold cross‑validation (cv=5) and keeps the one with the best cross‑validated score, saving you a manual grid search.

preprocess = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

alphas = [0.1, 1, 10, 50, 100]          # candidate L2 strengths
ridge   = RidgeCV(alphas=alphas, cv=5)

pipe = Pipeline([
        ('prep',  preprocess),
        ('ridge', ridge)
])

5. Train–test split & fit

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
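
Fitting the pipeline is also when RidgeCV chooses its alpha. The following sketch shows that selection in isolation on synthetic data; the log‑spaced grid and the data‑generating numbers are assumptions for illustration, not values from the tutorial's dataset.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(7)

# Synthetic data: 10 features, only the first one carries signal
X_cv = rng.normal(size=(300, 10))
y_cv = 500.0 * X_cv[:, 0] + rng.normal(scale=200.0, size=300)

# A log-spaced grid covers several orders of magnitude of penalty strength
alpha_grid = np.logspace(-2, 3, 12)

search = RidgeCV(alphas=alpha_grid, cv=5).fit(X_cv, y_cv)
print("alpha chosen by 5-fold CV:", search.alpha_)
```

If the chosen alpha lands on the smallest or largest value in the grid, it is usually worth widening the grid in that direction and refitting.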

6. Performance check

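The mean absolute error (MAE) is the average dollar gap between predicted and actual repair cost. An MAE of $530, for instance, means estimates miss by about $530 on average, which tells the adjuster how much financial buffer to keep when approving a repair estimate.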

pred = pipe.predict(X_test)

print(f"α selected by CV : {pipe.named_steps['ridge'].alpha_}")
print(f"R² on hold‑out   : {r2_score(y_test, pred):.3f}")
print(f"MAE on hold‑out  : ${mean_absolute_error(y_test, pred):,.0f}")
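
An MAE figure is easier to judge next to a naive baseline. The sketch below, on synthetic data with made‑up cost effects, compares a Ridge model against a predictor that always outputs the median training cost. Every number in it is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic claims: cost falls with vehicle age and rises when airbags deploy
age = rng.uniform(0, 20, size=500)
airbag = rng.integers(0, 2, size=500)
cost = 2000 - 40 * age + 900 * airbag + rng.normal(scale=300, size=500)

X_syn = np.column_stack([age, airbag])
Xtr, Xte, ytr, yte = train_test_split(X_syn, cost, test_size=0.2, random_state=42)

baseline = DummyRegressor(strategy="median").fit(Xtr, ytr)  # ignores the features
model = Ridge(alpha=1.0).fit(Xtr, ytr)

mae_base = mean_absolute_error(yte, baseline.predict(Xte))
mae_model = mean_absolute_error(yte, model.predict(Xte))
print(f"Baseline MAE (median): ${mae_base:,.0f}")
print(f"Ridge MAE            : ${mae_model:,.0f}")
```

If the real pipeline's MAE is not clearly below the median baseline on the hold‑out set, the features are adding little and the model should not yet drive reserving decisions.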

7. Which factors cost the most?

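Because the target is in dollars, every coefficient reads as a dollar change: e.g., “airbag deployed adds $950 on average” or “SUV body style adds $430” (illustrative figures). Such clarity is gold for actuaries.

Because vehicle_age_years was z‑scored, its coefficient reads as “$ change for a one‑standard‑deviation jump in age.” The encoder above keeps every category level (none is dropped), so each one‑hot coefficient is a dollar uplift (or discount) relative to the model's intercept rather than to a single reference category. To compare two levels, take the difference of their coefficients.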

# rebuild feature names after one‑hot encoding
ohe = pipe.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])

coefs = pd.Series(pipe.named_steps['ridge'].coef_,
                  index=feature_names).sort_values()

print("\nTop cost *reducers*:")
print(coefs.head(8))

print("\nTop cost *drivers*:")
print(coefs.tail(8))
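
When every level of a category is one‑hot encoded, the difference between two levels' coefficients is the quantity that reads cleanly in dollars. The sketch below checks this on synthetic data with hypothetical average costs per damage area; the costs, sample size, and alpha are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)

# Hypothetical average repair cost per damage area
true_mean = {"Front": 1000.0, "Rear": 1500.0, "Side": 2200.0}
area = rng.choice(list(true_mean), size=300)
cost_area = np.array([true_mean[a] for a in area]) + rng.normal(scale=100, size=300)

# Full one-hot encoding: no level is dropped
enc = OneHotEncoder(handle_unknown="ignore")
X_area = enc.fit_transform(area.reshape(-1, 1)).toarray()

area_fit = Ridge(alpha=1.0).fit(X_area, cost_area)
coef_area = pd.Series(area_fit.coef_, index=enc.categories_[0])
print(coef_area.round(0))

# The contrast between two levels recovers their true cost gap (about $500 here)
gap = coef_area["Rear"] - coef_area["Front"]
print(f"Rear vs Front: ${gap:,.0f}")
```

The individual coefficients are offsets around the intercept, so their signs can look surprising on their own; the pairwise gaps are what line up with the underlying cost differences.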

8. Save the model for production use

joblib.dump(pipe, "ridge_vehicle_repair_cost.pkl")
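
A saved pipeline is only useful if it can be reloaded and scored on a new claim. The sketch below runs that round trip on a tiny made‑up training frame with two features; the file name, rows, and the fixed alpha are hypothetical. The real pipeline would expect a DataFrame with all of the columns used in training.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Six hypothetical claims, just enough to fit a toy pipeline
train = pd.DataFrame({
    "damage_area": ["Front", "Rear", "Side", "Front", "Rear", "Side"],
    "vehicle_age_years": [1, 3, 8, 12, 5, 15],
    "claim_amount": [1200.0, 900.0, 2100.0, 1000.0, 950.0, 1800.0],
})

toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["damage_area"]),
        ("num", StandardScaler(), ["vehicle_age_years"]),
    ])),
    ("ridge", Ridge(alpha=1.0)),
])
toy_pipe.fit(train[["damage_area", "vehicle_age_years"]], train["claim_amount"])

# Save to a temporary folder, then load it back as a scoring service would
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_ridge.pkl")
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

# New claims must use the same column names as the training data
new_claim = pd.DataFrame({"damage_area": ["Side"], "vehicle_age_years": [6]})
print(f"Estimated repair cost: ${restored.predict(new_claim)[0]:,.0f}")
```

Because the preprocessing lives inside the pipeline, the loaded object applies the same encoding and scaling as at training time. Load the file with the same scikit‑learn version that saved it, and only unpickle files from sources you trust.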

Summary

  • A Ridge‑regression pipeline converts raw first‑notice‑of‑loss fields into an instant repair‑cost estimate.
  • Regularisation tames multicollinearity among overlapping damage codes while keeping the model linear, fast, and interpretable.
  • The resulting coefficients highlight the main cost drivers (body style, damage area, airbag deployment), giving claims teams a starting point for where to focus negotiation with repair shops.


ProjectGurukul Team

