Vehicle Repair Cost Prediction with Ridge Regression in ML
After a crash or major mechanical failure, an insurer (or fleet manager) must estimate how much the repair will cost before work even begins. A solid up‑front estimate lets them:
- set realistic reserves,
- decide whether the car should be written off instead of fixed, and
- steer owners to the right repair facility.
We’ll build a Ridge‑regression model—a linear model with an L2 penalty—that predicts a claim’s expected repair cost in USD from details already captured at first notice of loss:
- vehicle age and market price band
- odometer group
- body style and size class
- type and severity of damage (front, side, total, etc.)
- driver age band
- whether the airbags deployed
- loss location (urban / rural)
Ridge keeps the maths linear and easy to explain, while shrinking unstable coefficients that often crop up when dozens of related damage codes are present.
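The shrinkage effect described above can be seen in a small sketch. The data here is synthetic and the two features are hypothetical stand-ins for overlapping damage codes; the alpha value is an arbitrary choice for illustration, not a tuned setting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two nearly identical (hypothetical) damage indicators: x2 is x1 plus tiny noise.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])

# Made-up cost signal that truly depends on x1 only.
y = 1000 * x1 + rng.normal(scale=100, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Size of each coefficient vector; Ridge should never be larger than OLS.
ols_norm = float(np.linalg.norm(ols.coef_))
ridge_norm = float(np.linalg.norm(ridge.coef_))

print("OLS coefficients  :", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("norms (OLS, Ridge):", ols_norm, ridge_norm)
```

With near-duplicate features, ordinary least squares can split the effect between them erratically, while the L2 penalty pulls the two coefficients toward similar, smaller values.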
Libraries Required
- pandas # data wrangling
- numpy # numerical utilities
- matplotlib.pyplot # optional quick plots
- scikit‑learn # preprocessing, RidgeCV, metrics
- joblib # save the trained pipeline
Dataset Link
Swedish Motor Insurance Claims
Step by Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
import joblib
2. Load the Kaggle file
df = pd.read_csv("Swedish_Motor_Insurance_Claims.csv") # after un‑zipping
print(df.head())
Key columns in this public file
| field | description |
| --- | --- |
| claim_amount | target – estimated repair payout (USD) |
| vehicle_price_band | Low / Medium / High |
| vehicle_age_years | numeric |
| odometer_group | 0‑10 k / 10‑20 k … |
| vehicle_body | Sedan / SUV / Hatch … |
| damage_area | Front / Rear / Side / Total |
| airbag_deployed | 0 / 1 |
| driver_age_band | <25 / 25‑40 / 40‑60 / 60+ |
| loss_region | Urban / Rural |
3. Clean and split features
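Vehicle age ranges from 0 to 20 years, while the one‑hot dummies are 0/1. Ridge penalises coefficient size, and coefficient size depends on each feature's units, so standardising age puts it on a footing comparable with the dummies and keeps the penalty from depending on an arbitrary choice of units.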
# columns the model needs; rows missing any of them are dropped
core = ['claim_amount', 'vehicle_price_band', 'vehicle_age_years',
        'odometer_group', 'vehicle_body', 'damage_area',
        'airbag_deployed', 'driver_age_band', 'loss_region']
df = df.dropna(subset=core).copy()
# numeric feature(s) to standardise
num_cols = ['vehicle_age_years']
# features to one-hot encode; the 0/1 airbag flag is treated as categorical too
cat_cols = ['vehicle_price_band', 'odometer_group', 'vehicle_body',
            'damage_area', 'airbag_deployed', 'driver_age_band',
            'loss_region']
X = df[num_cols + cat_cols]
y = df['claim_amount']
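To illustrate why scaling matters under an L2 penalty, the sketch below fits Ridge to the same synthetic ages expressed in years and in months. All numbers are made up, and alpha=100 is an arbitrary value chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic vehicle ages (0-20 years) and a made-up cost relationship.
age_years = rng.uniform(0, 20, size=(100, 1))
cost = 3000 - 80 * age_years.ravel() + rng.normal(scale=200, size=100)
age_months = age_years * 12  # same information, different units

# Without scaling, the amount of shrinkage depends on the units.
raw_years = Ridge(alpha=100).fit(age_years, cost).predict(age_years)
raw_months = Ridge(alpha=100).fit(age_months, cost).predict(age_months)

# With standardisation, both unit choices give the same model.
scaled_years = make_pipeline(StandardScaler(), Ridge(alpha=100)).fit(
    age_years, cost).predict(age_years)
scaled_months = make_pipeline(StandardScaler(), Ridge(alpha=100)).fit(
    age_months, cost).predict(age_months)

raw_gap = float(np.max(np.abs(raw_years - raw_months)))
scaled_gap = float(np.max(np.abs(scaled_years - scaled_months)))
print("largest prediction gap without scaling:", raw_gap)
print("largest prediction gap with scaling   :", scaled_gap)
```

The unscaled fits disagree because the same penalty bites harder on the feature measured in larger units of coefficient; after standardisation the two fits coincide up to floating-point error.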
4. Pre‑processing + Ridge pipeline
- Claim files contain dozens of categorical codes. One‑hotting turns each into a 0/1 flag; Ridge’s L2 penalty keeps the army of flags from blowing up the variance.
- RidgeCV tries each candidate regularisation strength, scores it with cross‑validation (5‑fold here, via cv=5), and keeps the one with the lowest validation error, which saves a manual grid search.
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
alphas = [0.1, 1, 10, 50, 100] # candidate L2 strengths
ridge = RidgeCV(alphas=alphas, cv=5)
pipe = Pipeline([
    ('prep', preprocess),
    ('ridge', ridge)
])
5. Train–test split & fit
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
6. Performance check
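The mean absolute error (MAE) is the average dollar gap between predicted and actual cost. An MAE of $530, for example, would tell the adjuster roughly how much financial buffer to keep when approving a repair estimate.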
pred = pipe.predict(X_test)
print(f"α selected by CV : {pipe.named_steps['ridge'].alpha_}")
print(f"R² on hold‑out : {r2_score(y_test, pred):.3f}")
print(f"MAE on hold‑out : ${mean_absolute_error(y_test, pred):,.0f}")
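For readers without the CSV at hand, the same pipeline and metrics can be exercised end to end on synthetic data. This is only a sketch: the rows are randomly generated, the effect sizes are invented, and only three of the tutorial's columns are used, so the resulting scores say nothing about real claims.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-in data; every value and effect size here is made up.
demo = pd.DataFrame({
    "vehicle_age_years": rng.integers(0, 21, size=n).astype(float),
    "damage_area": rng.choice(["Front", "Rear", "Side", "Total"], size=n),
    "airbag_deployed": rng.integers(0, 2, size=n),
})
area_effect = {"Front": 800.0, "Rear": 500.0, "Side": 650.0, "Total": 4000.0}
demo["claim_amount"] = (
    2000.0
    - 40.0 * demo["vehicle_age_years"]
    + demo["damage_area"].map(area_effect)
    + 900.0 * demo["airbag_deployed"]
    + rng.normal(scale=300.0, size=n)
)

num_cols = ["vehicle_age_years"]
cat_cols = ["damage_area", "airbag_deployed"]

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])),
    ("ridge", RidgeCV(alphas=[0.1, 1.0, 10.0, 50.0, 100.0], cv=5)),
])

X_train, X_test, y_train, y_test = train_test_split(
    demo[num_cols + cat_cols], demo["claim_amount"],
    test_size=0.2, random_state=42)

pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)

mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"alpha selected by CV: {pipe.named_steps['ridge'].alpha_}")
print(f"R2 on hold-out      : {r2:.3f}")
print(f"MAE on hold-out     : ${mae:,.0f}")
```

Because the synthetic cost is linear in the features by construction, the fit is far cleaner than anything to expect from real claim files.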
7. Which factors cost the most?
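Ridge coefficients are expressed in the target's units, so each one reads as a dollar change in predicted cost: for example (illustrative figures, not results from this dataset), “airbag deployed adds $950 on average” or “SUV body style adds $430.” That clarity is valuable for actuaries.
Because vehicle_age_years was z‑scored, its coefficient reads as the dollar change for a one‑standard‑deviation increase in age. The encoder above keeps every level of each categorical feature (no category is dropped), so there is no omitted reference category: each one‑hot coefficient is a dollar uplift (or discount) relative to the other levels of the same feature, and the difference between two levels' coefficients is the estimated cost gap between them.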
# rebuild feature names after one‑hot encoding
ohe = pipe.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])
coefs = pd.Series(pipe.named_steps['ridge'].coef_,
                  index=feature_names).sort_values()
print("\nTop cost *reducers*:")
print(coefs.head(8))
print("\nTop cost *drivers*:")
print(coefs.tail(8))
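When reading one-hot coefficients, it helps to know how Ridge behaves if every level of a feature gets its own column. The sketch below uses synthetic body-style data with invented dollar effects and a plain Ridge with an arbitrary alpha; it shows that the coefficients within one feature sum to (numerically) zero, so they are best compared with each other.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(7)
n = 500

# Made-up body styles and made-up dollar effects.
body = rng.choice(["Hatch", "SUV", "Sedan"], size=n)
effect = {"Hatch": 0.0, "SUV": 430.0, "Sedan": 100.0}
y = 1500.0 + np.array([effect[b] for b in body]) + rng.normal(scale=50.0, size=n)

# Full one-hot encoding: one column per level, none dropped.
X = pd.get_dummies(pd.Series(body, name="vehicle_body"), dtype=float)
model = Ridge(alpha=1.0).fit(X, y)

coefs = pd.Series(model.coef_, index=X.columns)
coef_sum = float(coefs.sum())
suv_vs_hatch = float(coefs["SUV"] - coefs["Hatch"])

print(coefs)
print("sum of coefficients:", coef_sum)
print("SUV vs Hatch gap   :", suv_vs_hatch)
```

The individual values are offsets around the intercept, while the gap between two levels (here SUV versus Hatch) recovers the cost difference, slightly shrunk by the penalty.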
8. Save the model for production use
joblib.dump(pipe, "ridge_vehicle_repair_cost.pkl")
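On the consuming side, the saved pipeline can be reloaded and applied to a new claim. The sketch below is self-contained, so it first fits a tiny stand-in pipeline on made-up rows and saves it to a temporary file with a hypothetical name; in practice the file written above would be loaded instead.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up training set, just enough to have a fitted pipeline to save.
train = pd.DataFrame({
    "vehicle_age_years": [1.0, 4.0, 8.0, 12.0, 3.0, 15.0],
    "damage_area": ["Front", "Rear", "Side", "Front", "Total", "Rear"],
    "claim_amount": [2400.0, 1800.0, 2100.0, 1900.0, 6500.0, 1200.0],
})
features = ["vehicle_age_years", "damage_area"]

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["damage_area"]),
        ("num", StandardScaler(), ["vehicle_age_years"]),
    ])),
    ("ridge", Ridge(alpha=1.0)),
]).fit(train[features], train["claim_amount"])

# Save to a temporary location (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "ridge_demo.pkl")
joblib.dump(pipe, path)

# Later, in the scoring step: reload and score one incoming claim.
loaded = joblib.load(path)
new_claim = pd.DataFrame([{"vehicle_age_years": 6.0, "damage_area": "Side"}])
estimate = float(loaded.predict(new_claim)[0])
print(f"Estimated repair cost: ${estimate:,.0f}")
```

Because preprocessing lives inside the pipeline, the caller passes raw field values and never has to repeat the encoding or scaling steps. Pickled pipelines should be loaded with a scikit-learn version compatible with the one used for training.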
Summary
- A Ridge‑regression pipeline converts raw first‑notice‑of‑loss fields into an instant repair‑cost estimate.
- Regularisation tames multicollinearity among overlapping damage codes while keeping the model linear, fast, and interpretable.
- The resulting coefficients highlight the main cost drivers (body style, damage area, airbag deployment), so claims teams can see where to focus negotiation with repair shops.