Insurance Cost Trend Prediction using Polynomial Regression in ML

Before renewing premiums, actuaries and underwriting teams need to anticipate individual insurance cost trends: specifically, how a policyholder's annual medical-claims charges (USD) change with risk factors. Historical policy data show that charges depend nonlinearly on age, body-mass index (BMI), number of dependents, smoking status, and region. For example, costs accelerate with older age and higher BMI but plateau in specific risk brackets. A simple linear model underfits this curvature, while an unconstrained high-degree polynomial overfits outliers. Polynomial Regression on engineered features with ℓ² regularisation (Ridge) can model smooth, nonlinear cost trends and deliver interpretable forecasts to guide premium adjustments and risk mitigation.
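
The underfitting argument can be illustrated with a minimal sketch on synthetic data. The quadratic cost curve, noise level, and variable names below are assumptions made up for illustration, not values from the insurance dataset; the sketch only shows that a degree-2 Ridge model explains more variance than a straight line when the true relationship is curved.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, illustrative data only: cost grows with the square of age plus noise
rng = np.random.default_rng(0)
age = rng.uniform(18, 64, size=300).reshape(-1, 1)
cost = 2000 + 3.0 * age.ravel() ** 2 + rng.normal(0, 500, size=300)

# Straight-line fit: cannot follow the curvature
linear = LinearRegression().fit(age, cost)

# Degree-2 polynomial terms with an L2 (Ridge) penalty
poly_ridge = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),
).fit(age, cost)

r2_linear = linear.score(age, cost)
r2_poly = poly_ridge.score(age, cost)
print(f"Linear R^2        : {r2_linear:.3f}")
print(f"Degree-2 Ridge R^2: {r2_poly:.3f}")
```

The scores here are computed on the training data, so they show goodness of fit only; the tutorial below uses a held-out test set for an honest estimate.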

Libraries Required

import pandas as pd                      # data loading & handling  
import numpy as np                       # numerical operations  

import matplotlib.pyplot as plt          # plotting  
import seaborn as sns                    # visualization  

from sklearn.model_selection import train_test_split, GridSearchCV  
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder  
from sklearn.compose import ColumnTransformer  
from sklearn.linear_model import Ridge  
from sklearn.pipeline import Pipeline  
from sklearn.metrics import mean_squared_error, r2_score 

Dataset

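Medical Cost Personal Dataset (insurance.csv). This tutorial uses its age, bmi, children, smoker, and region columns as predictors and charges as the target.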

Step-by-Step Code Implementation

Load Data & Libraries

import pandas as pd

# Load the dataset
df = pd.read_csv("data/insurance.csv")

# Preview
df.head()[['age','bmi','children','smoker','region','charges']]

Explore Nonlinear Trends

import seaborn as sns
import matplotlib.pyplot as plt

# Age vs Charges
sns.scatterplot(x='age', y='charges', hue='smoker', data=df, alpha=0.6)
plt.title("Age vs Insurance Charges")
plt.xlabel("Age")
plt.ylabel("Charges (USD)")
plt.show()
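
A scatter plot shows the shape of the trend; a binned summary can put rough numbers on it. The sketch below is one possible way to do that, run on a synthetic stand-in frame (the `demo` DataFrame, its coefficients, and the age bands are all illustrative assumptions). With the real data, the same `pd.cut` and `groupby` calls could be applied to `df` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the insurance data (illustrative values, not real records)
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
})
demo["charges"] = (
    2000 + 4.0 * demo["age"] ** 2
    + np.where(demo["smoker"] == "yes", 20000, 0)
    + rng.normal(0, 1500, size=n)
)

# Bin age into bands, then compare mean charges per band and smoking status
demo["age_band"] = pd.cut(demo["age"], bins=[17, 30, 40, 50, 64])
trend = (
    demo.groupby(["age_band", "smoker"], observed=True)["charges"]
        .mean()
        .unstack("smoker")
)
print(trend.round(0))
```

If the gap between consecutive age bands widens from row to row, the relationship is curved rather than linear, which is the pattern polynomial terms are meant to capture.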

Define Features & Target

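The model uses three numeric predictors (age, bmi, children) and two categorical ones (smoker, region). After scaling and one-hot encoding these become seven inputs, which the pipeline's PolynomialFeatures step expands into squared and interaction terms (e.g., age², age×bmi, smoker_yes×bmi²), capturing curvature and synergy effects.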

# Categorical encoding: smoker (yes/no) and region
cat_cols = ['smoker','region']
num_cols = ['age','bmi','children']

X = df[num_cols + cat_cols]
y = df['charges']

Build Polynomial Regression Pipeline

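  • StandardScaler: z‑scales the numeric predictors (age, bmi, children).
  • OneHotEncoder: transforms smoker and region into binary flags, dropping the first level of each as the baseline.
  • PolynomialFeatures: expands the encoded inputs into powers and interaction terms; its degree is tuned in the next step.
  • Ridge: applies an ℓ² penalty to control overfitting from the high‑dimensional polynomial feature space.
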
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

preprocessor = ColumnTransformer([
    ('num', StandardScaler(),             num_cols),
    ('cat', OneHotEncoder(drop='first'),  cat_cols)
])

pipe = Pipeline([
    ('prep', preprocessor),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge(random_state=42))
])

Train/Test Split & Hyperparameter Search

GridSearchCV: Tunes polynomial degree (1–3) and regularisation strength α (10⁻³…10³) via 5‑fold CV, optimising for lowest RMSE on held‑out folds.

from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7)
}

gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print("Best parameters:", gs.best_params_)
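
Beyond the single best combination, it can help to look at the full grid of cross-validated scores. The sketch below runs the same kind of search on synthetic data (the `X_demo`, `y_demo`, `demo_pipe`, and `demo_gs` names are hypothetical stand-ins) and converts scikit-learn's negated RMSE back into a positive value for reading. With the real search, the same steps could be applied to `gs.cv_results_`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data standing in for the training split (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (5000 + 3000 * X_demo[:, 0] ** 2 + 2000 * X_demo[:, 1]
          + rng.normal(0, 500, size=200))

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])
demo_grid = {"poly__degree": [1, 2, 3], "ridge__alpha": np.logspace(-3, 3, 7)}
demo_gs = GridSearchCV(
    demo_pipe, demo_grid, cv=5, scoring="neg_root_mean_squared_error"
).fit(X_demo, y_demo)

# The scorer is negated so that "higher is better"; flip the sign to read RMSE
results = pd.DataFrame(demo_gs.cv_results_)
results["cv_rmse"] = -results["mean_test_score"]
top = results.sort_values("cv_rmse")[
    ["param_poly__degree", "param_ridge__alpha", "cv_rmse"]
].head(5)
print(top.to_string(index=False))
```

If several combinations score almost identically, preferring the lower degree or the larger α gives a simpler model at little cost in accuracy.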

Evaluate Model

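from sklearn.metrics import mean_squared_error, r2_score

y_pred = gs.predict(X_test)
# squared=False is deprecated in scikit-learn 1.4 and removed in 1.6; take the root directly
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)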

print(f"Test RMSE: ${rmse:,.2f}")
print(f"Test R²  : {r2:.3f}")
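
For readers who want to see what the two metrics measure, here is a small worked example that computes RMSE and R² by hand and checks them against scikit-learn. The four charge values and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up charges and predictions (USD), for illustration only
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_hat = np.array([1100.0, 1900.0, 3200.0, 3800.0])

# RMSE: square root of the mean squared error, in the same units as charges
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_hat)

print(f"RMSE: {rmse_manual:.2f}")  # RMSE: 158.11
print(f"R²  : {r2_manual:.3f}")    # R²  : 0.980
```

RMSE reads as a typical error in dollars, while R² is the share of variance in charges that the model explains.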

Inspect Key Polynomial Coefficients

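This step ranks the polynomial features by absolute coefficient size, giving underwriters a rough guide to how age, BMI, smoking status and their interactions drive claim-cost trends. Because the numeric inputs are standardised while the one-hot flags are not, and Ridge shrinks correlated terms together, treat the ranking as indicative rather than exact.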

# Retrieve feature names after polynomial expansion
poly = gs.best_estimator_.named_steps['poly']
# Get names from preprocessor
num_features = num_cols
cat_features = gs.best_estimator_.named_steps['prep'] \
                    .named_transformers_['cat'] \
                    .get_feature_names_out(cat_cols).tolist()
all_inputs = num_features + cat_features

feat_names = poly.get_feature_names_out(input_features=all_inputs)
coefs = gs.best_estimator_.named_steps['ridge'].coef_

import pandas as pd
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)

# Plot top 10
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Charges")
plt.xlabel("Coefficient magnitude")
plt.tight_layout()
plt.show()
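
Taking absolute values hides whether a term raises or lowers predicted charges. As a minimal sketch, assuming a fitted Ridge model on synthetic data with hypothetical feature names (`feat_a`, `feat_b`, `feat_c`), the coefficients can be ordered by magnitude while keeping their signs:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic data with known effects (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (4.0 * X_demo[:, 0] - 2.5 * X_demo[:, 1] + 0.1 * X_demo[:, 2]
          + rng.normal(0, 0.1, size=300))

model = Ridge(alpha=1.0).fit(X_demo, y_demo)
signed = pd.Series(model.coef_, index=["feat_a", "feat_b", "feat_c"])

# Order by absolute size, but keep the original signed values
ranked = signed.reindex(signed.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```

In the tutorial's code, the same `reindex` step could be applied to a signed series built from `coefs` and `feat_names` before plotting.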

Summary

This Polynomial Regression pipeline:

1. Captures nonlinear cost trends for individual policyholders, with accuracy measured by RMSE and R² on a held-out test set.

2. Controls model complexity via Ridge shrinkage, balancing bias‑variance in the expanded feature space.

3. Yields interpretable insights: top polynomial features (e.g., age², bmi×smoker_yes) reveal actionable risk drivers to inform premium setting and targeted wellness programs.
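
To connect the pipeline to premium decisions, the sketch below shows how a fitted model of this shape could score new policyholders. Everything here is an assumption for illustration: the training frame is synthetic, the cost formula is invented, and the degree and α are fixed rather than tuned. With the tutorial's fitted search, `gs.predict` would be called on a DataFrame with the same five columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Synthetic training data (illustrative only, not real policy records)
rng = np.random.default_rng(7)
n = 400
train = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n).round(1),
    "children": rng.integers(0, 5, size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], size=n),
})
charges = (
    3.0 * train["age"] ** 2
    + np.where(train["smoker"] == "yes", 800.0 * train["bmi"], 0.0)
    + rng.normal(0, 1000, size=n)
)

num_cols, cat_cols = ["age", "bmi", "children"], ["smoker", "region"]
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(drop="first"), cat_cols),
    ])),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("ridge", Ridge(alpha=1.0)),
]).fit(train, charges)

# Two hypothetical policyholders who differ only in smoking status
new_policyholders = pd.DataFrame({
    "age": [45, 45], "bmi": [31.0, 31.0], "children": [2, 2],
    "smoker": ["no", "yes"], "region": ["southeast", "southeast"],
})
forecast = model.predict(new_policyholders)
for status, value in zip(new_policyholders["smoker"], forecast):
    print(f"smoker={status}: predicted charges ${value:,.0f}")
```

Comparing otherwise identical profiles in this way is one simple means of showing how a single risk factor shifts the forecast.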


ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.
