Insurance Cost Trend Prediction using Polynomial Regression in ML


Actuaries and underwriting teams need to anticipate individual insurance cost trends, specifically how a policyholder’s annual medical-claims charges (USD) vary with risk factors, before renewing premiums. Historical policy data show that charges depend nonlinearly on age, body-mass index (BMI), number of dependents, smoking status, and region. For example, costs accelerate with older age and higher BMI but plateau in specific risk brackets. A simple linear model underfits this curvature, while an unconstrained high-degree polynomial overfits outliers. Polynomial regression on engineered features with ℓ² regularisation (Ridge) can model smooth, nonlinear cost trends and deliver interpretable forecasts to guide premium adjustments and risk mitigation.
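The underfitting and overfitting trade-off described above can be illustrated with a small sketch. It uses synthetic data (a made-up quadratic trend, not the insurance dataset) and plain NumPy polynomial fits without regularisation; the helper name heldout_rmse and all numeric values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a curved cost trend (not the insurance data):
# cost grows quadratically with a scaled risk factor x, plus noise.
x = rng.uniform(-1.0, 1.0, size=60)
y = 5.0 + 2.0 * x + 4.0 * x**2 + rng.normal(0.0, 0.5, size=60)

# Hold out the last 20 points to measure generalisation.
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

def heldout_rmse(degree):
    """Fit an unregularised polynomial of the given degree; return held-out RMSE."""
    coefs = np.polyfit(x_train, y_train, deg=degree)
    pred = np.polyval(coefs, x_test)
    return float(np.sqrt(np.mean((y_test - pred) ** 2)))

rmse_by_degree = {d: heldout_rmse(d) for d in (1, 2, 9)}
for d, err in rmse_by_degree.items():
    print(f"degree {d}: held-out RMSE = {err:.3f}")
```

On data like this, the straight line should miss the curvature and score clearly worse than the quadratic. How much the degree-9 fit degrades depends on the noise and sample size, which is the variance that the Ridge penalty is meant to rein in.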

Libraries Required

import pandas as pd                      # data loading & handling  
import numpy as np                       # numerical operations  

import matplotlib.pyplot as plt          # plotting  
import seaborn as sns                    # visualization  

from sklearn.model_selection import train_test_split, GridSearchCV  
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder  
from sklearn.compose import ColumnTransformer  
from sklearn.linear_model import Ridge  
from sklearn.pipeline import Pipeline  
from sklearn.metrics import mean_squared_error, r2_score

Dataset

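Medical Cost Personal Dataset. The code below assumes it has been saved locally as data/insurance.csv and uses the columns age, bmi, children, smoker, region, and charges.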

Step-by-Step Code Implementation

Load Data & Libraries

import pandas as pd

# Load the dataset
df = pd.read_csv("data/insurance.csv")

# Preview
df.head()[['age','bmi','children','smoker','region','charges']]

Explore Nonlinear Trends

import seaborn as sns
import matplotlib.pyplot as plt

# Age vs Charges
sns.scatterplot(x='age', y='charges', hue='smoker', data=df, alpha=0.6)
plt.title("Age vs Insurance Charges")
plt.xlabel("Age")
plt.ylabel("Charges (USD)")
plt.show()

Define Features & Target

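Separates the predictors into numeric columns (age, bmi, children) and categorical columns (smoker, region), with charges as the target. Once scaled and one-hot encoded in the pipeline, these five columns become seven transformed inputs, which the polynomial step expands into squared and interaction terms (e.g., age², age×bmi, smoker_yes×bmi²) to capture curvature and synergy effects.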

# Categorical encoding: smoker (yes/no) and region
cat_cols = ['smoker','region']
num_cols = ['age','bmi','children']

X = df[num_cols + cat_cols]
y = df['charges']
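To get a feel for how quickly the polynomial expansion grows, the number of generated columns can be counted directly. This sketch assumes the seven transformed inputs mentioned above (three numeric columns, one smoker flag, and three region flags after dropping the first level) and a full expansion without the bias column; the helper name n_poly_features is made up for this example.

```python
from math import comb

def n_poly_features(n_inputs, degree):
    """Count the monomials of total degree 1..degree in n_inputs variables
    (a full polynomial expansion with the constant/bias term excluded)."""
    return comb(n_inputs + degree, degree) - 1

# 3 scaled numeric columns + 1 smoker flag + 3 region flags
n_inputs = 3 + 1 + 3

counts = {degree: n_poly_features(n_inputs, degree) for degree in (1, 2, 3)}
for degree, n_cols in counts.items():
    print(f"degree {degree}: {n_cols} features")
# degree 1: 7 features
# degree 2: 35 features
# degree 3: 119 features
```

Some of these columns are redundant for one-hot flags: the square of a 0/1 flag equals the flag itself, and the product of two region flags is always zero. That is one reason a regularised model is a sensible choice here.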

Build Polynomial Regression Pipeline

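StandardScaler: z‑scales the numeric predictors (age, bmi, children).
OneHotEncoder: transforms smoker and region into binary flags, dropping the first level of each to avoid redundant columns.
PolynomialFeatures: expands the preprocessed inputs into powers and interaction terms; the degree is chosen during the hyperparameter search.
Ridge: applies an ℓ² penalty to control overfitting in the high‑dimensional polynomial feature space.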

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

preprocessor = ColumnTransformer([
    ('num', StandardScaler(),             num_cols),
    ('cat', OneHotEncoder(drop='first'),  cat_cols)
])

pipe = Pipeline([
    ('prep', preprocessor),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge(random_state=42))
])
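The effect of the ℓ² penalty can be sketched with the ridge normal equations on a small synthetic problem. This is an illustration only, not how scikit-learn's Ridge is implemented internally: it assumes centred data with no intercept, and the function name ridge_closed_form and the toy data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, centred design matrix with two strongly correlated columns.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)
y = y - y.mean()

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y, the ridge normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms = {}
for alpha in (0.0, 1.0, 100.0):
    w = ridge_closed_form(X, y, alpha)
    norms[alpha] = float(np.linalg.norm(w))
    print(f"alpha={alpha:>6}: w={np.round(w, 3)}, ||w||={norms[alpha]:.3f}")
```

As alpha grows, the coefficient vector shrinks toward zero and the weight is shared more evenly between the two correlated columns, which is the stabilising behaviour the pipeline relies on once the polynomial step has created many correlated terms.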

Train/Test Split & Hyperparameter Search

GridSearchCV: Tunes polynomial degree (1–3) and regularisation strength α (10⁻³…10³) via 5‑fold CV, optimising for lowest RMSE on held‑out folds.

from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7)
}

gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print("Best parameters:", gs.best_params_)
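Because the scorer is neg_root_mean_squared_error, the score stored on the fitted search is a negated RMSE. The sketch below shows how to read it back as an ordinary RMSE. It runs on a synthetic one-feature problem rather than the insurance data, and the names demo_pipe, demo_gs, X_demo, and y_demo are placeholders introduced for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)

# Synthetic one-feature problem with a quadratic signal (not the insurance data).
X_demo = rng.uniform(-2.0, 2.0, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] ** 2 + rng.normal(0.0, 0.1, size=200)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])

demo_gs = GridSearchCV(
    demo_pipe,
    {"poly__degree": [1, 2, 3], "ridge__alpha": [0.01, 1.0, 100.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_gs.fit(X_demo, y_demo)

# The scorer is negated so that "higher is better"; flip the sign to read it as RMSE.
best_cv_rmse = -demo_gs.best_score_
print("Best parameters:", demo_gs.best_params_)
print(f"Best mean CV RMSE: {best_cv_rmse:.3f}")
```

The same sign flip applies to the search fitted on the insurance data: -gs.best_score_ gives the mean cross-validated RMSE in USD for the winning degree and alpha.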

Evaluate Model

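import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = gs.predict(X_test)
# Take the square root explicitly: the squared=False argument was deprecated
# in scikit-learn 1.4 and later removed
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)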

print(f"Test RMSE: ${rmse:,.2f}")
print(f"Test R²  : {r2:.3f}")
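For readers who want to see what the two metrics measure, here is a hand computation on four made-up charge values. The numbers are illustrative only and are not results from the insurance model.

```python
import numpy as np

# Made-up charges in USD, chosen so the arithmetic is easy to follow.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual, in the same units as the target.
rmse_manual = float(np.sqrt(np.mean(residuals ** 2)))

# R²: one minus the ratio of residual variation to total variation around the mean.
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.2f}")   # RMSE: 10.00
print(f"R²  : {r2_manual:.3f}")     # R²  : 0.992
```

RMSE reads as a typical prediction error in dollars, while R² is unitless and compares the model against always predicting the mean charge.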

Inspect Key Polynomial Coefficients

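Ranks the polynomial features by absolute coefficient size, showing underwriters how age, BMI, smoking status, and their interactions relate to claim-cost trends. The coefficients apply to the scaled inputs and the ranking discards sign, so it indicates the strength of each term in the fitted model rather than its direction.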

import pandas as pd
import matplotlib.pyplot as plt

best = gs.best_estimator_

# Input names fed to the polynomial step: numeric columns first, then one-hot flags
cat_features = best.named_steps['prep'] \
                   .named_transformers_['cat'] \
                   .get_feature_names_out(cat_cols).tolist()
all_inputs = num_cols + cat_features

# Feature names after polynomial expansion, paired with the Ridge coefficients
feat_names = best.named_steps['poly'].get_feature_names_out(input_features=all_inputs)
coefs = best.named_steps['ridge'].coef_

# Rank by absolute size (the sign is discarded here)
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)

# Plot top 10
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Charges")
plt.xlabel("Coefficient magnitude")
plt.tight_layout()
plt.show()
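Taking absolute values hides whether a term pushes charges up or down. One possible variation, sketched below, orders the terms by magnitude while keeping their signs. The coefficient values are hypothetical and chosen only to show the reordering; they are not fitted results.

```python
import pandas as pd

# Hypothetical coefficients, for illustration only (not fitted values).
coefs = pd.Series({
    "age^2": 1.5,
    "bmi smoker_yes": 4.0,
    "children": -0.3,
    "age bmi": -2.5,
})

# Order by absolute size, then look the signed values back up in that order.
order = coefs.abs().sort_values(ascending=False).index
ranked = coefs.reindex(order)
print(ranked)
```

Applied to the fitted model, the same reindexing on pd.Series(coefs, index=feat_names) would let a bar chart show positive and negative drivers on either side of zero.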

Summary

This Polynomial Regression pipeline:

1. Predicts nonlinear cost trends for individual policyholders, with accuracy measured by RMSE and R² on a held-out test set.

2. Controls model complexity via Ridge shrinkage, balancing bias‑variance in the expanded feature space.

3. Yields interpretable insights: top polynomial features (e.g., age², bmi×smoker_yes) reveal actionable risk drivers to inform premium setting and targeted wellness programs.
