Insurance Cost Trend Prediction using Polynomial Regression in ML
Actuaries and underwriting teams need to anticipate individual insurance cost trends before renewing premiums: specifically, how a policyholder's annual medical-claims charges (USD) change with risk factors. Historical policy data show that charges depend nonlinearly on age, body-mass index (BMI), number of dependents, smoking status, and region. For example, costs accelerate with older age and higher BMI but plateau in specific risk brackets. A simple linear model underfits this curvature, while an unconstrained high-degree polynomial overfits outliers. Polynomial regression on engineered features with ℓ² regularisation (Ridge) can model smooth, nonlinear cost trends and deliver interpretable forecasts to guide premium adjustments and risk mitigation.
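The underfitting claim above can be illustrated on synthetic data. The following is a minimal sketch, not the insurance dataset: the curve y = x² plus noise and the helper name compare_linear_vs_quadratic are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def compare_linear_vs_quadratic(seed=0):
    """Fit a straight line and a degree-2 polynomial to a synthetic curved trend.

    Returns the held-out R² of both models (linear first).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3, 3, size=(300, 1))                # synthetic predictor
    y = x[:, 0] ** 2 + rng.normal(0, 0.5, size=300)      # curved trend plus noise
    X_tr, X_te, y_tr, y_te = train_test_split(
        x, y, test_size=0.25, random_state=seed
    )
    linear = LinearRegression().fit(X_tr, y_tr)
    quadratic = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        LinearRegression(),
    ).fit(X_tr, y_tr)
    return linear.score(X_te, y_te), quadratic.score(X_te, y_te)


r2_linear, r2_quadratic = compare_linear_vs_quadratic()
print(f"linear R²: {r2_linear:.3f}   quadratic R²: {r2_quadratic:.3f}")
```

On a symmetric curve like this one, the straight line explains almost none of the variance, while the quadratic fit recovers most of it.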
Libraries Required
import pandas as pd  # data loading & handling
import numpy as np  # numerical operations
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # visualization
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
Dataset
Step-by-Step Code Implementation
Load Data & Libraries
import pandas as pd
# Load the dataset
df = pd.read_csv("data/insurance.csv")
# Preview
df.head()[['age','bmi','children','smoker','region','charges']]
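Before modelling, it can help to confirm that the loaded frame has the expected columns and no gaps. The sketch below is one possible check; the helper validate_insurance_frame and the toy rows are hypothetical and only reuse the column names from the preview above.

```python
import pandas as pd

EXPECTED_COLUMNS = ['age', 'bmi', 'children', 'smoker', 'region', 'charges']


def validate_insurance_frame(df):
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    present = [c for c in EXPECTED_COLUMNS if c in df.columns]
    n_null = int(df[present].isna().sum().sum())
    if n_null:
        problems.append(f"{n_null} missing values")
    if 'charges' in df.columns and (df['charges'] < 0).any():
        problems.append("negative charges")
    return problems


# Made-up rows, used only to exercise the helper
toy = pd.DataFrame({
    'age': [25, 52],
    'bmi': [22.5, 31.0],
    'children': [0, 2],
    'smoker': ['no', 'yes'],
    'region': ['northeast', 'southwest'],
    'charges': [3200.0, 41000.0],
})
print(validate_insurance_frame(toy))
print(validate_insurance_frame(toy.drop(columns=['bmi'])))
```

Calling validate_insurance_frame(df) on the real data right after pd.read_csv would surface schema problems before they turn into pipeline errors.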
Explore Nonlinear Trends
import seaborn as sns
import matplotlib.pyplot as plt
# Age vs Charges
sns.scatterplot(x='age', y='charges', hue='smoker', data=df, alpha=0.6)
plt.title("Age vs Insurance Charges")
plt.xlabel("Age")
plt.ylabel("Charges (USD)")
plt.show()
Define Features & Target
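The model uses three numeric predictors (age, bmi, children) and two categorical ones (smoker, region); the target is charges. After scaling and one-hot encoding these become seven inputs, which the PolynomialFeatures step of the pipeline expands into squared and interaction terms (e.g., age², age×bmi and, at degree 3, smoker_yes×bmi²), capturing curvature and synergy effects.
# Categorical columns to encode: smoker (yes/no) and region
cat_cols = ['smoker','region']
# Numeric columns to scale
num_cols = ['age','bmi','children']
X = df[num_cols + cat_cols]
y = df['charges']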
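To get a feel for how quickly the expansion grows, the number of generated terms can be counted directly. This is a small sketch: n_poly_terms is a hypothetical helper, and the count of seven inputs is taken from the description above.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_terms(n_inputs, degree):
    """Number of monomials of total degree 1..degree in n_inputs variables (no bias)."""
    return comb(n_inputs + degree, degree) - 1


n_inputs = 7  # 3 scaled numerics + 1 smoker flag + 3 region flags
for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    poly.fit(np.zeros((1, n_inputs)))  # only the shape matters for counting
    print(degree, poly.n_output_features_, n_poly_terms(n_inputs, degree))
```

Some of these columns are redundant for one-hot flags: the square of a 0/1 flag equals the flag itself, and the product of two flags from the same categorical variable is always zero. Regularisation helps keep such columns from destabilising the fit.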
Build Polynomial Regression Pipeline
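- StandardScaler z-scales the numeric predictors (age, bmi, children).
- OneHotEncoder transforms smoker and region into binary flags, dropping the first level of each (drop='first') to avoid a redundant column.
- PolynomialFeatures generates the polynomial and interaction terms; its degree is chosen later by the grid search.
- Ridge applies an ℓ² penalty to control overfitting in the high-dimensional polynomial feature space.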
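The effect of the ℓ² penalty can be seen by fitting Ridge at several strengths and watching the coefficient vector shrink. The sketch below uses synthetic data; X_demo, y_demo and coef_norm are illustrative names, not part of the insurance pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem: only the first of ten features carries signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 10))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=50)


def coef_norm(alpha):
    """Euclidean norm of the fitted Ridge coefficients for a given penalty strength."""
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    return float(np.linalg.norm(model.coef_))


norms = {alpha: coef_norm(alpha) for alpha in (0.01, 1.0, 100.0)}
for alpha, norm in norms.items():
    print(f"alpha={alpha:<6} ||coef||={norm:.3f}")
```

Larger alpha values pull the coefficients toward zero, which is the mechanism that keeps the expanded polynomial feature space from overfitting.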
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
preprocessor = ColumnTransformer([
('num', StandardScaler(), num_cols),
('cat', OneHotEncoder(drop='first'), cat_cols)
])
pipe = Pipeline([
('prep', preprocessor),
('poly', PolynomialFeatures(include_bias=False)),
('ridge', Ridge(random_state=42))
])
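As a quick smoke test, the same pipeline structure can be fitted on a small synthetic frame and asked for a single prediction. Everything in this sketch is assumed for illustration: the toy values, the region labels, the charge formula, and the fixed degree=2 and alpha=1.0.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

num_cols = ['age', 'bmi', 'children']
cat_cols = ['smoker', 'region']

# Synthetic stand-in for the real data (all values are made up)
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    'age': rng.integers(18, 65, size=n),
    'bmi': rng.uniform(18, 40, size=n),
    'children': rng.integers(0, 4, size=n),
    'smoker': np.tile(['yes', 'no', 'no'], n // 3),
    'region': np.tile(['northeast', 'northwest', 'southeast', 'southwest'], n // 4),
})
toy_charges = (
    200 * toy['age']
    + 15000 * (toy['smoker'] == 'yes')
    + rng.normal(0, 500, size=n)
)

toy_pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(drop='first'), cat_cols),
    ])),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('ridge', Ridge(alpha=1.0)),
])
toy_pipe.fit(toy, toy_charges)

# One hypothetical policyholder, passed with the same raw column names
new_policyholder = pd.DataFrame([{
    'age': 45, 'bmi': 31.0, 'children': 2, 'smoker': 'yes', 'region': 'southeast',
}])
prediction = toy_pipe.predict(new_policyholder)
n_expanded = toy_pipe.named_steps['poly'].n_output_features_
print(f"expanded features: {n_expanded}, predicted charge: {prediction[0]:,.0f}")
```

Because preprocessing lives inside the pipeline, new records can be scored from raw columns without repeating the scaling or encoding by hand.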
Train/Test Split & Hyperparameter Search
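GridSearchCV tunes the polynomial degree (1–3) and the regularisation strength α (seven log-spaced values from 10⁻³ to 10³) via 5-fold CV, selecting the combination with the lowest RMSE on held-out folds. That is 21 candidate settings and 105 model fits in total.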
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
'poly__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
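Beyond the single best setting, it is often useful to see how the cross-validated RMSE varies over the whole grid. The sketch below assumes a fitted GridSearchCV whose steps are named poly and ridge, as in the pipeline above; cv_rmse_table is a hypothetical helper, and the demo search runs on synthetic data so the block is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


def cv_rmse_table(search):
    """Pivot the mean CV RMSE of a fitted GridSearchCV into a degree x alpha table."""
    table = pd.DataFrame(search.cv_results_['params'])
    # The scorer is negated RMSE, so flip the sign to get RMSE back
    table['rmse'] = -search.cv_results_['mean_test_score']
    return table.pivot(index='poly__degree', columns='ridge__alpha', values='rmse')


# Synthetic demo so the helper can be run without the insurance data
X_demo, y_demo = make_regression(n_samples=120, n_features=3, noise=10.0, random_state=0)
demo_search = GridSearchCV(
    Pipeline([('poly', PolynomialFeatures(include_bias=False)), ('ridge', Ridge())]),
    {'poly__degree': [1, 2], 'ridge__alpha': [0.1, 10.0]},
    cv=5,
    scoring='neg_root_mean_squared_error',
).fit(X_demo, y_demo)

rmse_table = cv_rmse_table(demo_search)
print(rmse_table.round(2))
```

With the tutorial's search, cv_rmse_table(gs) would give a 3×7 table, making it easy to see whether the chosen degree and α sit in a flat region or on the edge of the grid.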
Evaluate Model
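from sklearn.metrics import mean_squared_error, r2_score
y_pred = gs.predict(X_test)
# Take the square root explicitly: squared=False is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.2f}")
print(f"Test R² : {r2:.3f}")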
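To make the two metrics concrete, they can be computed by hand. This is a minimal NumPy sketch with made-up numbers; it treats R² as one minus the squared ratio of the model's RMSE to the RMSE of always predicting the mean of the observed values.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (USD here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """R² as 1 - (model RMSE / mean-baseline RMSE)²."""
    y_true = np.asarray(y_true, dtype=float)
    baseline = np.full_like(y_true, y_true.mean())
    return 1.0 - (rmse(y_true, y_pred) / rmse(y_true, baseline)) ** 2


# Made-up charges, for illustration only
y_true_demo = [1000, 2000, 3000, 4000]
y_pred_demo = [1100, 1900, 3200, 3800]
print(f"RMSE: {rmse(y_true_demo, y_pred_demo):.2f}")   # 158.11
print(f"R²  : {r2(y_true_demo, y_pred_demo):.2f}")     # 0.98
```

Reading RMSE against the mean-baseline RMSE gives a sense of scale: an RMSE of a few thousand dollars means little until it is compared with how much charges vary in the first place.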
Inspect Key Polynomial Coefficients
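This step ranks the polynomial features by absolute coefficient size, showing underwriters how age, BMI, smoking status and their interactions drive claim-cost trends. Because the numeric inputs are standardised and the one-hot flags are not, the magnitudes are only roughly comparable across feature types.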
# Retrieve feature names after polynomial expansion
poly = gs.best_estimator_.named_steps['poly']
# Get names from preprocessor
num_features = num_cols
cat_features = gs.best_estimator_.named_steps['prep'] \
.named_transformers_['cat'] \
.get_feature_names_out(cat_cols).tolist()
all_inputs = num_features + cat_features
feat_names = poly.get_feature_names_out(input_features=all_inputs)
coefs = gs.best_estimator_.named_steps['ridge'].coef_
import pandas as pd
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)
# Plot top 10
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Charges")
plt.xlabel("Coefficient magnitude")
plt.tight_layout()
plt.show()
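The plot above shows magnitudes only, so it hides whether a term raises or lowers predicted charges. A possible complement, sketched here with a hypothetical helper top_signed_coefficients and toy inputs, keeps the sign while still ordering by magnitude.

```python
import pandas as pd


def top_signed_coefficients(names, coefs, k=10):
    """Return the k largest-magnitude coefficients, keeping their signs."""
    series = pd.Series(coefs, index=names)
    order = series.abs().sort_values(ascending=False).index[:k]
    return series.loc[order]


# Toy names and values, for illustration only
demo_names = ['age', 'bmi', 'age bmi', 'smoker_yes']
demo_coefs = [3.0, -5.0, 0.5, -1.0]
top_two = top_signed_coefficients(demo_names, demo_coefs, k=2)
print(top_two)
```

With the fitted model, top_signed_coefficients(feat_names, coefs) would list the same leading features as the bar chart, with the direction of each effect visible.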
Summary
This Polynomial Regression pipeline:
1. Models nonlinear cost trends for individual policyholders, with accuracy measured by RMSE and R² on a held-out test set.
2. Controls model complexity via Ridge shrinkage, balancing bias‑variance in the expanded feature space.
3. Yields interpretable insights: top polynomial features (e.g., age², bmi×smoker_yes) reveal actionable risk drivers to inform premium setting and targeted wellness programs.