Health Insurance Premium Prediction using Polynomial Regression in ML


Actuarial teams and insurers need to predict individual annual health-insurance premiums (USD) from applicant features known at underwriting: age, body mass index (BMI), number of dependents, smoking status, region, and past claims count. The dataset used below covers all of these except past claims count. The relationship between these variables and required premiums is nonlinear: premiums rise sharply with age and BMI beyond risk thresholds, smokers incur disproportionate costs, and dependents interact with family plan discounts. A naïve linear model underfits these risk curves, while an unregularised polynomial overfits outliers. Applying Polynomial Regression to engineered features with ℓ² regularisation (Ridge) lets us capture smooth risk-load curvature while keeping the premium forecasts interpretable.

Libraries Required

import pandas as pd                             # data handling
import numpy as np                              # numerical operations

import matplotlib.pyplot as plt                 # plotting
import seaborn as sns                           # visualization

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

Dataset

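Medical Cost Personal Dataset (insurance.csv). Each row describes one policyholder with the columns age, sex, bmi, children, smoker, region, and charges; this walkthrough uses all of them except sex.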

Step-by-Step Code Implementation

Load Data & Inspect

import pandas as pd

# Load the CSV
df = pd.read_csv("data/insurance.csv")

# Preview relevant columns
df[['age','bmi','children','smoker','region','charges']].head()
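
Before modelling, it can help to check for missing values and class balance. The sketch below runs those checks on a small hand-made frame; the rows are invented stand-ins that only mirror the column names of insurance.csv, so swap in df to run it on the real data.

```python
import pandas as pd

# Hypothetical stand-in rows (values are made up); only the column
# names match insurance.csv, so the checks run without the real file.
sample = pd.DataFrame({
    "age": [19, 45, 62, 33],
    "bmi": [27.9, 30.1, 26.3, 22.7],
    "children": [0, 2, 0, 1],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "charges": [16884.92, 8000.0, 27000.0, 4500.0],
})

missing_per_column = sample.isna().sum()          # NaN count per column
smoker_counts = sample["smoker"].value_counts()   # smokers vs non-smokers
numeric_summary = sample[["age", "bmi", "children", "charges"]].describe()

print(missing_per_column)
print(smoker_counts)
print(numeric_summary.loc[["min", "max"]])
```

A heavily imbalanced smoker column or a long right tail in charges would be worth noting here, since both affect how the polynomial terms behave later.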

Target Engineering & EDA

import seaborn as sns
import matplotlib.pyplot as plt

# We'll treat 'charges' as the baseline premium
# Visualise nonlinear trend: age vs charges
sns.scatterplot(x='age', y='charges', hue='smoker', data=df, alpha=0.6)
plt.title("Age vs Medical Charges by Smoking Status")
plt.xlabel("Age")
plt.ylabel("Annual Charges (USD)")
plt.show()

Define Features & Target

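The model uses three numeric columns and two categorical ones as inputs, with charges as the target. Later in the pipeline, PolynomialFeatures expands these inputs into squares and interactions (e.g., age², bmi×smoker_yes), capturing curvature and risk multipliers.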

# Categorical: smoker, region
# Numeric: age, bmi, children
cat_cols = ['smoker','region']
num_cols = ['age','bmi','children']

X = df[num_cols + cat_cols]
y = df['charges']  # use charges as predicted premium

Build Polynomial Regression Pipeline

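StandardScaler: z-scores age, BMI, and children so the numeric features share a common scale and the Ridge penalty treats their coefficients evenly.
OneHotEncoder: converts smoker and region into binary flags; drop='first' omits one reference level per category.
PolynomialFeatures: generates powers and pairwise products of the preprocessed columns; its degree is tuned in the next step.
Ridge Regression: applies an ℓ² penalty to shrink high-order coefficients, preventing overfitting in the expanded feature space.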
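
For reference, the ℓ² penalty can be written as the objective that Ridge minimises. The notation is introduced here for illustration: φ(xᵢ) stands for the expanded polynomial feature vector of applicant i, w for the coefficients, b for the unpenalised intercept, and α for the ridge__alpha value.

```latex
\min_{w,\,b}\; \sum_{i=1}^{n} \bigl(y_i - b - w^{\top}\phi(x_i)\bigr)^2 \;+\; \alpha \,\lVert w \rVert_2^2
```

With α = 0 this reduces to ordinary least squares on the polynomial terms; larger α pulls every coefficient in w towards zero.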

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Preprocess: scale numeric, one-hot encode categories
preprocessor = ColumnTransformer([
    ('num', StandardScaler(),             num_cols),
    ('cat', OneHotEncoder(drop='first'),  cat_cols)
])

pipe = Pipeline([
    ('prep', preprocessor),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge(random_state=42))
])
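
The following sketch shows how quickly the feature space grows inside such a pipeline. It assumes synthetic stand-in data (random values, not insurance.csv) and rebuilds an equivalent pipeline under the hypothetical name demo_pipe so it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
n = 40

# Synthetic stand-in rows (NOT the real insurance.csv); values are made up.
demo = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.uniform(18, 40, n),
    "children": rng.integers(0, 4, n),
    "smoker": np.tile(["yes", "no"], n // 2),
    "region": np.tile(["northeast", "northwest", "southeast", "southwest"], n // 4),
})
demo_target = 250 * demo["age"] + 20000 * (demo["smoker"] == "yes") + rng.normal(0, 500, n)

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["age", "bmi", "children"]),
        ("cat", OneHotEncoder(drop="first"), ["smoker", "region"]),
    ])),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])

# "<step name>__<parameter>" addresses a setting inside a pipeline step.
demo_pipe.set_params(poly__degree=2, ridge__alpha=1.0)
demo_pipe.fit(demo, demo_target)

n_inputs = demo_pipe.named_steps["poly"].n_features_in_        # 3 numeric + 4 dummies
n_expanded = demo_pipe.named_steps["poly"].n_output_features_  # after degree-2 expansion
print(n_inputs, n_expanded)
```

Seven preprocessed columns become 35 at degree 2, and the count rises further at degree 3. The same double-underscore naming is what a parameter grid uses to reach poly and ridge.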

Train/Test Split & Hyperparameter Search

degree: controls polynomial complexity (1 = linear, 2 = quadratic, 3 = cubic).
alpha: ℓ² regularisation strength from 10⁻³ to 10³.
GridSearchCV with 5‑fold CV selects the combination minimising RMSE.

from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7)
}

gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print("Best parameters:", gs.best_params_)
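
Beyond the single best combination, the full cross-validation table shows how sensitive the error is to degree and alpha. The sketch below fits a small grid on synthetic quadratic data (an assumption made so the example is self-contained) and reads cv_results_; the same pattern applies to a search fitted on the insurance data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.uniform(-2, 2, size=(120, 2))
# Synthetic target with a genuine quadratic term plus noise
y_demo = 3 * X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(0, 0.3, 120)

demo_search = GridSearchCV(
    Pipeline([
        ("scale", StandardScaler()),
        ("poly", PolynomialFeatures(include_bias=False)),
        ("ridge", Ridge()),
    ]),
    {"poly__degree": [1, 2, 3], "ridge__alpha": [0.01, 1.0, 100.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

results = pd.DataFrame(demo_search.cv_results_)
results["cv_rmse"] = -results["mean_test_score"]   # scores are negated RMSE
table = results[
    ["param_poly__degree", "param_ridge__alpha", "cv_rmse", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(table.head())
```

If several rows sit within one standard deviation of the best score, preferring the lower degree or larger alpha among them is a reasonable, more conservative choice.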

Evaluate Model

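import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = gs.predict(X_test)
# The squared=False argument is deprecated in scikit-learn 1.4 and later,
# so take the square root of the MSE explicitly instead.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)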

print(f"Test RMSE: ${rmse:,.2f}")
print(f"Test R²  : {r2:.3f}")
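
An RMSE in dollars is easier to judge against a naive baseline. The sketch below uses four made-up values (not model output) to show how RMSE, a predict-the-mean baseline, and R² relate.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions, purely for illustration
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_hat = np.array([1100.0, 1900.0, 3200.0, 3800.0])

rmse_demo = np.sqrt(mean_squared_error(y_true, y_hat))

# Baseline: always predict the mean of the observed values
baseline = np.full_like(y_true, y_true.mean())
rmse_baseline = np.sqrt(mean_squared_error(y_true, baseline))

r2_demo = r2_score(y_true, y_hat)

print(f"Model RMSE   : {rmse_demo:,.2f}")      # 158.11
print(f"Baseline RMSE: {rmse_baseline:,.2f}")  # 1,118.03
print(f"R²           : {r2_demo:.3f}")         # 0.980
```

Here R² equals 1 − (model RMSE / baseline RMSE)², which is why the two metrics move together. On a real test set, the baseline mean would normally come from the training targets.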

Inspect Key Polynomial Coefficients

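Because the numeric inputs are standardised, coefficient magnitudes give a rough ranking of which nonlinear terms (such as age², bmi×smoker_yes, or children²) most influence predicted premiums, which can inform pricing and underwriting guidelines. Treat the ranking as indicative rather than exact: the one-hot flags and the polynomial products are not rescaled, and Ridge spreads weight across correlated terms.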

# Retrieve feature names after polynomial expansion
poly = gs.best_estimator_.named_steps['poly']
# Build input feature list post-preprocessing
cat_feats = gs.best_estimator_.named_steps['prep'] \
                    .named_transformers_['cat'] \
                    .get_feature_names_out(cat_cols)
input_feats = num_cols + list(cat_feats)

feat_names = poly.get_feature_names_out(input_features=input_feats)
coefs = gs.best_estimator_.named_steps['ridge'].coef_

import pandas as pd
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)

plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Premiums")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()

Summary

This Polynomial Regression workflow with Ridge regularisation delivers a robust, interpretable model to forecast health‐insurance premiums:

1. Accurately captures nonlinear risk effects (smoking, age, BMI) while controlling overfitting.

2. Balances complexity and generalisation through grid‐searched polynomial degree and ℓ² penalty.

3. Provides clear insights via top polynomial features—enabling actuaries to understand how age², interaction of BMI with smoking status, and family size shape premium recommendations.
