Health Insurance Premium Prediction using Polynomial Regression in ML
Actuarial teams and insurers need to predict individual annual health-insurance premiums (USD) from applicant features known at underwriting: age, body mass index (BMI), number of dependents, smoking status, region, and past claims count. The code below uses the first five of these. The relationship between these variables and the required premium is nonlinear: premiums rise sharply with age and BMI beyond risk thresholds, smokers incur disproportionate costs, and dependents interact with family plan discounts. A naïve linear model underfits these risk curves, while an unregularised polynomial overfits outliers. Polynomial Regression on the preprocessed features with ℓ² regularisation (Ridge) can capture smooth curvature in the risk load while keeping the premium forecasts interpretable.
Libraries Required
import pandas as pd                 # data handling
import numpy as np                  # numerical operations
import matplotlib.pyplot as plt     # plotting
import seaborn as sns               # visualization
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
Dataset
Step-by-Step Code Implementation
Load Data & Inspect
import pandas as pd
# Load the CSV
df = pd.read_csv("data/insurance.csv")
# Preview relevant columns
df[['age','bmi','children','smoker','region','charges']].head()
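Before modelling, it can help to check column types and missing values. The following is a minimal sketch, not part of the original workflow: `summarise_columns` is a hypothetical helper, and the small `demo` frame uses made-up values standing in for the real CSV.

```python
import pandas as pd

def summarise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),   # NaN is not counted as a distinct value
    })

# Tiny hypothetical frame (values are invented for illustration only)
demo = pd.DataFrame({
    "age": [19, 33, 46, None],
    "bmi": [27.9, 22.7, 33.4, 30.1],
    "smoker": ["yes", "no", "no", "yes"],
})

summary = summarise_columns(demo)
print(summary)
```

On the real data, calling `summarise_columns(df)` would flag any column that needs imputation before it reaches the pipeline.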
Target Engineering & EDA
import seaborn as sns
import matplotlib.pyplot as plt
# We'll treat 'charges' as the baseline premium
# Visualise nonlinear trend: age vs charges
sns.scatterplot(x='age', y='charges', hue='smoker', data=df, alpha=0.6)
plt.title("Age vs Medical Charges by Smoking Status")
plt.xlabel("Age")
plt.ylabel("Annual Charges (USD)")
plt.show()
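The scatter plot can be complemented with a numeric view of the same trend. The sketch below averages charges by age band and smoking status; the band edges and the `demo` values are assumptions chosen for illustration, not figures from the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the insurance data
demo = pd.DataFrame({
    "age": [20, 25, 40, 45, 60, 62],
    "smoker": ["no", "yes", "no", "yes", "no", "yes"],
    "charges": [2000.0, 15000.0, 6000.0, 22000.0, 13000.0, 30000.0],
})

# Assumed age bands: (17, 35], (35, 50], (50, 65]
demo["age_band"] = pd.cut(
    demo["age"], bins=[17, 35, 50, 65], labels=["18-35", "36-50", "51-65"]
)

# Mean charges per age band, with one column per smoking status
band_means = (
    demo.groupby(["age_band", "smoker"], observed=True)["charges"]
        .mean()
        .unstack("smoker")
)
print(band_means)
```

If the gap between the smoker and non-smoker columns widens or the increments between bands grow with age, that is a hint that interaction and squared terms are worth including.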
Define Features & Target
The inputs are three numeric columns (age, bmi, children) and two categorical columns (smoker, region); the target is charges, used here as the premium.
# Categorical: smoker, region
# Numeric: age, bmi, children
cat_cols = ['smoker','region']
num_cols = ['age','bmi','children']
X = df[num_cols + cat_cols]
y = df['charges']  # use charges as predicted premium
Build Polynomial Regression Pipeline
- StandardScaler z-scores age, BMI, and children, so the numeric features enter the model on a comparable scale.
- OneHotEncoder converts smoker and region into binary flags, dropping the first level of each to avoid redundant columns.
- PolynomialFeatures expands the inputs into squares and interactions (e.g., age², bmi×smoker_yes), capturing curvature and risk multipliers.
- Ridge Regression applies an ℓ² penalty to shrink high-order coefficients, limiting overfitting in the expanded feature space.
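The shrinkage effect of the ℓ² penalty can be seen in isolation on a toy problem. This is a rough sketch on synthetic data (the quadratic signal, noise level, and alpha values are all assumptions), showing that the coefficient norm of a degree-5 fit falls as alpha grows.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Synthetic one-feature data with a quadratic signal plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(40, 1))
y = 1.0 + 2.0 * x[:, 0] ** 2 + rng.normal(scale=0.2, size=40)

# Deliberately over-flexible expansion (degree 5)
X_poly = PolynomialFeatures(degree=5, include_bias=False).fit_transform(x)

norms = {}
for alpha in [1e-3, 1.0, 1e3]:
    model = Ridge(alpha=alpha).fit(X_poly, y)
    # L2 norm of the fitted coefficients (intercept excluded)
    norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:g}  ||coef||={norms[alpha]:.4f}")
```

A larger alpha trades some fit on the training data for smaller, more stable coefficients, which is why alpha is tuned together with the polynomial degree.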
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
# Preprocess: scale numeric, one-hot encode categories
preprocessor = ColumnTransformer([
('num', StandardScaler(), num_cols),
('cat', OneHotEncoder(drop='first'), cat_cols)
])
pipe = Pipeline([
('prep', preprocessor),
('poly', PolynomialFeatures(include_bias=False)),
('ridge', Ridge(random_state=42))
])
Train/Test Split & Hyperparameter Search
- degree: controls polynomial complexity (1 = linear, 2 = quadratic, 3 = cubic).
- alpha: ℓ² regularisation strength from 10⁻³ to 10³.
- GridSearchCV with 5‑fold CV selects the combination minimising RMSE.
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
'poly__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
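Beyond the single best combination, the full grid of cross-validated scores shows how sensitive the model is to each hyperparameter. The following sketch runs a small grid on synthetic data (the data, the `demo_` names, and the reduced alpha list are assumptions) and pivots `cv_results_` into a degree-by-alpha RMSE table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic effect in the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 2))
y_demo = 3.0 * X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.3, size=120)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])
demo_grid = {"poly__degree": [1, 2, 3], "ridge__alpha": [0.01, 1.0, 100.0]}

demo_gs = GridSearchCV(
    demo_pipe, demo_grid, cv=5, scoring="neg_root_mean_squared_error"
)
demo_gs.fit(X_demo, y_demo)

# Scores are negated RMSE, so flip the sign before tabulating
results = pd.DataFrame(demo_gs.cv_results_)
results["mean_rmse"] = -results["mean_test_score"]
table = results.pivot(
    index="param_poly__degree", columns="param_ridge__alpha", values="mean_rmse"
)
print(table.round(3))
```

The same pivot applied to `gs.cv_results_` from the tutorial would show whether the chosen degree wins clearly or only marginally over its neighbours.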
Evaluate Model
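import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
y_pred = gs.predict(X_test)
# Take the square root explicitly: the squared=False argument was deprecated in scikit-learn 1.4 and later removed
rmse = np.sqrt(mean_squared_error(y_test, y_pred))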
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.2f}")
print(f"Test R² : {r2:.3f}")
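RMSE and R² are easier to judge next to a reference point. As a sketch, the hypothetical helper below also reports MAE and the RMSE of a constant prediction; `regression_report` and its `baseline_value` argument are assumptions for illustration, with the baseline typically set to the training-set mean.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred, baseline_value):
    """Return RMSE, MAE, R², and the RMSE of always predicting baseline_value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    constant = np.full_like(y_true, baseline_value)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
        "baseline_rmse": float(np.sqrt(mean_squared_error(y_true, constant))),
    }

# Toy numbers chosen so the metrics are easy to verify by hand
report = regression_report([1.0, 2.0, 3.0], [1.0, 2.0, 4.0], baseline_value=2.0)
print(report)
```

In the tutorial's setting this could be called as `regression_report(y_test, y_pred, y_train.mean())`; a model RMSE well below the baseline RMSE indicates the features carry real signal.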
Inspect Key Polynomial Coefficients
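Coefficient magnitudes indicate which nonlinear terms (such as age², bmi×smoker_yes, or children²) most influence predicted premiums, which can inform pricing and underwriting guidelines. Because the numeric inputs are standardised before expansion, the magnitudes are roughly comparable across terms, but they are best read as a ranking aid rather than as exact effect sizes.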
# Retrieve feature names after polynomial expansion
poly = gs.best_estimator_.named_steps['poly']
# Build input feature list post-preprocessing
cat_feats = gs.best_estimator_.named_steps['prep'] \
.named_transformers_['cat'] \
.get_feature_names_out(cat_cols)
input_feats = num_cols + list(cat_feats)
feat_names = poly.get_feature_names_out(input_features=input_feats)
coefs = gs.best_estimator_.named_steps['ridge'].coef_
import pandas as pd
# Rank terms by absolute coefficient size (.abs() discards the sign)
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Premiums")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
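The bar chart ranks terms by absolute size, which hides whether a term raises or lowers the prediction. A small sketch of one way to keep the sign while still ranking by magnitude follows; `top_signed_coefficients` is a hypothetical helper and the example names and values are invented.

```python
import pandas as pd

def top_signed_coefficients(names, coefs, k=10):
    """Return the k largest coefficients by absolute value, keeping their signs."""
    series = pd.Series(list(coefs), index=list(names))
    # Order labels by magnitude, then look up the signed values
    order = series.abs().sort_values(ascending=False).index[:k]
    return series.loc[order]

# Invented coefficients for illustration only
top = top_signed_coefficients(
    ["age", "age^2", "bmi smoker_yes"], [0.5, -2.0, 3.0], k=2
)
print(top)
```

With the fitted model, `top_signed_coefficients(feat_names, coefs)` would return the same top terms as the chart, with positive values marking terms that push premiums up.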
Summary
This Polynomial Regression workflow with Ridge regularisation delivers a robust, interpretable model to forecast health‐insurance premiums:
1. Accurately captures nonlinear risk effects (smoking, age, BMI) while controlling overfitting.
2. Balances complexity and generalisation through grid‐searched polynomial degree and ℓ² penalty.
3. Provides insight via the top polynomial features, helping actuaries see how age², the interaction of BMI with smoking status, and family size shape premium recommendations.