Insurance Premium Cost Prediction using Bayesian Regression in ML
Insurance underwriters need to forecast a customer's annual healthcare premium before policy renewal using early-year indicators such as age, BMI, smoking status, number of dependents, and geographic region. Premiums reflect nonlinear risk factors (e.g., smokers face markedly higher rates) and carry uncertainty from individual health variability and actuarial assumptions. A model that returns only a point estimate overlooks this uncertainty and risks mispricing. Bayesian regression addresses this by producing both:
1. A point estimate of the expected premium (charges).
2. A credible interval that quantifies forecast uncertainty—enabling risk‑aware underwriting and personalised pricing.
Libraries Required
import pandas as pd                      # data loading & manipulation
import numpy as np                       # numerical operations
import matplotlib.pyplot as plt          # plotting
import seaborn as sns                    # enhanced visualization
import pymc3 as pm                       # Bayesian modeling
import arviz as az                       # posterior analysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
Dataset
Step-by-Step Code Implementation
Import Libraries & Load Data
import pandas as pd
# Load the insurance dataset
df = pd.read_csv("data/insurance.csv")
# Preview the key columns
df.head()
Preprocessing & Train/Test Split
- One‑hot encoding of sex, smoker, and region creates binary flags, avoiding ordinal misinterpretation.
- Numeric predictors (age, bmi, children) are z‑scored so the priors on β operate uniformly.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# One‑hot encode categorical features
df = pd.get_dummies(df, columns=['sex','smoker','region'], drop_first=True)
# Define predictors and target
feature_cols = ['age','bmi','children',
'sex_male','smoker_yes',
'region_northwest','region_southeast','region_southwest']
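# Cast to float: recent pandas versions return boolean dummy columns,
# which would otherwise turn .values into an object array
X = df[feature_cols].values.astype(float)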
y = df['charges'].values # annual insurance charges
# Split into train/test (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Standardize numeric columns for stable MCMC
scaler = StandardScaler().fit(X_train[:, :3]) # only first three are numeric
X_train_s = X_train.copy()
X_train_s[:, :3] = scaler.transform(X_train[:, :3])
X_test_s = X_test.copy()
X_test_s[:, :3] = scaler.transform(X_test[:, :3])
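To make the scaling step concrete, here is a minimal NumPy sketch of what z-scoring with training-set statistics amounts to. The helper names zscore_fit and zscore_apply and the toy rows are hypothetical, introduced only for illustration; the tutorial itself uses StandardScaler, which also divides by the population standard deviation (ddof=0).

```python
import numpy as np

def zscore_fit(X_num):
    """Return column means and population standard deviations (ddof=0)."""
    return X_num.mean(axis=0), X_num.std(axis=0)

def zscore_apply(X_num, mean, std):
    """Scale columns using statistics learned elsewhere (e.g. on the training set)."""
    return (X_num - mean) / std

# Hypothetical toy rows with columns: age, bmi, children
train_num = np.array([[19, 27.9, 0], [45, 30.1, 2], [62, 24.3, 1]], dtype=float)
test_num = np.array([[33, 22.7, 3]], dtype=float)

mean, std = zscore_fit(train_num)
train_z = zscore_apply(train_num, mean, std)
test_z = zscore_apply(test_num, mean, std)  # reuses the training statistics

print(train_z.round(2))
print(test_z.round(2))
```

Fitting on the training rows only and reusing those statistics on the test rows keeps information about the test set out of the model inputs.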
Define & Fit Bayesian Regression Model
Priors:
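- α ∼ Normal(0, 1000) centres the intercept at zero with a standard deviation of 1,000 USD.
- β ∼ Normal(0, 500) gives each coefficient a standard deviation of 500 USD (per standard deviation for the numeric predictors, per category switch for the dummy variables).
- σ ∼ HalfNormal(1000) constrains the residual scale to be positive.
Because charges are modelled in raw dollars, these scales are informative rather than vague. If the posterior looks pulled toward zero, widen the priors or rescale the target.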
Likelihood: Observed charges ∼ Normal(μ, σ), with μ = α + X·β.
Inference:
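- We draw 2,000 posterior samples per chain (after 1,000 tuning steps), with target_accept=0.9 so that NUTS takes smaller steps and produces fewer divergences.
- Posterior predictive sampling from Y_obs yields a predictive distribution for the training inputs; predictions for new inputs are formed from the posterior draws of α, β, and σ.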
import pymc3 as pm
with pm.Model() as premium_model:
    # Priors
    α = pm.Normal("α", mu=0, sigma=1000)
    β = pm.Normal("β", mu=0, sigma=500, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=1000)
    # Linear predictor
    μ = α + pm.math.dot(X_train_s, β)
    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)
    # Sample posterior
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
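A quick way to see what the stated priors imply is a prior predictive simulation. The sketch below uses only NumPy rather than PyMC3, and the customer vector x is a hypothetical example (numeric features one standard deviation above average, male, smoker, southeast), so treat it as an illustration under those assumptions rather than part of the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_features = 2000, 8

# Draws from the priors described above
alpha = rng.normal(0.0, 1000.0, size=n_draws)
beta = rng.normal(0.0, 500.0, size=(n_draws, n_features))
sigma = np.abs(rng.normal(0.0, 1000.0, size=n_draws))  # HalfNormal(1000)

# Hypothetical customer in the feature order used above:
# age, bmi, children (all +1 SD), sex_male, smoker_yes, northwest, southeast, southwest
x = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0])

mu = alpha + beta @ x              # prior draws of the expected charge
charges_sim = rng.normal(mu, sigma)  # add observation noise

lo, hi = np.percentile(charges_sim, [3, 97])
print(f"94% of prior-simulated charges fall in [{lo:,.0f}, {hi:,.0f}]")
```

If the observed charges in your data sit far outside the printed range, the priors are tighter than intended and are worth widening before sampling.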
Posterior Analysis & Point Predictions
Posterior means of α and β give point forecasts; mean absolute error (MAE) on the test set quantifies average prediction error.
import arviz as az
# Summarize posterior distributions
az.summary(trace, round_to=2)
# Posterior predictive sampling
with premium_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Extract posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
# Compute point predictions on test set
y_pred = α_post + X_test_s.dot(β_post)
# Evaluate MAE
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
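The MAE scores only the point forecasts. To attach an interval to each test customer, one option is to push every posterior draw through the linear predictor and add observation noise. The sketch below is a NumPy illustration under stated assumptions: predictive_interval and coverage are hypothetical helper names, the draws are synthetic stand-ins for the real posterior, and the interval is equal-tailed (percentile-based) rather than a highest-density interval.

```python
import numpy as np

def predictive_interval(alpha_s, beta_s, sigma_s, X, prob=0.94, seed=0):
    """Equal-tailed posterior predictive interval for each row of X.

    alpha_s: (S,), beta_s: (S, P), sigma_s: (S,) posterior draws; X: (N, P).
    Returns (lower, upper), each of shape (N,).
    """
    rng = np.random.default_rng(seed)
    mu = alpha_s[:, None] + beta_s @ X.T        # (S, N) expected charge per draw
    y_rep = rng.normal(mu, sigma_s[:, None])    # add observation noise
    tail = 100 * (1 - prob) / 2
    lower, upper = np.percentile(y_rep, [tail, 100 - tail], axis=0)
    return lower, upper

def coverage(y_true, lower, upper):
    """Fraction of observed values that fall inside their interval."""
    return np.mean((y_true >= lower) & (y_true <= upper))

# Synthetic stand-ins for posterior draws and test data (assumed values)
rng = np.random.default_rng(1)
S, P, N = 500, 3, 50
alpha_s = rng.normal(10.0, 0.1, size=S)
beta_s = rng.normal([1.0, 2.0, 3.0], 0.1, size=(S, P))
sigma_s = np.abs(rng.normal(1.0, 0.05, size=S))
X_new = rng.normal(size=(N, P))
y_new = 10.0 + X_new @ np.array([1.0, 2.0, 3.0]) + rng.normal(0.0, 1.0, size=N)

lower, upper = predictive_interval(alpha_s, beta_s, sigma_s, X_new)
print(f"Empirical coverage: {coverage(y_new, lower, upper):.2f}")
```

With the fitted model, the draws could be taken from trace.posterior, for example trace.posterior["β"].values.reshape(-1, X_test_s.shape[1]). Coverage well below the nominal 94% on the test set would suggest the intervals are too narrow.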
Visualise Predictions & Credible Intervals
By varying BMI and holding the other features at their training-set medians, we plot the posterior predictive mean charge and its 94% highest-density interval (HDI), showing both the central trend and the uncertainty around premium cost as BMI changes.
import numpy as np
import matplotlib.pyplot as plt
# Example: vary BMI, hold other features at their median
bmi_grid = np.linspace(X_train_s[:,1].min(), X_train_s[:,1].max(), 100)
grid = np.tile(np.median(X_train_s, axis=0), (100,1))
grid[:,1] = bmi_grid
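# Predict on the grid from the posterior draws
# (the model has no data container named "X" that could be swapped out)
post = trace.posterior
α_s = post["α"].values.reshape(-1)                  # (S,)
β_s = post["β"].values.reshape(-1, grid.shape[1])   # (S, 8)
σ_s = post["σ"].values.reshape(-1)                  # (S,)
μ_grid = α_s[:, None] + β_s @ grid.T                # (S, 100)
preds = np.random.default_rng(42).normal(μ_grid, σ_s[:, None])  # add observation noise
mean_pred = preds.mean(axis=0)
# Leading chain axis so ArviZ reads the array as (chain, draw, grid point)
hpd = az.hdi(preds[np.newaxis, ...], hdi_prob=0.94)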
# Convert BMI back to original scale
bmi_orig = scaler.inverse_transform(
np.column_stack([grid[:,0], grid[:,1], grid[:,2]])
)[:,1]
plt.figure(figsize=(8,5))
plt.plot(bmi_orig, mean_pred, label="Posterior mean")
plt.fill_between(bmi_orig, hpd[:,0], hpd[:,1], alpha=0.3,
label="94% credible interval")
plt.scatter(
scaler.inverse_transform(X_test_s[:, :3])[:,1],
y_test,
color="k", alpha=0.5,
label="Test data"
)
plt.xlabel("BMI")
plt.ylabel("Insurance Charges (USD)")
plt.title("Bayesian Regression: Charges vs. BMI")
plt.legend()
plt.tight_layout()
plt.show()
Summary
This Bayesian Regression workflow for insurance premium forecasting provides:
1. Point estimates of expected annual charges from policyholder features.
2. Credible intervals that quantify uncertainty in these forecasts—crucial for risk‑aware underwriting.
3. Actionable insights: insurers can tailor premiums and promotional offers and see which risk factors drive cost the most, while accounting for prediction uncertainty in pricing and capital allocation.
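As a rough illustration of how the coefficient draws could be used to see which factors matter most, the sketch below ranks features by the absolute posterior mean and reports an equal-tailed interval for each. The helper rank_effects and the synthetic draws are hypothetical. Magnitudes are only loosely comparable, because numeric coefficients are per standard deviation while dummy coefficients are per category switch.

```python
import numpy as np

def rank_effects(beta_s, names, prob=0.94):
    """Rank features by |posterior mean| and attach an equal-tailed interval.

    beta_s: (S, P) posterior draws of the coefficients; names: list of P labels.
    Returns a list of (name, mean, lower, upper), largest effect first.
    """
    tail = 100 * (1 - prob) / 2
    mean = beta_s.mean(axis=0)
    lower, upper = np.percentile(beta_s, [tail, 100 - tail], axis=0)
    order = np.argsort(-np.abs(mean))
    return [(names[i], mean[i], lower[i], upper[i]) for i in order]

# Synthetic draws standing in for the real posterior (assumed values)
rng = np.random.default_rng(2)
names = ["age", "bmi", "smoker_yes"]
beta_s = rng.normal([3.0, 1.0, 20.0], 0.5, size=(1000, 3))

ranked = rank_effects(beta_s, names)
for name, m, lo, hi in ranked:
    print(f"{name:>12}: mean={m:7.2f}  94% interval=[{lo:7.2f}, {hi:7.2f}]")
```

An interval that excludes zero suggests the direction of that effect is well supported by the data, given the model and priors.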