Insurance Premium Cost Prediction using Bayesian Regression in ML


Insurance underwriters need to forecast a customer’s annual healthcare premium before policy renewal using early‑year indicators such as age, BMI, smoking status, number of dependents, and geographic region. Premiums reflect nonlinear risk factors (e.g., smokers typically face much higher rates) and carry uncertainty from individual health variability and actuarial assumptions. A simple point‑estimate model overlooks this uncertainty and risks mispricing. Bayesian regression addresses this by producing both:

1. A point estimate of the expected premium (charges).

2. A credible interval that quantifies forecast uncertainty—enabling risk‑aware underwriting and personalised pricing.

Libraries Required

import pandas as pd                              # data loading & manipulation  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # enhanced visualization  

import pymc3 as pm                               # Bayesian modeling  
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error  

Dataset

Medical Cost Personal Dataset (insurance.csv), with the columns age, sex, bmi, children, smoker, region, and charges.

Step-by-Step Code Implementation

Import Libraries & Load Data

import pandas as pd

# Load the insurance dataset
df = pd.read_csv("data/insurance.csv")

# Preview the key columns
df.head()

Preprocessing & Train/Test Split

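  • One‑hot encoding of sex, smoker, and region creates binary flags, so the model does not treat category labels as ordered numbers. With drop_first=True, one level per variable becomes the reference category.
  • Numeric predictors (age, bmi, children) are z‑scored using training‑set statistics, so a single prior scale on β means the same thing for each of them.
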
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# One‑hot encode categorical features
df = pd.get_dummies(df, columns=['sex','smoker','region'], drop_first=True)

# Define predictors and target
feature_cols = ['age','bmi','children',
                'sex_male','smoker_yes',
                'region_northwest','region_southeast','region_southwest']
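# Cast to float: recent pandas versions return boolean dummy columns,
# which would otherwise turn the mixed array into dtype=object
X = df[feature_cols].to_numpy(dtype=float)
y = df['charges'].to_numpy(dtype=float)  # annual insurance charges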

# Split into train/test (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize numeric columns for stable MCMC
scaler = StandardScaler().fit(X_train[:, :3])  # only first three are numeric
X_train_s = X_train.copy()
X_train_s[:, :3] = scaler.transform(X_train[:, :3])
X_test_s  = X_test.copy()
X_test_s[:, :3]  = scaler.transform(X_test[:, :3])
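
Because age, bmi, and children are z‑scored, each of their coefficients is expressed in dollars per one standard deviation rather than per year or per BMI point. The following is a minimal sketch of how to convert such a coefficient back to the original units. It uses synthetic data and an ordinary least‑squares slope as a stand‑in for a posterior mean; the numbers (3000, 250, 500) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the insurance dataset)
age = rng.uniform(18, 64, size=500)
charges = 3000 + 250 * age + rng.normal(0, 500, size=500)

# z-score the predictor
mu, sd = age.mean(), age.std()
age_z = (age - mu) / sd

# Least-squares slope on the raw and on the standardised scale
slope_raw = np.polyfit(age, charges, 1)[0]   # dollars per year of age
slope_z = np.polyfit(age_z, charges, 1)[0]   # dollars per 1 SD of age

# Dividing by the standard deviation recovers the per-year effect
slope_back = slope_z / sd
print(f"per SD: {slope_z:.1f}, per year: {slope_back:.1f}")
```

With the fitted StandardScaler from the code above, the matching standard deviations would be available as scaler.scale_.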

Define & Fit Bayesian Regression Model

Priors:

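  • α ∼ Normal(0, 1000) centres the intercept at zero with a standard deviation of 1,000 dollars.
  • β ∼ Normal(0, 500) shrinks each coefficient toward zero; for the z‑scored predictors it is a prior on the change in charges per standard deviation.
  • σ ∼ HalfNormal(1000) keeps the residual scale positive.

These scales are in dollars and are narrow relative to charges that can reach tens of thousands of dollars, so they act as informative, regularising priors. Widen them (or rescale charges) if the posterior coefficients look over‑shrunk.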

Likelihood: Observed charges ∼ Normal(μ, σ), with μ = α + X·β.
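
One way to sanity‑check the priors and likelihood above is to simulate charges from them before looking at any data (a prior predictive check). The sketch below does this with plain NumPy rather than PyMC3; the feature vector x is a hypothetical policyholder invented for illustration, and 8 is the number of columns in feature_cols.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws, n_features = 1000, 8

# Draw parameters from the priors listed above
alpha = rng.normal(0, 1000, size=n_draws)
beta = rng.normal(0, 500, size=(n_draws, n_features))
sigma = np.abs(rng.normal(0, 1000, size=n_draws))   # HalfNormal(1000)

# Hypothetical policyholder: 3 standardised numerics, then 5 binary flags
x = np.array([0.5, 1.0, -0.3, 1, 1, 0, 1, 0], dtype=float)

mu = alpha + beta @ x                  # linear predictor, one value per draw
charges_sim = rng.normal(mu, sigma)    # charges simulated under the prior

lo, hi = np.percentile(charges_sim, [3, 97])
print(f"94% of prior-simulated charges lie in [{lo:,.0f}, {hi:,.0f}]")
```

If the simulated range is far narrower than the charges actually observed, or sits mostly below zero, that suggests the prior scales deserve a second look.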

Inference: 

  • We draw 2,000 posterior samples per chain (after 1,000 tuning steps) with target_accept=0.9, which makes the NUTS sampler take smaller steps and reduces divergences.
  • Posterior predictive sampling from Y_obs yields a full predictive distribution of charges rather than a single number.

import pymc3 as pm

with pm.Model() as premium_model:
    # Priors
    α = pm.Normal("α", mu=0, sigma=1000)
    β = pm.Normal("β", mu=0, sigma=500, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=1000)

    # Linear predictor
    μ = α + pm.math.dot(X_train_s, β)

    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # Sample posterior
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
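
Before using the trace, it is worth checking that the chains agree with each other. The sketch below implements the classic Gelman‑Rubin R‑hat in NumPy on synthetic chains to show the idea; the function name gelman_rubin is our own. ArviZ reports a more robust rank‑normalised variant in the r_hat column of az.summary, so treat this as an illustration rather than a replacement.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic R-hat for one parameter; chains has shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()       # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    var_hat = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)

# Four synthetic chains that all sample the same distribution
good = rng.normal(0, 1, size=(4, 2000))

# Same chains, but two of them are stuck around a different value
bad = good + np.array([[0.0], [0.0], [3.0], [3.0]])

rhat_good = gelman_rubin(good)
rhat_bad = gelman_rubin(bad)
print(f"agreeing chains: {rhat_good:.3f}, disagreeing chains: {rhat_bad:.3f}")
```

Values close to 1 indicate that the chains are sampling the same distribution; values clearly above 1 mean the sampler needs more tuning or the model needs reparameterising.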

Posterior Analysis & Point Predictions

Posterior means of α and β give point forecasts; mean absolute error (MAE) on the test set quantifies average prediction error.

import arviz as az

# Summarize posterior distributions
az.summary(trace, round_to=2)

# Posterior predictive sampling
with premium_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Extract posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values

# Compute point predictions on test set
y_pred = α_post + X_test_s.dot(β_post)

# Evaluate MAE
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
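
MAE only scores the point forecast. To judge the intervals as well, one option is to measure how often held‑out charges fall inside the 94% predictive interval. The sketch below uses synthetic stand‑ins (true_mu, y_test_demo, and pred_draws are invented here) and an equal‑tailed percentile interval for simplicity; with a real model, pred_draws would be posterior predictive draws for the test rows.

```python
import numpy as np

rng = np.random.default_rng(3)
n_test, n_draws = 1000, 2000

# Synthetic test set: a true mean per case plus noise
true_mu = rng.uniform(2000, 20000, size=n_test)
noise_sd = 3000.0
y_test_demo = rng.normal(true_mu, noise_sd)

# Stand-in for posterior predictive draws, shape (n_draws, n_test)
pred_draws = rng.normal(true_mu, noise_sd, size=(n_draws, n_test))

# Equal-tailed 94% interval for every test case
lower, upper = np.percentile(pred_draws, [3, 97], axis=0)

# Share of observed values that land inside their interval
inside = (y_test_demo >= lower) & (y_test_demo <= upper)
coverage = inside.mean()
print(f"Empirical coverage of the 94% interval: {coverage:.1%}")
```

Coverage well below the nominal level points to overconfident intervals, which for this model would often mean the Normal likelihood is a poor fit for skewed charges.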

Visualise Predictions & Credible Intervals

By varying BMI and holding other features at their medians, we plot the posterior predictive mean of charges and its 94% highest density interval (HDI). The plot shows both the central trend and the uncertainty around premium cost as BMI changes.

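import numpy as np
import matplotlib.pyplot as plt
import arviz as az

# Example: vary BMI, hold other features at their median
bmi_grid = np.linspace(X_train_s[:,1].min(), X_train_s[:,1].max(), 100)
grid = np.tile(np.median(X_train_s, axis=0), (100,1))
grid[:,1] = bmi_grid

# The model stores X_train_s as a fixed array (no pm.Data container), so
# pm.set_data cannot swap in the grid. Instead, simulate the posterior
# predictive directly from the posterior draws of α, β and σ.
post = trace.posterior
α_s = post["α"].values.reshape(-1)                  # (n_samples,)
β_s = post["β"].values.reshape(-1, grid.shape[1])   # (n_samples, n_features)
σ_s = post["σ"].values.reshape(-1)                  # (n_samples,)

rng = np.random.default_rng(42)
μ_grid = α_s[:, None] + β_s @ grid.T                # (n_samples, 100)
preds  = rng.normal(μ_grid, σ_s[:, None])           # predictive draws

mean_pred = preds.mean(axis=0)
hpd       = az.hdi(preds, hdi_prob=0.94)            # (100, 2): lower, upper

# Convert BMI back to original scale
bmi_orig = scaler.inverse_transform(grid[:, :3])[:, 1]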

plt.figure(figsize=(8,5))
plt.plot(bmi_orig, mean_pred, label="Posterior mean")
plt.fill_between(bmi_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    scaler.inverse_transform(X_test_s[:, :3])[:,1],
    y_test,
    color="k", alpha=0.5,
    label="Test data"
)
plt.xlabel("BMI")
plt.ylabel("Insurance Charges (USD)")
plt.title("Bayesian Regression: Charges vs. BMI")
plt.legend()
plt.tight_layout()
plt.show()
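
The shaded band above is a highest density interval: the narrowest interval that contains the requested share of the samples. For skewed quantities it can differ noticeably from the more familiar equal‑tailed percentile interval. The sketch below computes both on synthetic right‑skewed samples; hdi_1d is our own helper written for illustration, not a library function.

```python
import numpy as np

def hdi_1d(samples, prob=0.94):
    """Narrowest interval containing roughly `prob` of a 1-D sample."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.floor(prob * n))       # index offset between interval ends
    widths = x[k:] - x[:n - k]        # width of every candidate interval
    i = int(np.argmin(widths))        # start of the narrowest one
    return x[i], x[i + k]

rng = np.random.default_rng(4)
samples = rng.lognormal(mean=0.0, sigma=1.0, size=20000)  # right-skewed

hdi_lo, hdi_hi = hdi_1d(samples, prob=0.94)
et_lo, et_hi = np.percentile(samples, [3, 97])            # equal-tailed

print(f"HDI:          [{hdi_lo:.2f}, {hdi_hi:.2f}]")
print(f"Equal-tailed: [{et_lo:.2f}, {et_hi:.2f}]")
```

For symmetric distributions the two intervals nearly coincide; the difference matters mainly when the predictive distribution is skewed.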

Summary

This Bayesian Regression workflow for insurance premium forecasting provides:

1. Point estimates of expected annual charges from policyholder features.

2. Credible intervals that quantify uncertainty in these forecasts—crucial for risk‑aware underwriting.

3. Actionable insights: insurers can tailor premiums and promotional offers and see which risk factors drive cost the most, while accounting for prediction uncertainty in pricing and capital allocation.


ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.
