Insurance Premium Prediction using Linear Regression in ML


Health‑insurance firms and actuarial teams must set fair yet profitable premiums long before the first bill lands in the mailbox. A transparent model that links personal factors—such as age, body mass index (BMI), family size, smoking status, and region—to the annual insurance charge (in USD) helps underwriters price risk, regulators identify bias, and consumers understand what drives their quote.

Here, we develop a simple linear regression baseline that predicts an individual’s premium using only readily available enrollment data. While modern insurers often rely on ensemble models, beginning with a linear fit reveals first‑order cost drivers and provides a solid benchmark for any future refinement.

Libraries Required

  • pandas # data wrangling
  • numpy # numerical helpers
  • matplotlib.pyplot # quick visual checks
  • seaborn # optional correlation plots
  • scikit‑learn # preprocessing, model, metrics
  • joblib # persist the pipeline

Dataset Link

Medical Cost Personal Datasets (Kaggle)

Step-by-Step Code Implementation

Why linear regression first? Premium tables have historically applied additive surcharges (smoking adds $X, each additional child adds $Y), so a straight‑line model mirrors that logic and exposes each driver’s contribution in dollars.
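To make the additive logic concrete, here is a minimal sketch with made-up numbers. The base rate, both surcharges, and the helper name additive_premium are hypothetical and are not estimates from the dataset.

```python
# Hypothetical additive premium table: every value here is illustrative only.
BASE_PREMIUM = 3000.0          # assumed base annual charge in USD
SMOKER_SURCHARGE = 20000.0     # assumed flat add-on for smokers
PER_CHILD_SURCHARGE = 500.0    # assumed add-on per dependent child


def additive_premium(smoker: bool, children: int) -> float:
    """Sum the base rate and each surcharge, the same shape a linear model has."""
    return (BASE_PREMIUM
            + SMOKER_SURCHARGE * int(smoker)
            + PER_CHILD_SURCHARGE * children)


quote = additive_premium(smoker=True, children=2)
print(f"${quote:,.0f}")
```

A fitted linear regression has the same structure: an intercept plays the role of the base rate, and each coefficient plays the role of a surcharge.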

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the data

Download the CSV from Kaggle and point to the local path.

df = pd.read_csv("insurance.csv")
print(df.head())
# columns: age, sex, bmi, children, smoker, region, charges

3. Basic exploration & sanity checks

df.info()                 # dtypes and non-null counts (prints directly, returns None)
print(df.isna().sum())    # explicit missing-value count per column
sns.pairplot(df[['age','bmi','children','charges']])
plt.show()
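The sanity checks can be rehearsed without the CSV. The sketch below builds a tiny stand-in frame with the same column names; all values are invented for illustration, and one BMI gap is injected on purpose so the missing-value count has something to find.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with the same columns as insurance.csv (values are made up).
sample = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, np.nan, 30.1, 24.3],   # one gap injected on purpose
    "children": [0, 1, 3, 2],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "charges": [17000.0, 4500.0, 8200.0, 27800.0],
})

missing_per_column = sample.isna().sum()          # NaN count per column
smoker_counts = sample["smoker"].value_counts()   # balance of the smoker flag

print(missing_per_column)
print(smoker_counts)
```

On the real file, a nonzero count in missing_per_column would call for imputation or row removal before fitting.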

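4. Define feature groups for encoding and scaling

  • One-hot encoding converts sex, smoker, and the four US regions into binary flags, which prevents the model from reading a spurious numeric order into the categories.
  • Standard scaling puts age, BMI, and children count on a common scale (mean 0, standard deviation 1), so their coefficient magnitudes can be compared with one another. Each numeric coefficient then measures dollars per standard deviation rather than dollars per raw unit.

num_cols = ['age', 'bmi', 'children']          # numeric predictors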
cat_cols = ['sex', 'smoker', 'region']         # categorical predictors
target   = 'charges'

5. Pre‑processing & model pipeline

# one‑hot encode categories, scale numeric columns
preprocessor = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                       num_cols)
])

linreg = LinearRegression()

pipe = Pipeline([
        ('prep',   preprocessor),
        ('model',  linreg)
])

6. Train‑test split & training

X = df[num_cols + cat_cols]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

pipe.fit(X_train, y_train)

7. Performance metrics

R² tells us what share of the variance in charges the six predictors explain on the held-out test set, while MAE (mean absolute error, in dollars) shows the typical gap between the predicted and the actual charge, a figure actuaries can read directly.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f}")
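For readers who want to see what the two metrics compute, the sketch below reproduces them with plain NumPy on four invented charge values. It should agree with r2_score and mean_absolute_error for the same inputs.

```python
import numpy as np

# Invented actual and predicted charges, in USD.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_hat = np.array([1100.0, 1900.0, 3200.0, 3800.0])

# MAE: average size of the miss, ignoring its direction.
mae = np.mean(np.abs(y_true - y_hat))

# R^2: 1 minus (squared error of the model / squared error of predicting the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE : ${mae:,.0f}")
print(f"R²  : {r2:.3f}")
```

Here the misses are 100, 100, 200, and 200 dollars, so the MAE is 150, and the model leaves 2% of the variance unexplained, giving an R² of 0.98.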

8. Inspect coefficients (cost drivers)

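The coefficient table surfaces the headline insights: the smoker flag usually dwarfs every other factor, a higher BMI adds cost in a roughly linear manner, and some regions trend cheaper or pricier than others. Two caveats apply when reading the numbers. Because every category level gets its own flag, the effect of a category is the difference between its flags' coefficients (for example, smoker_yes minus smoker_no), not either coefficient alone. And because the numeric inputs were standardized, their coefficients are in dollars per standard deviation, not dollars per year of age or per BMI point.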

# recover feature names from the encoder
ohe_feats = pipe.named_steps['prep']\
                .named_transformers_['cat']\
                .get_feature_names_out(cat_cols)

feature_names = list(ohe_feats) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_, index=feature_names)\
          .sort_values()

# 11 features in total, so show the five lowest and five highest coefficients
print("\nLowest coefficients (pull the predicted premium down):")
print(coefs.head(5))
print("\nHighest coefficients (push the predicted premium up):")
print(coefs.tail(5))
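One way to read coefficients directly as dollar surcharges is to drop a reference level in the encoder and to divide each numeric coefficient by the scaler's standard deviation. The sketch below is a variant of the tutorial's pipeline, not the pipeline itself: it uses OneHotEncoder(drop="first") and synthetic, noise-free data with an assumed ground truth of $250 per year of age and a $20,000 smoker surcharge, so the recovered values can be checked.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with a known, assumed ground truth (not the Kaggle file).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker": rng.choice(["no", "yes"], size=n),
})
demo["charges"] = (1000.0
                   + 250.0 * demo["age"]
                   + 20000.0 * (demo["smoker"] == "yes"))

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        # drop="first" removes the 'no' level, leaving one smoker_yes flag
        ("cat", OneHotEncoder(drop="first"), ["smoker"]),
        ("num", StandardScaler(), ["age"]),
    ])),
    ("model", LinearRegression()),
])
demo_pipe.fit(demo[["age", "smoker"]], demo["charges"])

prep = demo_pipe.named_steps["prep"]
coef = demo_pipe.named_steps["model"].coef_   # order: cat columns, then num

smoker_surcharge = coef[0]                    # dollars relative to non-smokers
age_per_sd = coef[1]                          # dollars per standard deviation of age
age_per_year = age_per_sd / prep.named_transformers_["num"].scale_[0]

print(f"smoker surcharge : ${smoker_surcharge:,.0f}")
print(f"per year of age  : ${age_per_year:,.0f}")
```

The trade-off is that drop="first" makes every categorical coefficient relative to the dropped level, so the choice of reference level should be stated whenever the table is shared.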

9. Pipeline persistence

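Saving the fitted pipeline bundles the preprocessing steps and the coefficients into a single .pkl file. A web form can later call joblib.load and serve on‑the‑spot quotes with exactly the transformations used in training, provided the serving environment runs a compatible scikit‑learn version.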

joblib.dump(pipe, "insurance_premium_linreg.pkl")
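The round trip can be sketched end to end as follows. Everything here is a stand-in: the six training rows are invented, the file name toy_premium_linreg.pkl is hypothetical, and the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented training rows with the same columns as insurance.csv.
train = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61],
    "bmi": [27.9, 22.7, 30.1, 24.3, 33.0, 29.1],
    "children": [0, 1, 3, 2, 0, 1],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "region": ["southwest", "southeast", "northwest",
               "northeast", "southeast", "northwest"],
    "charges": [17000.0, 4500.0, 8200.0, 27800.0, 3900.0, 30500.0],
})
num_cols = ["age", "bmi", "children"]
cat_cols = ["sex", "smoker", "region"]

toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])),
    ("model", LinearRegression()),
])
toy_pipe.fit(train[num_cols + cat_cols], train["charges"])

# A new applicant arrives as a one-row frame with the same column names.
new_applicant = pd.DataFrame([{
    "age": 40, "bmi": 26.5, "children": 2,
    "sex": "female", "smoker": "no", "region": "northeast",
}])
quote_before = float(toy_pipe.predict(new_applicant)[0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_premium_linreg.pkl")  # hypothetical name
    joblib.dump(toy_pipe, path)
    loaded_pipe = joblib.load(path)
    quote_loaded = float(loaded_pipe.predict(new_applicant)[0])

print(f"quote from loaded pipeline: ${quote_loaded:,.2f}")
```

Because the loaded object carries the encoder and scaler along with the model, the caller passes raw form fields and never repeats the preprocessing by hand.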

Summary

This concise workflow converts a public enrollment file into an explainable insurance premium predictor. In a handful of lines, we:

  • Encoded categorical attributes and scaled the numeric ones.
  • Fit and vetted a linear regression.
  • Quantified how each personal trait nudges the annual charge.

The resulting model serves three key roles: as a deployable baseline for instant quotes, a transparent check on fairness, and a benchmark against which more sophisticated algorithms must demonstrate their worth.


ProjectGurukul Team
