Insurance Premium Prediction using Linear Regression in ML
Health‑insurance firms and actuarial teams must set fair yet profitable premiums long before the first bill lands in the mailbox. A transparent model that links personal factors—such as age, body mass index (BMI), family size, smoking status, and region—to the annual insurance charge (in USD) helps underwriters price risk, regulators identify bias, and consumers understand what drives their quote.
Here, we develop a simple linear regression baseline that predicts an individual’s premium using only readily available enrollment data. While modern insurers often rely on ensemble models, beginning with a linear fit reveals first‑order cost drivers and provides a solid benchmark for any future refinement.
Libraries Required
- pandas # data wrangling
- numpy # numerical helpers
- matplotlib.pyplot # quick visual checks
- seaborn # optional correlation plots
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the pipeline
Dataset Link
Medical Cost Personal Datasets
Step-by-Step Code Implementation
Why linear regression first? Premium tables have historically applied additive surcharges (smoking adds $X, each extra child adds $Y), so a straight-line model mirrors that logic and surfaces each driver's dollar weight.
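Written out, the additive logic looks roughly like the formula below. This is a sketch for intuition only: the symbols $\beta$ are notation introduced here (not from the dataset), and the formula is stated on raw inputs, whereas the pipeline later in this article standardises the numeric columns before fitting.

```latex
\[
\widehat{\text{charges}}
  = \beta_0
  + \beta_{\text{age}}\,\text{age}
  + \beta_{\text{bmi}}\,\text{bmi}
  + \beta_{\text{children}}\,\text{children}
  + \sum_{s}\beta_{s}\,\mathbb{1}[\text{sex}=s]
  + \sum_{k}\beta_{k}\,\mathbb{1}[\text{smoker}=k]
  + \sum_{r}\beta_{r}\,\mathbb{1}[\text{region}=r]
\]
```

Each indicator $\mathbb{1}[\cdot]$ is 1 when the condition holds and 0 otherwise, so every categorical level contributes a fixed dollar amount, much like a surcharge line in a premium table.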
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
Download the CSV from Kaggle and point to the local path.
df = pd.read_csv("insurance.csv")
print(df.head())
# columns: age, sex, bmi, children, smoker, region, charges
3. Basic exploration & sanity checks
df.info()  # prints dtypes and non-null counts itself; look for missing values
sns.pairplot(df[['age', 'bmi', 'children', 'charges']])
plt.show()
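A couple of extra checks can complement the pair plot: counting missing values per column and comparing charges across a categorical split. The sketch below runs on a tiny hand-made table that only mimics the insurance.csv schema; every value in it is hypothetical, and on the real data you would call the same methods on df.

```python
import pandas as pd

# Hypothetical rows that mimic the insurance.csv columns; all values are made up.
toy = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, None, 30.8],  # one deliberately missing BMI
    "children": [0, 1, 2, 3],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "charges": [16000.0, 4500.0, 8200.0, 30000.0],
})

# Missing values per column: anything above zero needs a decision before modelling.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Class balance of a categorical predictor.
smoker_counts = toy["smoker"].value_counts()
print(smoker_counts)

# Median charge per group gives a first hint of which factors matter.
median_by_smoker = toy.groupby("smoker")["charges"].median()
print(median_by_smoker)
```

If a column does show missing values, that is worth resolving (dropping rows or imputing) before fitting, since LinearRegression does not accept NaN inputs.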
4. Feature groups: encoding and scaling
- One-hot encoding converts sex, smoker, and four US regions into neutral binary flags, thereby preventing spurious numeric ordering.
- Standard scaling on numeric inputs puts age, BMI, and children count on a comparable footing, so each numeric coefficient reads as the dollar change per one standard deviation of that input and the magnitudes can be compared directly.
num_cols = ['age', 'bmi', 'children']   # numeric predictors
cat_cols = ['sex', 'smoker', 'region']  # categorical predictors
target = 'charges'
5. Pre‑processing & model pipeline
# one‑hot encode categories, scale numeric columns
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline([
    ('prep', preprocessor),
    ('model', linreg)
])
6. Train‑test split & training
X = df[num_cols + cat_cols]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)
7. Performance metrics
R² tells us what share of the variance in charges the six predictors capture, while MAE (mean absolute error, in dollars) shows the typical gap between our quote and the actual charge, which is easy for actuaries to digest.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f}")
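Both metrics are simple enough to compute by hand, which helps when explaining them to a non-technical audience. The sketch below uses four hypothetical charges and predictions (not results from the insurance data) and checks the manual arithmetic against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual charges and model quotes, in USD.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_hat = np.array([1200.0, 1800.0, 3300.0, 3900.0])

residuals = y_true - y_hat

# MAE: the average size of the miss, ignoring its direction.
mae_manual = np.mean(np.abs(residuals))

# R²: 1 minus (squared error of the model / squared error of always quoting the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual: {mae_manual:.1f}  sklearn: {mean_absolute_error(y_true, y_hat):.1f}")
print(f"R²  manual: {r2_manual:.3f}  sklearn: {r2_score(y_true, y_hat):.3f}")
```

Reading R² this way makes the baseline explicit: a value of 0 means the model is no better than quoting everyone the average charge.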
8. Inspect coefficients (cost drivers)
The coefficient table instantly reveals headline insights: a smoker flag usually dwarfs every other factor; a higher BMI adds cost in a roughly linear manner; and some regions trend cheaper or pricier once health-cost geography is taken into account.
# recover feature names from the encoder
ohe_feats = (
    pipe.named_steps['prep']
    .named_transformers_['cat']
    .get_feature_names_out(cat_cols)
)
feature_names = list(ohe_feats) + num_cols
coefs = pd.Series(
    pipe.named_steps['model'].coef_, index=feature_names
).sort_values()
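# The model has only 11 features (8 one-hot flags + 3 numeric), so show the
# five lowest and five highest coefficients to keep the two lists from overlapping.
print("\nLeast-expensive factors (negative impact on premium):")
print(coefs.head(5))
print("\nMost-expensive factors (positive impact):")
print(coefs.tail(5))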
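Two details affect how the coefficient table should be read. Numeric coefficients are expressed per standard deviation because of the scaling step, and one-hot flags for the same category are only meaningful relative to each other, since every level gets its own column. The sketch below illustrates both on synthetic data with a known, hypothetical ground truth (250 USD per year of age, 20,000 USD smoker surcharge); the names toy, toy_pipe, and the dollar figures are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with a known (hypothetical) pricing rule and no noise.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
toy["charges"] = 1000 + 250 * toy["age"] + 20000 * (toy["smoker"] == "yes")

# Same structure as the article's pipeline, reduced to two predictors.
toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
        ("num", StandardScaler(), ["age"]),
    ])),
    ("model", LinearRegression()),
])
toy_pipe.fit(toy[["age", "smoker"]], toy["charges"])

prep = toy_pipe.named_steps["prep"]
coef = toy_pipe.named_steps["model"].coef_  # order: smoker_no, smoker_yes, age

# Undo the scaling: dollars per standard deviation -> dollars per year of age.
age_scale = prep.named_transformers_["num"].scale_[0]
usd_per_year_of_age = coef[2] / age_scale

# The smoker effect is the gap between the two flags, not either flag alone.
smoker_gap = coef[1] - coef[0]

print(f"USD per year of age: {usd_per_year_of_age:,.2f}")
print(f"Smoker vs non-smoker gap: {smoker_gap:,.2f}")
```

On the real dataset the same two steps (dividing by scale_ and differencing flags within a category) turn the raw coefficient table into per-unit dollar statements.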
9. Pipeline persistence
Pipeline persistence bundles preprocessing and coefficients in a single .pkl; tomorrow’s web form can call joblib.load and deliver on‑the‑spot quotes without code drift.
joblib.dump(pipe, "insurance_premium_linreg.pkl")
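Serving a quote later only requires loading the file and passing a one-row DataFrame with the same column names used in training. The sketch below is self-contained, so it first fits a small stand-in pipeline on made-up rows and writes it to a temporary directory; the training values, the applicant, and the name demo_pipe are all hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["age", "bmi", "children"]
cat_cols = ["sex", "smoker", "region"]

# Made-up training rows standing in for insurance.csv.
train = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61],
    "bmi": [27.9, 22.7, 30.1, 30.8, 25.0, 33.2],
    "children": [0, 1, 2, 3, 0, 1],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest",
               "northeast", "southeast", "northwest"],
    "charges": [16000.0, 4500.0, 8200.0, 30000.0, 3900.0, 13500.0],
})

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])),
    ("model", LinearRegression()),
])
demo_pipe.fit(train[num_cols + cat_cols], train["charges"])

# Round-trip through disk, as a web service would after deployment.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "insurance_premium_linreg.pkl")
    joblib.dump(demo_pipe, model_path)
    loaded = joblib.load(model_path)

# A hypothetical applicant; column names must match the training frame.
applicant = pd.DataFrame([{
    "age": 40, "bmi": 26.5, "children": 2,
    "sex": "female", "smoker": "no", "region": "southeast",
}])
quote = loaded.predict(applicant)[0]
print(f"Estimated annual charge: ${quote:,.0f}")
```

Because the preprocessing travels inside the saved pipeline, the calling code passes raw form values and never re-implements the encoding or scaling.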
Summary
This concise workflow converts a public enrollment file into an explainable insurance premium predictor. In a handful of lines, we:
- Inspected the data and encoded categorical attributes.
- Fit and vetted a linear regression.
- Quantified how each personal trait nudges the annual charge.
The resulting model serves three key roles: as a deployable baseline for instant quotes, a transparent check on fairness, and a benchmark against which more sophisticated algorithms must demonstrate their worth.