Medical Treatment Cost Prediction using Quantile Regression in ML


Healthcare payers and providers often budget around the average expected treatment cost, but very low-cost and very high-cost patients can have an outsized effect on budgets and risk pools. In this medical treatment cost prediction project, we predict the 25th, 50th, and 75th percentiles of individual annual medical charges (charges) from patient demographics and clinical features (age, sex, BMI, number of children, smoking status, region).

Libraries Required

import pandas as pd  
import numpy as np  
import statsmodels.formula.api as smf    # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss  # Proper loss function for quantile forecasts  

Dataset

Medical Cost Personal Dataset

Step-by-Step Code Implementation

Load & Inspect Data

We load a public dataset of 1,338 U.S. individuals’ demographics and annual medical charges. We inspect its schema (.info()) and the distribution of charges to understand its range and variance.

# Load the Medical Cost Personal dataset from Kaggle
# URL: https://www.kaggle.com/datasets/mirichoi0218/insurance
df = pd.read_csv("insurance.csv")

# Inspect structure and summary statistics
print(df.head())
df.info()  # prints the schema itself and returns None, so no print() wrapper is needed
print(df['charges'].describe())
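To see why the distribution of charges matters before modeling, here is a minimal sketch that compares the mean with the quartiles of a right-skewed sample. The values are hypothetical and are not taken from insurance.csv; only the standard library is used.

```python
import statistics

# Hypothetical, right-skewed annual charges (illustrative values only)
charges = [1200, 1800, 2500, 3100, 4000, 5200, 7400, 9800, 21000, 48000]

mean_charge = statistics.mean(charges)
# quantiles(..., n=4) returns the three quartile cut points
q1, median_charge, q3 = statistics.quantiles(charges, n=4)

print(f"mean            : {mean_charge:,.0f}")
print(f"25th percentile : {q1:,.0f}")
print(f"median          : {median_charge:,.0f}")
print(f"75th percentile : {q3:,.0f}")
```

In a skewed sample like this one, a few expensive patients pull the mean well above the median, which is the situation where separate quantile models tend to be more informative than a single average.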

Preprocessing & Feature Encoding

  • We map sex and smoker to binary indicators for modeling.
  • We one‑hot encode the four region categories, dropping one to prevent multicollinearity.
  • We assemble a features list containing five core patient attributes plus the region dummies, then rename charges to Cost.

# Map binary categories and one-hot encode region.
# dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to booleans).
df['sex']    = df['sex'].map({'female': 0, 'male': 1})
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})
df = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)

# Define predictors and rename target
features = ['age', 'sex', 'bmi', 'children', 'smoker'] + \
           [col for col in df.columns if col.startswith('region_')]
df.rename(columns={'charges': 'Cost'}, inplace=True)

# Drop any remaining missing values (should be none)
df = df.dropna(subset=features + ['Cost'])
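The effect of drop_first=True is easier to see on a tiny frame. The sketch below uses a hypothetical four-row DataFrame (one row per region, not taken from insurance.csv) and the same mapping and encoding calls as above.

```python
import pandas as pd

# Hypothetical four-row frame covering each region once (illustrative only)
demo = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["northeast", "northwest", "southeast", "southwest"],
})

# Binary mapping, as done for the real data
demo["smoker"] = demo["smoker"].map({"no": 0, "yes": 1})

# One-hot encode region; the alphabetically first category is dropped
encoded = pd.get_dummies(demo, columns=["region"], drop_first=True, dtype=int)
print(encoded)
```

Here northeast becomes the baseline: its row has zeros in all three remaining region columns, and each region coefficient in the later models is read relative to that baseline.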

Train/Test Split

We randomly hold out 20% of the data for out-of-sample evaluation, so we can measure how well the quantile regression models generalize to patients they were not fitted on.

# Reserve 20% for evaluation
train, test = train_test_split(df[features + ['Cost']],
                               test_size=0.2,
                               random_state=42)

Fit Quantile Regression Models

For each chosen quantile (25th, 50th, 75th percentiles):

  • We build a formula string (“Cost ~ age + sex + bmi + …”).
  • We fit a QuantReg model at that percentile on the training data.
  • We print only the coefficient table, showcasing how each predictor’s effect varies across the lower, median, and upper cost distribution (e.g., a smoker may add far more to the 75th‑percentile cost than to the median).

quantiles = [0.25, 0.50, 0.75]
results   = {}
formula   = "Cost ~ " + " + ".join(features)

for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])   # show coefficient table only
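Quantile regression is defined by minimizing the check (pinball) loss at the chosen quantile. As a simplified illustration, the sketch below drops all predictors and searches for the single constant that minimizes that loss; in this intercept-only case the minimizer is a sample quantile. The cost values and the helper names pinball_loss and best_constant are hypothetical, and this is not how statsmodels solves the problem internally.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (check) loss of predictions y_pred at quantile level q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q)
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def best_constant(values, q):
    """Observed value that, used as a constant prediction, gives the lowest pinball loss."""
    return min(values, key=lambda c: pinball_loss(values, [c] * len(values), q))


# Hypothetical annual costs, already sorted (illustrative only)
costs = [1200, 1800, 2500, 3100, 4000, 5200, 7400, 9800, 21000, 48000, 60000]

for q in (0.25, 0.50, 0.75):
    print(f"q={q:.2f} -> best constant prediction: {best_constant(costs, q)}")
```

With these eleven values the median model settles on the sixth value (5200), while the 25th and 75th percentile models settle on the third (2500) and ninth (21000). Adding predictors, as QuantReg does, lets that target quantile shift with age, BMI, smoking status, and so on.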

Evaluation with Pinball Loss

  • We generate quantile‑specific cost predictions on the test set.
  • We compute the pinball loss for each quantile q. This asymmetric loss weights under-predictions by q and over-predictions by 1 − q, so for a given q a lower value indicates predictions that track that cost quantile more closely.

for q, res in results.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['Cost'], preds, alpha=q)
    print(f"{int(q*100)}th percentile pinball loss: {loss:.2f}")
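Pinball loss can be complemented by two simple sanity checks, sketched below under stated assumptions. Empirical coverage is the share of actual costs at or below the predicted quantile, which should land near q. Because each quantile is fitted separately, predictions can also occasionally cross (for example, a 50th-percentile prediction above the 75th), so it is worth counting such rows. The helper names and all numbers here are hypothetical stand-ins for test['Cost'] and the per-quantile predictions.

```python
def empirical_coverage(y_true, y_pred):
    """Share of observations at or below their predicted quantile."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if y <= p)
    return hits / len(y_true)


def count_crossings(preds_by_quantile):
    """Count rows whose predictions decrease as the quantile level rises."""
    levels = sorted(preds_by_quantile)
    rows = zip(*(preds_by_quantile[q] for q in levels))
    return sum(1 for row in rows if any(a > b for a, b in zip(row, row[1:])))


# Hypothetical actual costs and per-quantile predictions (illustrative only)
actual = [2000, 3500, 5000, 8000, 12000, 30000, 4500, 9500]
preds = {
    0.25: [1500, 3000, 5200, 6000, 9000, 15000, 3800, 7000],
    0.50: [2500, 4000, 6000, 7500, 11000, 20000, 5000, 9000],
    0.75: [4000, 6000, 5800, 11000, 16000, 28000, 7000, 14000],
}

for q, p in preds.items():
    print(f"q={q:.2f} coverage: {empirical_coverage(actual, p):.3f}")
print("rows with crossing quantiles:", count_crossings(preds))
```

On a sample this small, coverage will only roughly match q; on the real test set, a large gap between coverage and q would suggest that the corresponding model is systematically too high or too low.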

Summary

Quantile regression shows how cost drivers differ across the cost distribution rather than only at the mean:

  • The 25th-percentile model describes how the predictors relate to the lower end of charges (e.g., healthier demographics).
  • The median (50th-percentile) model captures typical cost scenarios for budgeting routine expenditures.
  • The 75th-percentile model describes the upper end of charges (e.g., smokers), informing risk reserves and targeted interventions. The dataset has no comorbidity fields, so such effects are not modeled here.

By modeling multiple quantiles, healthcare administrators and payers gain a distribution-aware forecasting tool: a lower figure for low-cost scenarios, a planning figure for typical costs, and a provision for high-cost cases. Higher quantiles (for example, the 90th or 95th) can be fitted the same way when the most extreme cases matter.


ProjectGurukul Team
