Hospital Cost Prediction using Quantile Regression in ML


Traditional cost‐prediction models estimate the average hospital expenditure per patient, but healthcare budgets and insurance reimbursements require an understanding of cost variability—especially the high‐cost “tail” cases.

In this project, we will predict quantiles (e.g., 25th, 50th, 75th percentiles) of patient annual medical charges based on demographic and clinical features (age, sex, BMI, number of children, smoking status, region).

By fitting a separate quantile regression model for each target percentile, we can see how predictors influence low-, median-, and high-cost patients differently. This lets payers and hospital administrators plan for typical cases and set aside reserves for high-cost ones.
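To make "predicting a quantile" concrete, here is a minimal sketch on synthetic, right-skewed data (not the insurance dataset). It suggests that the constant minimizing the pinball (check) loss at level q sits close to the empirical q-th quantile, which is the idea quantile regression extends to predictors. The helper name `pinball_loss` and the simulated values are assumptions made for illustration.

```python
import numpy as np

def pinball_loss(y, pred, q):
    """Average pinball (check) loss of predictions `pred` at quantile level q."""
    diff = y - pred
    # Under-predictions (diff > 0) are weighted by q, over-predictions by (1 - q)
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

rng = np.random.default_rng(0)
# Synthetic, right-skewed "charges" (illustrative only)
y = rng.lognormal(mean=9.0, sigma=1.0, size=2000)

# Search a grid of constant predictions for the one with the lowest loss
grid = np.linspace(y.min(), np.quantile(y, 0.99), 2000)
best = {}
for q in (0.25, 0.50, 0.75):
    losses = [pinball_loss(y, c, q) for c in grid]
    best[q] = grid[int(np.argmin(losses))]
    print(f"q={q:.2f}  loss-minimizing constant={best[q]:,.0f}  "
          f"empirical quantile={np.quantile(y, q):,.0f}")
```

The two printed columns should agree up to the grid resolution. A quantile regression model replaces the constant with a linear function of the features.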

Libraries Required

import pandas as pd  
import numpy as np  
import statsmodels.formula.api as smf    # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss  

Dataset

Medical Cost Personal Dataset

Step-by-Step Code Implementation

Load & Inspect Data

We import the “insurance” dataset (1,338 records) containing patient demographics and annual medical charges. An initial look at .info() and .describe() lets us check column types and confirm that no critical fields are missing.

# Load the Medical Cost Personal dataset
# Source: Kaggle 
df = pd.read_csv("insurance.csv")

# Inspect structure and key statistics
print(df.head())
df.info()  # info() prints its report directly and returns None
print(df.describe(include='all'))
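The missing-value check can also be made explicit rather than read off the .info() output. The sketch below uses a tiny hypothetical frame with the same column names as insurance.csv; the rows are invented for illustration, and one value is deliberately left missing.

```python
import pandas as pd

# Tiny hypothetical frame with the same columns as insurance.csv (values invented)
sample = pd.DataFrame({
    "age": [19, 45, 31],
    "sex": ["female", "male", "female"],
    "bmi": [27.9, None, 33.0],          # one deliberately missing value
    "children": [0, 2, 1],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southeast"],
    "charges": [16884.92, 8240.59, 4449.46],
})

# Count missing values per column; any non-zero entry needs attention
missing = sample.isna().sum()
print(missing[missing > 0])

# List the distinct labels so later .map() calls can cover every value
for col in ["sex", "smoker", "region"]:
    print(col, sorted(sample[col].dropna().unique()))
```

Running the same two checks on `df` would show whether any column needs imputation and whether the category labels match the ones used in the encoding step.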

Preprocessing

  • sex and smoker are mapped to binaries.
  • region is one‑hot encoded (3 dummy variables).
  • We assemble features (age, sex, BMI, children, smoker, region dummies) and isolate the response charges.

# Encode categorical variables
df['sex']    = df['sex'].map({'female':0, 'male':1})
df['smoker'] = df['smoker'].map({'no':0, 'yes':1})
# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
df = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)

# Define predictors and target
features = ['age','sex','bmi','children','smoker'] + \
           [c for c in df.columns if c.startswith('region_')]
X = df[features]
y = df['charges']

Train/Test Split

We reserve 20% for out‑of‑sample evaluation, concatenating predictors and target into train and test DataFrames to simplify formula fitting.

# Hold out 20% for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
train = pd.concat([X_train, y_train], axis=1)
test  = pd.concat([X_test,  y_test],  axis=1)

Fit Quantile Regression Models

For each quantile (25th, 50th, 75th):

  • We construct a statsmodels formula linking charges to all predictors.
  • We fit a QuantReg model at that quantile.
  • We print the coefficient table, which shows how each predictor’s estimated effect varies across cost percentiles (e.g., smoking may raise the 75th‐percentile cost much more than the median).

quantiles = [0.25, 0.50, 0.75]
results  = {}

formula = "charges ~ " + " + ".join(features)

for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    results[q] = res
    print(f"\n=== Quantile {int(q*100)}th ===")
    print(res.summary().tables[1])  # coefficient table only
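Reading three separate coefficient tables makes it hard to see how an effect changes across quantiles, so it can help to place the estimates side by side. The sketch below does this on synthetic data in which the smoker effect is constructed to grow with the quantile. The frame `demo`, its columns, and the data-generating numbers are assumptions for illustration, not results from the insurance dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
# Synthetic data (illustrative only): smokers get a larger, more variable cost
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker": rng.integers(0, 2, size=n),
})
noise = rng.exponential(scale=1.0, size=n)
demo["charges"] = 2000 + 200 * demo["age"] + (1000 + 4000 * demo["smoker"]) * noise

# Fit one model per quantile and collect the coefficients side by side
coefs = pd.DataFrame({
    f"q{int(q * 100)}": smf.quantreg("charges ~ age + smoker", demo).fit(q=q).params
    for q in (0.25, 0.50, 0.75)
})
print(coefs.round(1))
```

With the `results` dictionary from the loop above, the same table could be built as `pd.DataFrame({q: res.params for q, res in results.items()})`.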

Evaluation via Pinball Loss

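Using the held‑out test set, we compute the pinball loss for each quantile. It is a proper scoring rule for quantile forecasts, so lower values indicate better quantile predictions. Losses at different quantile levels are on different scales, so each one is best judged against a baseline at the same level rather than against the other quantiles.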

for q, res in results.items():
    preds = res.predict(X_test)
    loss  = mean_pinball_loss(y_test, preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")

Summary

Quantile regression reveals the heterogeneous impact of patient attributes on different points of the cost distribution. For instance, smoking status might add $3,000 to the median charge but $6,000 to the 75th‑percentile charge (hypothetical figures for illustration; the fitted coefficient tables give the actual estimates).

By modelling the 25th, 50th, and 75th percentiles separately, hospital financial planners gain more nuanced forecasts: budgeting for lower‐cost cases, projecting typical expenditures, and provisioning for higher‐cost cases. Planning for truly extreme costs would call for higher quantiles, such as the 90th or 95th.

The resulting interpretable linear models help balance cost containment with quality care for diverse patient groups.


ProjectGurukul Team
