Hospital Cost Prediction using Quantile Regression in ML
Traditional cost‐prediction models estimate the average hospital expenditure per patient, but healthcare budgets and insurance reimbursements require an understanding of cost variability—especially the high‐cost “tail” cases.
In this project, we will predict quantiles (e.g., 25th, 50th, 75th percentiles) of patient annual medical charges based on demographic and clinical features (age, sex, BMI, number of children, smoking status, region).
By fitting separate quantile regression models for each target percentile, we uncover how predictors influence low-, median-, and high-cost patients differently—enabling payers and hospital administrators to plan for typical cases and cap extreme expenditures.
Libraries Required
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf  # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss
Dataset
We use the Medical Cost Personal dataset from Kaggle, saved locally as insurance.csv.
Step-by-Step Code Implementation
Load & Inspect Data
We import the “insurance” dataset (1,338 records) containing patient demographics and annual medical charges. Initial .info() and .describe() calls let us confirm the column types and check that no critical fields are missing.
# Load the Medical Cost Personal dataset
# Source: Kaggle
df = pd.read_csv("insurance.csv")
# Inspect structure and key statistics
print(df.head())
print(df.info())
print(df.describe(include='all'))
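The missing-value check described above can also be made explicit. The following is a minimal sketch, assuming a small made-up frame (the hypothetical `toy`) in place of insurance.csv; its rows and the deliberately missing `bmi` value are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for insurance.csv (values are made up)
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "bmi": [27.9, np.nan, 30.1],   # one value deliberately missing
    "smoker": ["yes", "no", "no"],
    "charges": [16884.92, 4449.46, 8240.59],
})

# Count missing values per column; all zeros means nothing to impute or drop
missing = toy.isna().sum()
print(missing)

# Columns that would need attention before modelling
cols_with_gaps = missing[missing > 0].index.tolist()
print(cols_with_gaps)
```

Running the same two statements on df should show whether any column of the real dataset needs imputation before fitting.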
Preprocessing
- sex and smoker are mapped to binaries.
- region is one‑hot encoded (3 dummy variables).
- We assemble features (age, sex, BMI, children, smoker, region dummies) and isolate the response charges.
# Encode categorical variables
df['sex'] = df['sex'].map({'female':0, 'male':1})
df['smoker'] = df['smoker'].map({'no':0, 'yes':1})
# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
df = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)
# Define predictors and target
features = ['age', 'sex', 'bmi', 'children', 'smoker'] + \
    [c for c in df.columns if c.startswith('region_')]
X = df[features]
y = df['charges']
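To see what the encoding step produces, here is a small sketch on a hypothetical four-row frame (`toy`, with made-up rows, one per region). It assumes the region labels northeast, northwest, southeast, and southwest; with drop_first=True the alphabetically first level becomes the baseline, which leaves three dummy columns.

```python
import pandas as pd

# Hypothetical rows, one per region (values are made up for illustration)
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["no", "yes", "no", "no"],
    "region": ["northeast", "northwest", "southeast", "southwest"],
})

# Binary mappings, as in the preprocessing step
toy["sex"] = toy["sex"].map({"female": 0, "male": 1})
toy["smoker"] = toy["smoker"].map({"no": 0, "yes": 1})

# One-hot encode region; the first level (northeast) is dropped as the baseline
encoded = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype=int)
region_cols = [c for c in encoded.columns if c.startswith("region_")]

print(encoded)
print(region_cols)
```

A row whose region dummies are all zero belongs to the baseline region, so each region coefficient in the fitted models is read as a difference from that baseline.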
Train/Test Split
We reserve 20% for out‑of‑sample evaluation, concatenating predictors and target into train and test DataFrames to simplify formula fitting.
# Hold out 20% for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)
Fit Quantile Regression Models
For each quantile (25th, 50th, 75th):
- We construct a statsmodels formula linking charges to all predictors.
- We fit a QuantReg model at that quantile.
- We print the coefficient table, which shows how each predictor’s marginal effect varies across cost percentiles (e.g., smoking may raise the 75th-percentile cost much more than the median).
quantiles = [0.25, 0.50, 0.75]
results = {}
formula = "charges ~ " + " + ".join(features)
for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    results[q] = res
    print(f"\n=== Quantile {int(q*100)}th ===")
    print(res.summary().tables[1])  # coefficient table only
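The idea that a predictor's effect can differ by quantile is easiest to see on synthetic data. The sketch below is not the insurance model: it assumes a made-up relationship y = 1 + 2x + x·noise, where the noise spread grows with x, so the true slope at quantile q is 2 plus the standard normal q-quantile (about ±0.674 at the 75th and 25th percentiles). The names `toy` and `slopes` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: the noise standard deviation is proportional to x
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + x * rng.normal(0, 1, n)
toy = pd.DataFrame({"x": x, "y": y})

# Fit one quantile regression per level and keep the slope on x
slopes = {}
for q in (0.25, 0.50, 0.75):
    fit = smf.quantreg("y ~ x", toy).fit(q=q)
    slopes[q] = fit.params["x"]

for q, slope in slopes.items():
    print(f"q={q:.2f}: slope on x = {slope:.3f}")
```

The estimated slopes should increase with the quantile level, whereas an ordinary least-squares fit would report a single average slope. On the insurance data, the analogous comparison is results[q].params['smoker'] across the three fitted quantiles.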
Evaluation via Pinball Loss
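Using the held-out test set, we compute the pinball loss for each quantile, a proper scoring rule for quantile forecasts. Lower values indicate a better fit. Because the loss weights errors differently at each quantile level, the numbers are most informative when comparing competing models at the same percentile.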
for q, res in results.items():
    preds = res.predict(X_test)
    loss = mean_pinball_loss(y_test, preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
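For intuition about what the pinball loss measures, here is a minimal pure-Python sketch. At level q, each unit of under-prediction costs q and each unit of over-prediction costs 1 − q. The helper `pinball_loss` and the numbers are hypothetical, written for illustration; it is intended to match what mean_pinball_loss computes with alpha=q.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss at level q, with 0 < q < 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff > 0) costs q per unit; over-prediction costs 1 - q
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


y_true = [100.0, 200.0, 300.0]   # made-up charges
y_pred = [150.0, 150.0, 150.0]   # a constant forecast

loss_75 = pinball_loss(y_true, y_pred, 0.75)
loss_50 = pinball_loss(y_true, y_pred, 0.50)
print(loss_75, loss_50)
```

At q = 0.5 the loss reduces to half the mean absolute error, which is why the median model is the one that minimizes absolute deviations. At q = 0.75, forecasts that fall below the observed charge are penalized three times as heavily as forecasts that overshoot.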
Summary
Quantile regression reveals the heterogeneous impact of patient attributes on different points of the cost distribution. As a purely illustrative example, smoking status might add $3,000 to a median patient’s charges but $6,000 at the 75th percentile; the printed coefficient tables give the actual estimates for this dataset.
By modelling the 25th, 50th, and 75th percentiles separately, hospital financial planners gain nuanced forecasts: budgeting conservatively for lower-cost cases, projecting typical expenditures, and provisioning for the costlier upper quartile.
The resulting interpretable linear models help balance cost containment with quality care for diverse patient groups.