Patient Recovery Time Prediction using Quantile Regression in ML
Length of hospital stay (LoS) is a major driver of both clinical outcomes and operational costs. Traditional models estimate the average LoS, but planners also need to anticipate variability, from fast recoveries (e.g., the 25th percentile) to prolonged stays (e.g., the 75th percentile), in order to allocate beds, staff, and post-discharge resources efficiently.
In this patient recovery time prediction ML project, we'll predict the 25th, 50th, and 75th percentiles of patient LoS (in days) from admission and demographic features (age, sex, admission type, primary diagnosis category, comorbidity score) by fitting a separate quantile regression model for each percentile. These predictions will help hospital administrators plan for fast-turnover cases, typical demand, and longer-than-usual stays. Note that the 75th percentile is a high-demand scenario rather than a true worst case, since a quarter of stays are expected to exceed it.
Libraries Required
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf  # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts
import matplotlib.pyplot as plt  # Residual diagnostics
Dataset
Hospital Length of Stay Dataset
Step-by-Step Code Implementation
Load & Inspect Data
We import a 100k‑record dataset of hospitalized patients, containing demographics, admission details, primary diagnosis category, comorbidity scores, and observed Length_of_Stay_Days. We inspect types and summary stats to confirm no critical gaps.
# Load the Hospital Length of Stay dataset (Microsoft)
df = pd.read_csv("hospital_length_of_stay_dataset_microsoft.csv")
# Quick look
print(df.head())
print(df.info())
print(df['Length_of_Stay_Days'].describe())
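To make the check for gaps explicit, one option is to compute the share of missing values per column. The sketch below uses a small made-up frame rather than the real file; the values and the name `missing_share` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "Age": [34, 71, np.nan, 58],
    "Sex": ["F", "M", "M", None],
    "Length_of_Stay_Days": [2, 9, 4, 6],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same two lines can be applied to `df` after loading; columns with a large share of missing values may deserve a closer look before rows are dropped.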
Preprocessing & Feature Engineering
- We drop any rows missing core predictors or the response.
- We one‑hot encode categorical fields (Sex, Admission_Type, Primary_Diagnosis) to convert them into numeric features.
- We rename the target column to LoS and exclude any identifier columns.
# Drop missing values in key columns
df = df.dropna(subset=[
    'Age', 'Sex', 'Admission_Type', 'Primary_Diagnosis',
    'Comorbidity_Score', 'Length_of_Stay_Days'
])
# Simplify categories if needed (e.g., group rare diagnoses)
# For brevity, assume 'Primary_Diagnosis' has manageable cardinality
# One-hot encode categorical predictors; dtype=int keeps the dummies numeric
# (recent pandas versions return booleans by default)
df_enc = pd.get_dummies(
    df,
    columns=['Sex', 'Admission_Type', 'Primary_Diagnosis'],
    drop_first=True,
    dtype=int
)
# Rename the response, then define predictors (everything except the
# response and the identifier column)
df_enc = df_enc.rename(columns={'Length_of_Stay_Days': 'LoS'})
features = [col for col in df_enc.columns
            if col not in ['LoS', 'Patient_ID']]
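The preprocessing code assumes that Primary_Diagnosis has manageable cardinality. If it does not, one possible approach is to fold infrequent categories into a single label before one-hot encoding. The helper `group_rare_categories`, the `min_count` threshold, and the sample diagnosis labels below are hypothetical and shown only as a sketch.

```python
import pandas as pd

def group_rare_categories(series, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute other_label for the rest
    return series.where(~series.isin(rare), other_label)

# Made-up diagnosis labels: two frequent categories and two singletons
diagnoses = pd.Series(
    ["Cardiac"] * 5 + ["Respiratory"] * 4 + ["Rare_A", "Rare_B"]
)
grouped = group_rare_categories(diagnoses, min_count=3)
print(grouped.value_counts())
```

Applied to the real data, this would be called on `df['Primary_Diagnosis']` before `pd.get_dummies`, with a threshold chosen to suit the dataset.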
Train/Test Split
Using an 80/20 random split, we reserve 20% of patient records for honest, out‑of‑sample evaluation of our quantile models.
# Reserve 20% for out‑of‑sample evaluation
train, test = train_test_split(
    df_enc[['LoS'] + features],
    test_size=0.2,
    random_state=42
)
Fit Quantile Regression Models
For each desired percentile (25th, 50th, 75th):
- We build a regression formula linking LoS to all predictors.
- We fit a QuantReg model at that quantile on the training set.
- We print the coefficient table (.summary().tables[1]), which shows how each feature's marginal effect on hospital stay length differs between the lower quartile, the median, and the upper quartile. For example, emergency admissions may add fewer days at the 25th percentile but substantially more at the 75th percentile.
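quantiles = [0.25, 0.50, 0.75]
models = {}
# Q("...") quotes each column name so that dummy names containing spaces
# or punctuation remain valid in the formula
formula = "LoS ~ " + " + ".join(f'Q("{col}")' for col in features)
for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])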
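To see how coefficients can differ across quantiles, the following sketch fits the same three quantiles on synthetic data in which the spread of LoS grows with a single predictor. The data, the column name `score`, and the name `coef_table` are assumptions for illustration; none of them come from the hospital dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration, not the hospital dataset):
# the spread of LoS grows with the score, so the score's effect should be
# larger at the upper quantiles than at the lower ones.
rng = np.random.default_rng(0)
n = 5000
score = rng.uniform(0, 10, size=n)
noise = rng.normal(size=n)
toy = pd.DataFrame({
    "score": score,
    "LoS": 5.0 + 0.5 * score + (0.2 + 0.3 * score) * noise,
})

# Fit one model per quantile and collect the coefficients side by side
coef_table = pd.DataFrame({
    f"q={q:.2f}": smf.quantreg("LoS ~ score", toy).fit(q=q).params
    for q in (0.25, 0.50, 0.75)
})
print(coef_table.round(2))
```

On data like this, the slope for `score` should increase from the 25th to the 75th percentile, which is the pattern described above for features that mainly lengthen long stays.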
Evaluation with Pinball Loss
- We generate quantile‑specific LoS predictions on the test set.
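- We compute pinball loss for each quantile, a loss function that penalizes under- and over-predictions asymmetrically: for quantile q, under-predictions are weighted by q and over-predictions by 1 - q. Lower pinball loss indicates more accurate forecasts of that quantile. Because the weights differ, losses should be compared between models at the same quantile level, not across quantiles.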
for q, res in models.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['LoS'], preds, alpha=q)
    print(f"{int(q*100)}th percentile pinball loss: {loss:.2f}")
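The asymmetric penalty can also be written out by hand, which may help when reading the numbers above. The function `pinball_loss` and the sample arrays below are hypothetical; the function is a minimal sketch of the standard pinball loss definition, not a replacement for `mean_pinball_loss`.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile level q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per day;
    # over-prediction (diff < 0) costs (1 - q) per day
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

actual = np.array([3.0, 5.0, 10.0])    # made-up observed stays (days)
predicted = np.array([4.0, 4.0, 6.0])  # made-up forecasts (days)

loss_q75 = pinball_loss(actual, predicted, q=0.75)
loss_q25 = pinball_loss(actual, predicted, q=0.25)
print(loss_q75, loss_q25)
```

The same forecasts score worse at q=0.75 than at q=0.25 here because the largest error is an under-prediction, which the upper quantile penalizes more heavily.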
Residual Diagnostics
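As an example, we plot residuals versus predicted values for the median model to check for non-random patterns, such as curvature, that would suggest the linear specification is missing structure. For a well-specified median model, roughly half of the residuals should fall on each side of zero across the range of predictions. A spread that widens with the prediction is not a violation in itself, since quantile regression does not assume constant variance; it is the kind of variability the 25th- and 75th-percentile models are meant to capture.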
# Example: residuals for the 50th‑percentile model
preds_med = models[0.50].predict(test[features])
resid_med = test['LoS'] - preds_med
plt.scatter(preds_med, resid_med, alpha=0.4)
plt.axhline(0, linestyle='--', color='gray')
plt.xlabel("Predicted Median LoS (days)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Median LoS")
plt.show()
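A complementary numeric diagnostic is empirical coverage: for a forecast of quantile q, about a fraction q of observed stays should fall at or below the prediction. Because the three models are fitted separately, it can also be worth checking how often a lower-quantile forecast exceeds a higher one. The helpers `empirical_coverage` and `crossing_rate` and the synthetic exponential data below are assumptions made for this sketch.

```python
import numpy as np

def empirical_coverage(y_true, y_pred):
    """Fraction of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))

def crossing_rate(lower, upper):
    """Fraction of rows where the lower-quantile forecast exceeds the upper one."""
    return float(np.mean(np.asarray(lower) > np.asarray(upper)))

# Synthetic check: stays drawn from an exponential distribution, with
# constant forecasts set to that distribution's true quantiles
rng = np.random.default_rng(1)
scale = 5.0
los = rng.exponential(scale=scale, size=20000)
pred_q25 = np.full_like(los, -scale * np.log(1 - 0.25))
pred_q75 = np.full_like(los, -scale * np.log(1 - 0.75))

cov_q25 = empirical_coverage(los, pred_q25)
cov_q75 = empirical_coverage(los, pred_q75)
crossings = crossing_rate(pred_q25, pred_q75)
print(cov_q25, cov_q75, crossings)
```

With the fitted models, the same functions could be called on `test['LoS']` and the predictions from `models[0.25]`, `models[0.50]`, and `models[0.75]`; coverage far from the target quantile, or frequent crossings, would suggest revisiting the specification.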
Summary
By applying quantile regression to hospital stay data, we obtain distribution‑aware forecasts of patient LoS:
- The 25th‑percentile model highlights drivers of rapid discharges—crucial for optimizing throughput and day‑case planning.
- The median (50th‑percentile) model provides typical LoS estimates for routine capacity management.
- The 75th‑percentile model focuses on prolonged stays—informing contingency bed‑management and post‑acute care arrangements.
These quantile-specific forecasts give hospital administrators and planners a basis for efficient resource allocation, dynamic bed scheduling, and robust budgeting under variable patient recovery patterns.