Patient Recovery Time Prediction using Quantile Regression in ML


Length of hospital stay (LoS) is a major driver of both clinical outcomes and operational costs. Traditional models estimate the average LoS, but planners also need to anticipate variability, from fast recoveries (e.g., the 25th percentile) to prolonged stays (e.g., the 75th percentile), in order to allocate beds, staff, and post‑discharge resources efficiently.

In this patient recovery time prediction ML project, we’ll predict the 25th, 50th, and 75th percentiles of patient LoS (in days) from admission and demographic features (age, sex, admission type, primary diagnosis category, comorbidity score) by fitting a separate quantile regression model for each percentile. These predictions help hospital administrators plan for fast‑turnover throughput, typical demand, and high‑end capacity needs.

Libraries Required

import pandas as pd  
import numpy as np  
import statsmodels.formula.api as smf       # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts  
import matplotlib.pyplot as plt              # Residual diagnostics

Dataset

Hospital Length of Stay Dataset

Step-by-Step Code Implementation

Load & Inspect Data

We import a 100k‑record dataset of hospitalized patients, containing demographics, admission details, primary diagnosis category, comorbidity scores, and observed Length_of_Stay_Days. We inspect types and summary stats to confirm no critical gaps.

# Load the Hospital Length of Stay dataset (Microsoft)
df = pd.read_csv("hospital_length_of_stay_dataset_microsoft.csv")

# Quick look
print(df.head())
print(df.info())
print(df['Length_of_Stay_Days'].describe())
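
One way to make the "no critical gaps" check explicit is to report the share of missing values per key column. The sketch below uses a tiny made‑up frame in place of the real CSV; the helper name missing_report is hypothetical, and the column names simply follow the ones assumed in this tutorial.

```python
import numpy as np
import pandas as pd

def missing_report(frame, columns):
    """Return the fraction of missing values for each requested column."""
    return frame[columns].isna().mean().sort_values(ascending=False)

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [34, 71, np.nan, 58],
    "Comorbidity_Score": [1, 4, 2, 3],
    "Length_of_Stay_Days": [2, 9, 4, np.nan],
})

report = missing_report(toy, ["Age", "Comorbidity_Score", "Length_of_Stay_Days"])
print(report)
```

On the real data, the same call with the six key columns would show whether dropping incomplete rows discards a meaningful share of the records.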

Preprocessing & Feature Engineering

We drop any rows missing core predictors or the response.
We one‑hot encode categorical fields (Sex, Admission_Type, Primary_Diagnosis) to convert them into numeric features.
We rename the target column to LoS and exclude any identifier columns.

# Drop missing values in key columns
df = df.dropna(subset=[
    'Age','Sex','Admission_Type','Primary_Diagnosis',
    'Comorbidity_Score','Length_of_Stay_Days'
])

# Simplify categories if needed (e.g., group rare diagnoses)
# For brevity, assume 'Primary_Diagnosis' has manageable cardinality

# One‑hot encode categorical predictors as 0/1 integer columns
df_enc = pd.get_dummies(df, 
    columns=['Sex','Admission_Type','Primary_Diagnosis'], 
    drop_first=True,
    dtype=int
)

# Rename the response, then define predictors (everything except the response and ID)
df_enc = df_enc.rename(columns={'Length_of_Stay_Days':'LoS'})
features = [col for col in df_enc.columns 
            if col not in ['LoS','Patient_ID']]
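
To see what the encoding step produces, here is a minimal sketch on a made‑up three‑row frame (the category values are invented for illustration). With drop_first=True, the first level of each categorical column, in sorted order, becomes the baseline and gets no column of its own.

```python
import pandas as pd

# Made-up rows; only the column names mirror the tutorial's assumptions
toy = pd.DataFrame({
    "Age": [34, 71, 58],
    "Sex": ["F", "M", "F"],
    "Admission_Type": ["Elective", "Emergency", "Urgent"],
})

# dtype=int yields 0/1 columns instead of booleans
toy_enc = pd.get_dummies(
    toy, columns=["Sex", "Admission_Type"], drop_first=True, dtype=int
)
print(toy_enc.columns.tolist())
print(toy_enc)
```

Here "F" and "Elective" are the dropped baselines, so each remaining dummy coefficient is read relative to a female, elective admission.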

Train/Test Split

Using an 80/20 random split, we reserve 20% of patient records for honest, out‑of‑sample evaluation of our quantile models.

# Reserve 20% for out‑of‑sample evaluation
train, test = train_test_split(
    df_enc[['LoS'] + features], 
    test_size=0.2, 
    random_state=42
)

Fit Quantile Regression Models

For each desired percentile (25th, 50th, 75th):

We build a regression formula linking LoS to all predictors.
We fit a QuantReg model at that quantile on the training set.
We print the coefficient table (.summary().tables[1]), which shows how each feature’s marginal effect on hospital stay length differs between the lower quartile, the median, and the upper quartile. For example, emergency admissions may add fewer days at the 25th percentile but substantially more at the 75th percentile.

quantiles = [0.25, 0.50, 0.75]
models    = {}
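# Wrap each name in Q("...") so dummy columns containing spaces or symbols stay valid in the formula
formula   = "LoS ~ " + " + ".join(f'Q("{col}")' for col in features)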

for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])
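
The idea that a feature's effect can differ by quantile is easier to see on synthetic data. The following is a small sketch, not part of the hospital dataset: x and y are simulated so that the noise spread grows with x, which means the true slope is smaller at the 25th percentile than at the 75th.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 10, size=n)
# Noise scale grows with x, so upper quantiles rise faster than lower ones
y = 2.0 + 0.5 * x + (0.2 + 0.3 * x) * rng.normal(size=n)
sim = pd.DataFrame({"x": x, "y": y})

slopes = {}
for q in [0.25, 0.50, 0.75]:
    res = smf.quantreg("y ~ x", sim).fit(q=q)
    slopes[q] = res.params["x"]  # estimated slope on x at this quantile
    print(f"q={q:.2f}: slope on x = {slopes[q]:.3f}")
```

An ordinary least squares fit would report a single slope near 0.5 and hide this fanning‑out pattern, which is the motivation for fitting one model per quantile.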

Evaluation with Pinball Loss

We generate quantile‑specific LoS predictions on the test set.
We compute the pinball loss for each quantile, a loss function that penalizes under‑ and over‑predictions asymmetrically according to the target quantile. A lower pinball loss indicates a more accurate quantile forecast; values are only comparable between models evaluated at the same quantile.

for q, res in models.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['LoS'], preds, alpha=q)
    print(f"{int(q*100)}th percentile pinball loss: {loss:.2f}")
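
For intuition about the asymmetry, the pinball loss can be written out by hand. This is a minimal NumPy sketch with made‑up numbers; the function name pinball_loss is ours, and it is meant to mirror what mean_pinball_loss computes rather than replace it.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Average pinball loss at quantile q (0 < q < 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q) per unit
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

y_true = [3, 5, 10]   # made-up stays, in days
y_pred = [4, 4, 4]    # a constant forecast of 4 days

loss_25 = pinball_loss(y_true, y_pred, q=0.25)
loss_75 = pinball_loss(y_true, y_pred, q=0.75)
print(f"q=0.25: {loss_25:.4f}")
print(f"q=0.75: {loss_75:.4f}")
```

The same 4‑day forecast scores worse at q = 0.75 than at q = 0.25, because the 6‑day under‑prediction for the longest stay is weighted three times as heavily at the upper quantile.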

Residual Diagnostics

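As an example, we plot residuals versus predicted values for the median model to check for non‑random patterns such as curvature, which would suggest the linear specification is missing structure. For a well‑fitted median model, roughly half of the residuals should fall on each side of zero. A spread that widens with the predicted value (heteroscedasticity) is not a violation here; it is one of the reasons to model several quantiles rather than the mean alone.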

# Example: residuals for the 50th‑percentile model
preds_med = models[0.50].predict(test[features])
resid_med = test['LoS'] - preds_med

plt.scatter(preds_med, resid_med, alpha=0.4)
plt.axhline(0, linestyle='--', color='gray')
plt.xlabel("Predicted Median LoS (days)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Median LoS")
plt.show()
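
A complementary numeric check is empirical coverage: for a well‑calibrated q‑th quantile forecast, about a fraction q of observed stays should fall at or below the prediction. The sketch below uses synthetic right‑skewed data and a constant forecast; the helper name empirical_coverage is hypothetical.

```python
import numpy as np

def empirical_coverage(y_true, y_pred):
    """Fraction of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))

rng = np.random.default_rng(1)
y = rng.exponential(scale=5.0, size=10_000)  # synthetic, right-skewed "stays"

coverage = {}
for q in [0.25, 0.50, 0.75]:
    pred = np.full_like(y, np.quantile(y, q))  # constant forecast at the sample quantile
    coverage[q] = empirical_coverage(y, pred)
    print(f"target {q:.2f} -> coverage {coverage[q]:.3f}")
```

Applied to the fitted models, the equivalent call would compare test['LoS'] with models[q].predict(test[features]); coverage far from q on the test set suggests that quantile model is miscalibrated.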

Summary

By applying quantile regression to hospital stay data, we obtain distribution‑aware forecasts of patient LoS:

The 25th‑percentile model highlights drivers of rapid discharges—crucial for optimizing throughput and day‑case planning.
The median (50th‑percentile) model provides typical LoS estimates for routine capacity management.
The 75th‑percentile model focuses on prolonged stays—informing contingency bed‑management and post‑acute care arrangements.

These quantile forecasts give hospital administrators and planners a basis for efficient resource allocation, dynamic bed scheduling, and robust budgeting under variable patient recovery patterns.
