Patient Recovery Cost Prediction using Stepwise Regression in ML

Healthcare administrators need to estimate the cost of patient recovery—covering post‑operative care, medications, rehabilitation, and follow‑up visits—to budget resources and negotiate reimbursements. In this project, we will predict the total recovery cost for hospital inpatients based on demographic factors (age, gender), clinical indicators (length of stay, number of procedures, comorbidity index), and care metrics (ICU days, medication count). We will use stepwise regression to identify the most influential cost drivers and develop a linear model to help hospitals predict recovery expenditures.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Data splitting  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation  
import matplotlib.pyplot as plt   # Visualization  

Dataset

Hospital Patient Records

Step-by-Step Code Implementation

Data Loading & Initial Inspection

Load a hospital records dataset containing patient demographics, clinical metrics, and total recovery costs. The initial commands (.head(), .info(), .describe()) show sample rows, column types with non-null counts, and summary statistics, which helps verify data completeness and distributions.

# Block 1: Load dataset
# Hospital patient records with cost info
df = pd.read_csv("hospital_patient_records.csv")

# Initial inspection
print(df.head())
df.info()  # info() prints its report directly and returns None
print(df.describe())
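
To make the completeness check concrete, the sketch below counts missing values per column. It runs on a small made-up frame (the values and the name toy are hypothetical, not taken from the hospital dataset), so treat it as an illustration of the idea rather than output from the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the hospital dataset (hypothetical values, illustration only)
toy = pd.DataFrame({
    "Age": [54, 67, np.nan, 41],
    "Length_of_Stay": [3, 8, 5, np.nan],
    "Recovery_Cost": [4200.0, 11800.0, 7600.0, np.nan],
})

# Count and share of missing values in each column
missing_count = toy.isna().sum()
missing_share = toy.isna().mean().round(2)
report = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(report)

# Number of rows that have no missing value in any column
complete_rows = len(toy.dropna())
print("Complete rows:", complete_rows)
```

Running the same two lines (isna().sum() and dropna()) on df shows how many records the cleaning step in the next section would discard.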

Data Preprocessing

We remove records that are missing key predictors or the target. Gender is encoded as a binary indicator. Predictors include age, length of stay, procedure count, comorbidity index, ICU days, and medication count; the target is recovery cost. An 80/20 train–test split holds out data that is never used for feature selection or fitting, so the test metrics estimate performance on new patients.

# Block 2: Clean & encode
# Assume columns: 'Age', 'Gender', 'Length_of_Stay', 'Num_Procedures',
# 'Comorbidity_Index', 'ICU_Days', 'Medication_Count', 'Recovery_Cost'
df = df.dropna(subset=[
    'Age','Gender','Length_of_Stay','Num_Procedures',
    'Comorbidity_Index','ICU_Days','Medication_Count','Recovery_Cost'
])

# Encode gender
df['Gender_Male'] = (df['Gender'] == 'Male').astype(int)

# Define predictors and target
X = df[[
    'Age','Gender_Male','Length_of_Stay','Num_Procedures',
    'Comorbidity_Index','ICU_Days','Medication_Count'
]]
y = df['Recovery_Cost']

# Train–test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
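
One thing worth checking before relying on the gender indicator: the comparison df['Gender'] == 'Male' maps every other spelling (for example 'male' or 'M') to 0 without warning. The sketch below shows one possible way to normalise labels first. The sample labels and the label_map dictionary are assumptions for illustration; the real dataset may use different spellings.

```python
import pandas as pd

# Hypothetical raw labels; the real column may look different
raw = pd.Series(["Male", "male", " M", "Female", "F", "Unknown"])

# Naive encoding: only the exact string "Male" becomes 1
naive = (raw == "Male").astype(int)

# Normalise whitespace and case, then map known spellings explicitly
label_map = {"male": 1, "m": 1, "female": 0, "f": 0}  # assumed spellings
gender_male = raw.str.strip().str.lower().map(label_map)

print(naive.tolist())
print(gender_male.tolist())  # unmapped labels become NaN instead of a silent 0
```

Labels that stay NaN after mapping can then be inspected or dropped deliberately, instead of being counted as non-male.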

Stepwise Regression Function

The stepwise_selection function alternates:

  • Forward inclusion: adds the excluded feature with the lowest p‑value below 0.01.
  • Backward elimination: removes the included feature with the highest p‑value above 0.05.
  • Iteration continues until no further changes occur, yielding a concise set of significant predictors.

# Block 3: Forward–backward stepwise selection
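def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Keep threshold_in below threshold_out; otherwise a feature could be
    # added and dropped repeatedly and the loop might never terminate.
    included = list(initial_list) if initial_list else []
    while True:
        changed = False
        # Forward step: try each excluded feature (in column order, for reproducibility)
        excluded = [col for col in X.columns if col not in included]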
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:20} p-value {best_pval:.4f}")

        # Backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:20} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

  • Model Fitting: Using the selected features, we fit an Ordinary Least Squares regression via statsmodels. The .summary() output provides coefficient estimates, p‑values, R², and diagnostic statistics, which clarify each predictor’s impact on cost.
  • Evaluation: Predictions on the test set yield R² (explained variance) and RMSE (prediction error), quantifying model performance on unseen data.

# Block 4: Feature selection
selected = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict & evaluate
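# has_constant="add" forces the intercept column even if a selected
# feature happens to be constant within the test split
X_test_sel = sm.add_constant(X_test[selected], has_constant="add")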
y_pred = model.predict(X_test_sel)
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
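
For readers who want to see what the two metrics measure, here is a small worked example that computes R² and RMSE by hand with NumPy. The four cost values are made up for illustration and do not come from the model above.

```python
import numpy as np

# Hypothetical actual and predicted recovery costs (illustration only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

errors = y_true - y_hat
sse = np.sum(errors ** 2)                     # sum of squared errors
sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

r2 = 1.0 - sse / sst            # share of variance explained
rmse = np.sqrt(sse / len(y_true))  # typical error, in the target's units

print("R²:", r2)
print("RMSE:", rmse)
```

Here every prediction is off by 10, so the RMSE is 10.0, and R² is 1 - 400/50000 = 0.992. RMSE is expressed in the same units as the target (here, cost), which makes it easier to relate to a budget than R² alone.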

Residual Diagnostics

A residual plot checks for non‑random patterns or heteroscedasticity, validates OLS assumptions, and highlights any model misfit.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Recovery Cost (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()

Summary

By applying stepwise regression to patient recovery data, we identify which candidate cost drivers (for example length of stay, ICU days, and comorbidity index) are statistically significant and prune the rest. The resulting linear model is easy to interpret because it keeps only a few significant predictors, and its test-set R² and RMSE show how well it predicts costs for unseen patients. When those metrics are satisfactory, healthcare planners get a transparent forecasting tool for patient recovery expenditures.

ProjectGurukul Team
