Patient Care Cost Prediction using Stepwise Regression in ML

Healthcare providers and payers need accurate estimates of individual patient care costs—encompassing hospitalization, treatments, medications, and follow‑up—to budget effectively, negotiate reimbursements, and identify high‑cost drivers.

In this patient care cost prediction project, we predict the annual medical cost (charges) for patients from demographic and clinical features (age, sex, BMI, number of dependents, smoking status, and region) by fitting a linear regression model with stepwise feature selection.

The resulting parsimonious model highlights the key factors that affect cost, enabling stakeholders to predict expenditures and design targeted interventions.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Ordinary Least Squares regression  
from sklearn.model_selection import train_test_split   # Data splitting  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

Medical Cost Personal Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load the “Medical Cost Personal” dataset, which includes age, sex, BMI, children, smoker status, region, and annual medical charges for 1,338 U.S. individuals.

Initial .info() and .describe() calls confirm the data types and give a first look at the range and skew of charges, which helps flag potential outliers.

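# Block 1: Load dataset
# Medical Cost Personal Dataset - Kaggle
# Source: https://www.kaggle.com/datasets/mirichoi0218/insurance/download
# Kaggle downloads require a login, so pd.read_csv cannot read this URL directly.
# Download the CSV manually first; the local file name below is an assumption.
df = pd.read_csv("insurance.csv")

print(df.head())
df.info()  # prints its own report and returns None, so no print() is needed
print(df['charges'].describe())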
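
The summary statistics above only hint at outliers. One common way to make the check explicit is the 1.5 × IQR rule, sketched below. The charge values are made up for illustration and are not taken from the dataset, and the 1.5 multiplier is a convention rather than something this project prescribes.

```python
import pandas as pd

# Hypothetical charges (USD); not real rows from the dataset
charges = pd.Series([1200.0, 3500.0, 4200.0, 5100.0, 6400.0, 8900.0, 52000.0])

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged for a closer look, not removed automatically
outliers = charges[(charges < lower) | (charges > upper)]
print(f"fences: [{lower:.0f}, {upper:.0f}]")
print(outliers)
```

On the real data, the same lines could be applied to df['charges'] to see how many patients fall far above the typical cost range.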

Data Preprocessing

We remove any incomplete records and one‑hot encode sex, smoker, and region to convert them into numeric predictors.
We then separate features (X) from the target (charges) and perform an 80/20 train‑test split to ensure unbiased evaluation.

# Block 2: Clean & encode features
# Drop any rows with missing values
df = df.dropna(subset=['age','sex','bmi','children','smoker','region','charges'])

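# One-hot encode categorical variables
# dtype=float keeps the dummies numeric; recent pandas versions return
# booleans by default, which statsmodels OLS may reject as object data
df_enc = pd.get_dummies(df,
                        columns=['sex','smoker','region'],
                        drop_first=True,
                        dtype=float)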

# Define predictors and target
X = df_enc.drop('charges', axis=1)
y = df_enc['charges']

# Split into training and test sets (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
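
To see what one-hot encoding with drop_first=True produces, here is a minimal sketch on a three-row toy frame. The rows are hypothetical; only the column names follow the dataset. The first category of each variable (alphabetically) is dropped and becomes the baseline that the remaining dummy coefficients are compared against.

```python
import pandas as pd

# Hypothetical three-row frame that reuses the dataset's column names
toy = pd.DataFrame({
    "age": [19, 45, 33],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southeast"],
})

# drop_first=True removes one level per variable ("no", "northeast")
# so the dummies are not perfectly collinear with the intercept
toy_enc = pd.get_dummies(toy, columns=["smoker", "region"],
                         drop_first=True, dtype=float)
print(toy_enc.columns.tolist())
# → ['age', 'smoker_yes', 'region_southeast', 'region_southwest']
```

A coefficient on smoker_yes is therefore read as the cost difference between smokers and non-smokers, holding the other predictors fixed.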

Stepwise Regression Function

Our stepwise_selection function alternates between two steps.
The forward step adds the excluded feature with the lowest p-value, provided that value is below 0.01. The backward step removes the included feature with the highest p-value, provided that value is above 0.05.
The loop repeats until neither step changes the model, which yields a small set of statistically significant predictors.

# Block 3: Forward–backward stepwise selection
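def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Use None instead of a mutable default; start from an empty model unless told otherwise
    included = list(initial_list) if initial_list is not None else []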
    while True:
        changed = False

        # Forward step: evaluate adding each excluded predictor
        # Keep the original column order so runs are reproducible
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            pval = sm.OLS(y, sm.add_constant(X[included + [col]])).fit().pvalues[col]
            new_pvals[col] = pval
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:20} p-value {best_pval:.4f}")

        # Backward step: evaluate removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:20} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

We fit an Ordinary Least Squares regression on the selected features using statsmodels.
The .summary() output provides coefficient estimates (USD impact per unit change), p-values (significance), R², adjusted R², and diagnostic statistics (AIC, F-statistic), clarifying each variable's effect on cost.
Predictions on the held-out test set then yield R² (variance explained) and RMSE (root-mean-square error), which quantify the model's predictive accuracy on unseen data.

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

# Fit the final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_features])
y_pred = model.predict(X_test_sel)

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
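
For reference, R² is 1 minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. The small worked example below computes both by hand and checks them against scikit-learn. The four actual and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical actual and predicted charges (USD)
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_hat = np.array([1100.0, 1900.0, 3200.0, 3800.0])

# Residual and total sums of squares
ss_res = np.sum((y_true - y_hat) ** 2)          # 100,000
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 5,000,000

r2_manual = 1 - ss_res / ss_tot                 # 0.98
rmse_manual = np.sqrt(ss_res / len(y_true))     # sqrt(25,000), about 158.11

print(r2_manual, r2_score(y_true, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
```

RMSE is in the same units as the target, so here it reads as a typical error of roughly 158 USD per prediction.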

Residual Diagnostics

We plot residuals against predicted charges to check for non-random patterns and heteroscedasticity, either of which would violate the assumptions of linear regression, and to spot observations with unusually large residuals.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Annual Charges (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Charges")
plt.show()

Summary

By applying stepwise regression to patient cost data, we isolate the most influential cost drivers—such as age, BMI, and smoking status—while pruning non‑informative predictors.

The resulting linear model balances interpretability (few, statistically significant features) with predictive performance, judged by test-set R² and RMSE. It gives healthcare analysts and payers a transparent tool to forecast individual care expenses and guide policy and pricing decisions.
