Hospital Treatment Cost Prediction using Stepwise Regression in ML

FREE Online Courses: Elevate Your Skills, Zero Cost Attached - Enroll Now!

Healthcare insurance providers need precise cost estimates for patient treatment to manage budgets, set premiums, and identify cost drivers. In this project, we will anticipate individual medical charges based on patient location and health indicators (age, sex, BMI, smoking status, region, etc.). By employing stepwise regression, we aim to find the most impactful variables and develop a concise linear model that provides reliable cost forecasts with clear interpretability—allowing stakeholders to find the factors affecting medical expenses.

Libraries Required

import pandas as pd               # Data manipulation
import numpy as np                # Numerical operations
import statsmodels.api as sm      # Statistical modeling
from sklearn.model_selection import train_test_split   # Data splitting
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation
import matplotlib.pyplot as plt   # Visualization

Dataset

Medical Cost Personal Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We import the insurance dataset, which contains demographic and clinical variables, as well as individual medical charges. We examine its structure and summary statistics to understand the distributions of the variables.

# Block 1: Load dataset
url = "https://www.kaggle.com/mirichoi0218/insurance/download"
df = pd.read_csv(url)

# Inspect data
print(df.head())
print(df.info())
print(df.describe())

Data Preprocessing

Categorical columns (sex, smoker, region) are converted into binary dummy variables. We verify there are no missing values to ensure a clean modelling pipeline. The features (X) exclude the charges column, which serves as our response (y). We split the data into training and testing subsets for unbiased evaluation.

# Block 2: One‑hot encode categorical variables
df_enc = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)

# No missing values in this dataset; confirm nonetheless
assert df_enc.isnull().sum().sum() == 0

# Define predictors and target
X = df_enc.drop("charges", axis=1)
y = df_enc["charges"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

Stepwise Regression Function

The stepwise_selection function alternates between forward inclusion (adding the most statistically significant variable at a p‑value threshold of 0.01) and backward elimination (removing the least significant variable at a p-value threshold of 0.05). Iteration continues until no further variables qualify for addition or removal.

# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y, 
                       init_list=[], 
                       thresh_in=0.01, 
                       thresh_out=0.05, 
                       verbose=True):
    included = list(init_list)
    while True:
        changed = False
        # Forward step: consider adding each excluded variable
        excluded = list(set(X.columns) - set(included))
        new_pvals = pd.Series(dtype=float, index=excluded)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < thresh_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.6f}")

        # Backward step: consider removing variables
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > thresh_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.6f}")
        if not changed:
            break
    return included

Model Building & Evaluation

Model Fitting: Using the selected subset of predictors, we fit an Ordinary Least Squares regression model via statsmodels. The .summary() output provides coefficients, standard errors, p‑values, and goodness‑of‑fit statistics.
Evaluation: We apply the fitted model to the test data and calculate R² (explained variance) and RMSE (root-mean-square error) to quantify performance.

# Block 4: Select features via stepwise regression
selected = stepwise_selection(X_train, y_train)

# Fit the final OLS model
X_train_sel = sm.add_constant(X_train[selected])
model = sm.OLS(y_train, X_train_sel).fit()

# Review summary
print(model.summary())

# Predict on test set
X_test_sel = sm.add_constant(X_test[selected])
y_pred = model.predict(X_test_sel)

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

Residual Diagnostics

Plotting residuals against predicted values helps check for non‑random patterns, heteroscedasticity, or outliers—key assumptions in linear regression.

# Block 5: Residual plot
resid = y_test - y_pred
plt.scatter(y_pred, resid)
plt.axhline(0, color="black", linestyle="--")
plt.xlabel("Predicted Charges")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted")
plt.show()

Summary

Applying stepwise regression to the medical cost dataset yields a focused linear model that highlights major cost drivers—such as age, BMI, smoking status, and regional factors—while discarding less informative variables. The resulting model achieves strong explanatory power (high R²) and low prediction error (RMSE) on unseen data. This parsimonious approach not only streamlines feature interpretation for healthcare decision‑makers but also supports transparent forecasting of treatment costs for budgeting, policy design, and risk assessment.

Did we exceed your expectations?
If Yes, share your valuable feedback on Google | Facebook