Hospital Treatment Cost Prediction using Stepwise Regression in ML
FREE Online Courses: Click for Success, Learn for Free - Start Now!
Healthcare insurance providers need precise cost estimates for patient treatment to manage budgets, set premiums, and identify cost drivers. In this project, we will anticipate individual medical charges based on patient location and health indicators (age, sex, BMI, smoking status, region, etc.). By employing stepwise regression, we aim to find the most impactful variables and develop a concise linear model that provides reliable cost forecasts with clear interpretability—allowing stakeholders to find the factors affecting medical expenses.
Libraries Required
import pandas as pd # Data manipulation import numpy as np # Numerical operations import statsmodels.api as sm # Statistical modeling from sklearn.model_selection import train_test_split # Data splitting from sklearn.metrics import r2_score, mean_squared_error # Evaluation import matplotlib.pyplot as plt # Visualization
Dataset
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We import the insurance dataset, which contains demographic and clinical variables, as well as individual medical charges. We examine its structure and summary statistics to understand the distributions of the variables.
# Block 1: Load dataset url = "https://www.kaggle.com/mirichoi0218/insurance/download" df = pd.read_csv(url) # Inspect data print(df.head()) print(df.info()) print(df.describe())
Data Preprocessing
Categorical columns (sex, smoker, region) are converted into binary dummy variables. We verify there are no missing values to ensure a clean modelling pipeline. The features (X) exclude the charges column, which serves as our response (y). We split the data into training and testing subsets for unbiased evaluation.
# Block 2: One‑hot encode categorical variables
df_enc = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)
# No missing values in this dataset; confirm nonetheless
assert df_enc.isnull().sum().sum() == 0
# Define predictors and target
X = df_enc.drop("charges", axis=1)
y = df_enc["charges"]
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=0
)
Stepwise Regression Function
The stepwise_selection function alternates between forward inclusion (adding the most statistically significant variable at a p‑value threshold of 0.01) and backward elimination (removing the least significant variable at a p-value threshold of 0.05). Iteration continues until no further variables qualify for addition or removal.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
init_list=[],
thresh_in=0.01,
thresh_out=0.05,
verbose=True):
included = list(init_list)
while True:
changed = False
# Forward step: consider adding each excluded variable
excluded = list(set(X.columns) - set(included))
new_pvals = pd.Series(dtype=float, index=excluded)
for col in excluded:
model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
new_pvals[col] = model.pvalues[col]
best_pval = new_pvals.min()
if best_pval < thresh_in:
best_var = new_pvals.idxmin()
included.append(best_var)
changed = True
if verbose:
print(f"Add {best_var:30} p-value {best_pval:.6f}")
# Backward step: consider removing variables
model = sm.OLS(y, sm.add_constant(X[included])).fit()
pvals = model.pvalues.iloc[1:] # exclude intercept
worst_pval = pvals.max()
if worst_pval > thresh_out:
worst_var = pvals.idxmax()
included.remove(worst_var)
changed = True
if verbose:
print(f"Drop {worst_var:30} p-value {worst_pval:.6f}")
if not changed:
break
return included
Model Building & Evaluation
- Model Fitting: Using the selected subset of predictors, we fit an Ordinary Least Squares regression model via statsmodels. The .summary() output provides coefficients, standard errors, p‑values, and goodness‑of‑fit statistics.
- Evaluation: We apply the fitted model to the test data and calculate R² (explained variance) and RMSE (root-mean-square error) to quantify performance.
# Block 4: Select features via stepwise regression
selected = stepwise_selection(X_train, y_train)
# Fit the final OLS model
X_train_sel = sm.add_constant(X_train[selected])
model = sm.OLS(y_train, X_train_sel).fit()
# Review summary
print(model.summary())
# Predict on test set
X_test_sel = sm.add_constant(X_test[selected])
y_pred = model.predict(X_test_sel)
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
Residual Diagnostics
Plotting residuals against predicted values helps check for non‑random patterns, heteroscedasticity, or outliers—key assumptions in linear regression.
# Block 5: Residual plot
resid = y_test - y_pred
plt.scatter(y_pred, resid)
plt.axhline(0, color="black", linestyle="--")
plt.xlabel("Predicted Charges")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted")
plt.show()
Summary
Applying stepwise regression to the medical cost dataset yields a focused linear model that highlights major cost drivers—such as age, BMI, smoking status, and regional factors—while discarding less informative variables. The resulting model achieves strong explanatory power (high R²) and low prediction error (RMSE) on unseen data. This parsimonious approach not only streamlines feature interpretation for healthcare decision‑makers but also supports transparent forecasting of treatment costs for budgeting, policy design, and risk assessment.