Healthcare Facility Cost Prediction using ElasticNet Algorithm in ML

FREE Online Courses: Dive into Knowledge for Free. Learn More!

Hospital finance teams must estimate total facility cost (USD) for each inpatient episode in advance—while the stay is still in progress—to tune case‑mix forecasts, manage DRG margins, and negotiate payer bundles. Historical SPARCS (New York State) data show that costs vary by age group, severity, procedure group, length of stay, admission source, insurance class, and hospital type. These variables are highly collinear (e.g., emergency admissions ↔ shorter booking time ↔ certain DRGs), so:

Ordinary least‑squares gives unstable coefficients.
Pure Lasso (ℓ¹) can over‑shrink and drop genuinely helpful categorical dummies.

Elastic Net blends Ridge’s ℓ² stability with Lasso’s sparsity, yielding a transparent yet robust model that predicts Facility_Cost_USD for unseen admissions.

Libraries Required

Purpose	Library
Data wrangling	pandas, numpy
Visuals	matplotlib, seaborn
ML pipeline	scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Metrics	mean_squared_error, r2_score

Dataset

Healthcare Facility Cost Prediction

Step-by-Step Code Implementation

Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

Load dataset

Dataset: NY SPARCS inpatient discharge file summarised by DRG, severity, payer, etc., including facility “Cost” field.

df = pd.read_csv("NY_inpatient_cost.csv")   # adjust to actual file name
print(df.head())

Target & predictors

Elastic Net necessity: Length of Stay correlates with Severity; DRG code aligns with Type of Admission and Age Group. The Ridge component stabilises these bundles, and the Lasso component zeros out rare dummy columns.

# Target variable – facility billed cost (USD)
y = df['Total Costs']

# Predictors available on admission
X = df[['Age Group', 'Gender', 'Race', 'Length of Stay',
        'APR DRG Code', 'APR Severity of Illness', 'Type of Admission',
        'Payment Typology 1', 'Hospital County']]

cat_cols = ['Age Group','Gender','Race','APR DRG Code',
            'APR Severity of Illness','Type of Admission',
            'Payment Typology 1','Hospital County']
num_cols = ['Length of Stay']

Elastic Net pipeline

Pipeline integrity: encoding & scaling occur within CV folds, eliminating leakage and enabling push‑button deployment (search.predict).

preprocess = ColumnTransformer([
    ('categorical', OneHotEncoder(drop='first'), cat_cols),
    ('numeric', StandardScaler(), num_cols)
])

pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20000, random_state=42))
])

Train/test split & hyper‑parameter grid search

Hyper‑grid: 162 models (18 α × 9 ratios) scanned; cross‑validated RMSE chooses the bias‑variance sweet spot.

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=df['APR DRG Code'])

param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),   # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}

search = GridSearchCV(pipe, param_grid,
                      cv=5,
                      scoring='neg_root_mean_squared_error',
                      n_jobs=-1, verbose=1)
search.fit(X_train, y_train)

print("Best α:", search.best_params_['enet__alpha'])
print("Best l1_ratio:", search.best_params_['enet__l1_ratio'])

Evaluate on the hold‑out set

y_pred = search.predict(X_test)
rmse = mean_squared_error(y_test, y_pred, squared=False)
r2   = r2_score(y_test, y_pred)

print(f"Hold‑out RMSE: ${rmse:,.0f} | R²: {r2:.3f}")

Interpret top coefficients

Insights: coefficient plot quantifies, for example, +$3,500 for APR Severity = “Extreme”, −$800 for Medicaid payer after controlling for other factors, and +$1,200 per extra hospital day.

# Recover full feature names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['categorical']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])

# Un‑scale numeric coefficient(s)
scale = search.best_estimator_.named_steps['prep'].named_transformers_['numeric'].scale_
coef  = search.best_estimator_.named_steps['enet'].coef_
coef[-len(num_cols):] = coef[-len(num_cols):] / scale

imp = (pd.Series(coef, index=feature_names)
         .sort_values(key=abs, ascending=False)
         .head(20))

plt.figure(figsize=(9,6))
imp.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net – Key Drivers of Inpatient Cost')
plt.xlabel('Δ Cost (USD)'); plt.tight_layout(); plt.show()

Summary

With ~130 lines of Python, we produced a robust Elastic Net model that:

Predicts inpatient facility costs early with low error, enabling better budgeting and contract negotiations.
Balances multicollinearity & sparsity—keeps correlated clinical drivers while dropping noise.
Provides explainable dollar impacts on cost per severity, payer, county, and stay length for operational insight.

Updating the model each quarter is trivial: reload fresh SPARCS data, rerun search.fit(), and the entire forecasting pipeline stays up to date.

Did you like this article? If Yes, please give ProjectGurukul 5 Stars on Google | Facebook