Healthcare Facility Cost Prediction using ElasticNet Algorithm in ML
Hospital finance teams must estimate total facility cost (USD) for each inpatient episode in advance—while the stay is still in progress—to tune case‑mix forecasts, manage DRG margins, and negotiate payer bundles. Historical SPARCS (New York State) data show that costs vary by age group, severity, procedure group, length of stay, admission source, insurance class, and hospital type. These variables are highly collinear (e.g., emergency admissions ↔ shorter booking time ↔ certain DRGs), so:
- Ordinary least‑squares gives unstable coefficients.
- Pure Lasso (ℓ¹) can over‑shrink and drop genuinely helpful categorical dummies.
Elastic Net blends Ridge’s ℓ² stability with Lasso’s sparsity, yielding a transparent yet robust model that predicts Facility_Cost_USD for unseen admissions.
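For reference, the objective that scikit-learn's ElasticNet minimises can be written as below. The symbols are chosen here for illustration: α corresponds to the `alpha` parameter and ρ to `l1_ratio`, n is the number of training rows, and the intercept is omitted for brevity.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 recovers Lasso, while ρ = 0 leaves only the ℓ² (Ridge-type) penalty; values in between trade sparsity against stability.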
Libraries Required
| Purpose | Library |
| --- | --- |
| Data wrangling | pandas, numpy |
| Visuals | matplotlib, seaborn |
| ML pipeline | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Metrics | mean_squared_error, r2_score |
Dataset
Healthcare Facility Cost Prediction
Step-by-Step Code Implementation
Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
Load dataset
Dataset: NY SPARCS inpatient discharge file summarised by DRG, severity, payer, etc., including facility “Cost” field.
df = pd.read_csv("NY_inpatient_cost.csv") # adjust to actual file name
print(df.head())
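Depending on the export, numeric columns may arrive as text (for example a top-coded stay such as "120 +" or costs with "$" and thousands separators). The sketch below assumes that situation; `clean_numeric` is a hypothetical helper and the `demo` rows are made up, not SPARCS records.

```python
import pandas as pd

def clean_numeric(series: pd.Series) -> pd.Series:
    """Strip '$', ',', '+' and whitespace, then convert to float.

    Values that still cannot be parsed become NaN.
    """
    cleaned = series.astype(str).str.replace(r"[$,+\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up rows that mimic text-formatted numeric columns (assumption)
demo = pd.DataFrame({
    "Length of Stay": ["3", "120 +", "17"],
    "Total Costs": ["$5,333.10", "210000", "12,500.00"],
})

for col in ["Length of Stay", "Total Costs"]:
    demo[col] = clean_numeric(demo[col])

print(demo.dtypes)
```

Applying the same helper to `df['Length of Stay']` and `df['Total Costs']` before the split would keep StandardScaler and ElasticNet from failing on string input, if the file is formatted this way.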
Target & predictors
Elastic Net necessity: Length of Stay correlates with Severity; DRG code aligns with Type of Admission and Age Group. The Ridge component stabilises these bundles, and the Lasso component zeros out rare dummy columns.
# Target variable: facility cost (USD)
y = df['Total Costs']
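# Predictors known at or soon after admission (for a stay in progress, Length of Stay
# must be the expected or to-date value, since the final figure is known only at discharge)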
X = df[['Age Group', 'Gender', 'Race', 'Length of Stay',
'APR DRG Code', 'APR Severity of Illness', 'Type of Admission',
'Payment Typology 1', 'Hospital County']]
cat_cols = ['Age Group','Gender','Race','APR DRG Code',
'APR Severity of Illness','Type of Admission',
'Payment Typology 1','Hospital County']
num_cols = ['Length of Stay']
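The stabilising effect of the Ridge component on correlated predictors can be illustrated on synthetic data. The sketch below is not SPARCS data: `x1` and `x2` are two made-up, nearly identical features, and `coef_spread` is a hypothetical helper that refits a model on bootstrap resamples and reports how much each coefficient moves.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
n = 200

# Two almost perfectly collinear synthetic features (assumption for the demo)
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X_demo = np.column_stack([x1, x2])
y_demo = 3 * x1 + 3 * x2 + rng.normal(size=n)

def coef_spread(model, n_boot=50):
    """Standard deviation of each coefficient across bootstrap refits."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        coefs.append(model.fit(X_demo[idx], y_demo[idx]).coef_.copy())
    return np.std(coefs, axis=0)

ols_spread = coef_spread(LinearRegression())
enet_spread = coef_spread(ElasticNet(alpha=0.1, l1_ratio=0.5))

print("OLS coefficient spread:        ", ols_spread)
print("Elastic Net coefficient spread:", enet_spread)
```

On data like this, the least-squares coefficients swing widely from one resample to the next while the Elastic Net coefficients stay close together, which is the behaviour the model relies on for bundles such as Length of Stay and Severity.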
Elastic Net pipeline
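Pipeline integrity: the encoder and scaler are fitted inside each CV fold on the training portion only, which prevents leakage from validation rows, and the fitted search object accepts raw rows directly at deployment (search.predict).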
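preprocess = ColumnTransformer([
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    # instead of raising (combining it with drop= needs scikit-learn >= 1.0)
    ('categorical', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),
    ('numeric', StandardScaler(), num_cols)
])
pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20000, random_state=42))
])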
Train/test split & hyper‑parameter grid search
Hyper‑grid: 162 models (18 α × 9 ratios) scanned; cross‑validated RMSE chooses the bias‑variance sweet spot.
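# Stratifying on DRG keeps the case mix similar in both splits; it raises an error
# if any DRG code has fewer than two rows, so pool or drop such codes first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=df['APR DRG Code'])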
param_grid = {
'enet__alpha' : np.logspace(-3, 1, 18), # 0.001 → 10
'enet__l1_ratio': np.linspace(0.1, 0.9, 9) # Ridge‑heavy → Lasso‑heavy
}
search = GridSearchCV(pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1)
search.fit(X_train, y_train)
print("Best α:", search.best_params_['enet__alpha'])
print("Best l1_ratio:", search.best_params_['enet__l1_ratio'])
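As a quick sanity check on the size of the search, the grid can be enumerated with scikit-learn's `ParameterGrid` before any model is fitted. The names `n_candidates` and `n_fits` are introduced here for illustration; the fold count of 5 mirrors the `cv=5` setting.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "enet__alpha": np.logspace(-3, 1, 18),      # 18 values from 0.001 to 10
    "enet__l1_ratio": np.linspace(0.1, 0.9, 9), # 9 values from 0.1 to 0.9
}

n_candidates = len(ParameterGrid(param_grid))  # every alpha/l1_ratio pair
n_fits = n_candidates * 5                      # one fit per candidate per CV fold

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

This confirms 162 candidate models and 810 cross-validation fits, plus one final refit on the full training set, which is useful to know before launching the search on a large discharge file.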
Evaluate on the hold‑out set
y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False was removed in scikit-learn 1.6
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
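An RMSE in dollars is easier to judge next to a naive baseline. The sketch below uses three made-up cost values to show the RMSE calculation by hand and compares a model's error with that of always predicting the mean cost; `rmse_manual` and the sample numbers are illustrative assumptions.

```python
import numpy as np

def rmse_manual(y_true, y_hat):
    """Root mean squared error: sqrt of the average squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

# Invented costs and predictions, for illustration only
y_true_demo = [100.0, 200.0, 300.0]
y_model_demo = [110.0, 190.0, 310.0]
y_mean_demo = [float(np.mean(y_true_demo))] * len(y_true_demo)

model_rmse = rmse_manual(y_true_demo, y_model_demo)
baseline_rmse = rmse_manual(y_true_demo, y_mean_demo)

print(f"Model RMSE: {model_rmse:.2f} | Mean-baseline RMSE: {baseline_rmse:.2f}")
```

On the real hold-out set, the same comparison could be made by predicting `y_train.mean()` for every test row; a model whose RMSE is not clearly below that baseline is adding little.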
Interpret top coefficients
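Insights: the coefficient plot shows each feature's estimated dollar effect with the other predictors held fixed. Because the encoder drops the first level of every categorical variable, a dummy coefficient is the cost difference from that reference level; the Length of Stay coefficient, once un-scaled, is dollars per extra hospital day. A fitted model might show, for example, +$3,500 for APR Severity = “Extreme”, −$800 for a Medicaid payer, and +$1,200 per extra hospital day (illustrative figures, not results from this dataset). Elastic Net shrinks coefficients toward zero, so read them as penalised estimates rather than unbiased effects.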
# Recover full feature names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['categorical']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
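# Un-scale numeric coefficient(s): divide by the scaler's standard deviation
scale = search.best_estimator_.named_steps['prep'].named_transformers_['numeric'].scale_
# Copy first so the fitted model's coef_ array is not modified in place
coef = search.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scale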
imp = (pd.Series(coef, index=feature_names)
.sort_values(key=abs, ascending=False)
.head(20))
plt.figure(figsize=(9,6))
imp.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net – Key Drivers of Inpatient Cost')
plt.xlabel('Δ Cost (USD)'); plt.tight_layout(); plt.show()
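The un-scaling step can be checked on synthetic data where the true per-day effect is known. This sketch swaps in an unpenalised `LinearRegression` so the answer is exact; the `days` and `cost` arrays and the $1,200 slope are made up for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stays: cost rises by exactly $1,200 per day (assumption for the demo)
days = np.array([[1.0], [2.0], [4.0], [7.0], [10.0]])
cost = 2000.0 + 1200.0 * days.ravel()

scaler = StandardScaler().fit(days)
model = LinearRegression().fit(scaler.transform(days), cost)

per_sd = model.coef_[0]               # dollars per one standard deviation of stay
per_day = per_sd / scaler.scale_[0]   # dollars per additional day

print(f"Per standard deviation: ${per_sd:,.2f} | Per day: ${per_day:,.2f}")
```

Dividing by `scale_` converts the coefficient from "per standard deviation" back to "per day", which is why the plotted Length of Stay bar can be read directly in dollars. With Elastic Net the recovered value would be somewhat smaller than the true slope because of shrinkage.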
Summary
With a few dozen lines of Python, we built an Elastic Net model that:
- Predicts inpatient facility costs before discharge, with accuracy quantified by the hold-out RMSE and R², to support budgeting and contract negotiations.
- Balances multicollinearity and sparsity: it keeps correlated clinical drivers while dropping noise.
- Provides explainable dollar impacts on cost per severity, payer, county, and stay length for operational insight.
Updating the model each quarter is trivial: reload fresh SPARCS data, rerun search.fit(), and the entire forecasting pipeline stays up to date.