Medical Treatment Cost Prediction with Ridge Regression in ML


Hospitals in the United States send payers (insurers, Medicare, patients) an itemised bill at the end of every inpatient stay. Knowing before a patient is discharged how big that bill is likely to be lets finance teams:

flag cases that may breach a DRG outlier threshold,
negotiate bundled‑payment exceptions early, and
give patients realistic price estimates that satisfy transparency rules.

Using the public “Hospital Inpatient Charges” data released by CMS (Centers for Medicare & Medicaid Services), we will train a Ridge‑regression model that predicts the average total payment (USD) for a hospital and DRG combination from information already available on admission:

Predictor family	Examples in the dataset
Hospital profile	Provider ID, State, Urban / Rural indicator
Clinical DRG	DRG code and short description
Case mix	Number of discharges, Mean length of stay
Payer mix	Medicare‑covered % (implicit in CMS data)

Ridge regression (linear + L2 penalty) keeps the model transparent—every coefficient is a dollar amount—while damping unstable weights that often appear when DRG codes and hospital dummies overlap.
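To make the damping effect concrete, here is a minimal sketch on synthetic data, not the CMS file. It assumes two nearly identical columns standing in for an overlapping DRG dummy and hospital dummy, and solves the Ridge normal equations directly with NumPy. The helper name `ridge_closed_form` and all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design: two almost identical columns, mimicking a DRG dummy and a
# hospital dummy that nearly always switch on together (synthetic data).
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # near-duplicate of x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                       # centre so no intercept is needed
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)
y = y - y.mean()


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) beta = X^T y for beta."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


beta_ols = ridge_closed_form(X, y, alpha=0.0)     # plain least squares
beta_ridge = ridge_closed_form(X, y, alpha=10.0)  # with the L2 penalty

print("OLS   coefficients:", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With the penalty switched on, the two coefficients should come out closer to each other and smaller in overall norm, while their sum stays near the true combined effect of 3.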

Libraries Required

pandas # table wrangling
numpy # numerical helpers
matplotlib.pyplot # optional diagnostic plots
scikit‑learn # preprocessing, RidgeCV, metrics
joblib # save & reload the finished pipeline

Dataset Link

Hospital Inpatient Discharges Dataset

Step by Step Code Implementation

1. Import Libraries

Hospital and DRG dummies are correlated (some hospitals never bill certain DRGs). Ridge’s L2 penalty shrinks the resulting unstable coefficients, which typically improves generalisation and gives cleaner stories for finance teams. The imports below cover every later step.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the CMS dataset

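Total Discharges (roughly 1 to 2 000) and Average Length of Stay (1–30 days) sit on very different scales. Standardising them in the pipeline (step 5) keeps the L2 penalty from falling unevenly on the two features: left unscaled, the smaller‑magnitude feature needs a larger coefficient and is therefore shrunk harder.

# File downloaded via the Dataset Link section above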
df = pd.read_csv("Hospital_Inpatient_Discharges_Dataset.csv")
print(df.shape)
print(df.head())

Key raw fields (names in the 2016 file)

column	sample
Provider Id	330154
Provider State	NY
DRG Definition	“207 ‑ Respiratory system diagnosis w/o MCC”
Total Discharges	68
Average Length of Stay	4.9
Average Total Payments	← target (USD)
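Column names differ between releases of the CMS file, and some exports pad headers with spaces. The following sketch checks for the fields listed above before any modelling starts. The helper `missing_columns` and the one-row sample CSV (including its payment value) are made up for illustration; only the column names come from the table above.

```python
import io

import pandas as pd

REQUIRED = ['Provider Id', 'Provider State', 'DRG Definition',
            'Total Discharges', 'Average Length of Stay',
            'Average Total Payments']


def missing_columns(frame, required=REQUIRED):
    """Return the required column names that are absent from `frame`."""
    return [c for c in required if c not in frame.columns]


# Tiny stand-in for the real CSV (values are invented); note the stray
# spaces around two of the headers.
sample_csv = io.StringIO(
    "Provider Id,Provider State,DRG Definition, Total Discharges ,"
    "Average Length of Stay, Average Total Payments \n"
    "330154,NY,207 - RESPIRATORY SYSTEM DIAGNOSIS,68,4.9,15000.0\n"
)
sample = pd.read_csv(sample_csv)

before = missing_columns(sample)
sample.columns = sample.columns.str.strip()   # normalise header whitespace
after = missing_columns(sample)

print("Missing before strip:", before)
print("Missing after strip :", after)
```

Running the same check on `df` right after `pd.read_csv` gives an early, readable error instead of a `KeyError` several steps later.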

3. Minimal cleaning

Provider IDs and DRG descriptions are categorical with no inherent ordering; one‑hot encoding (applied in step 5) gives each category its own intercept shift.

# keep rows with all essential columns
core = ['Average Total Payments', 'Provider Id', 'Provider State',
        'DRG Definition', 'Total Discharges', 'Average Length of Stay']
df = df.dropna(subset=core).copy()

# rename for convenience
df = df.rename(columns={
        'Average Total Payments': 'TotalPayUSD',
        'Provider Id':           'ProviderID',
        'Provider State':        'State',
        'DRG Definition':        'DRG'
})
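One cleaning step that may also be needed: some exports of the CMS file store payment columns as text such as "$5,777.24" rather than as numbers. Whether your copy does is an assumption to check with `df.dtypes`. A minimal sketch of a converter follows; `to_number` is a hypothetical helper and the sample values are invented.

```python
import pandas as pd


def to_number(series):
    """Convert text like '$5,777.24' to float; unparseable values become NaN."""
    cleaned = (series.astype(str)
                     .str.replace(r'[$,]', '', regex=True)
                     .str.strip())
    return pd.to_numeric(cleaned, errors='coerce')


raw = pd.Series(['$5,777.24', '4976.71', ' $12,000 ', 'n/a'])
clean = to_number(raw)
print(clean)
```

If the conversion is needed, it would have to run before the `dropna` call so that values coerced to NaN are dropped together with the other incomplete rows.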

4. Separate numeric & categorical features

num_cols = ['Total Discharges', 'Average Length of Stay']
cat_cols = ['ProviderID', 'State', 'DRG']
target   = 'TotalPayUSD'

X = df[num_cols + cat_cols]
y = df[target]

5. Pre‑processing & Ridge pipeline

Cross‑validated Ridge picks, from the candidate list below, the α that best balances bias (under‑fit) and variance (over‑fit) across five folds, so no separate grid‑search code is needed.

pre = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

alphas = [0.1, 1, 10, 50, 100, 250]   # candidate L2 strengths
ridge  = RidgeCV(alphas=alphas, cv=5)

pipe = Pipeline([
        ('prep',  pre),
        ('model', ridge)
])

6. Train‑test split & fitting

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
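Note that a random row-level split places rows from the same hospital in both the training and the test set. If the aim is to measure performance on hospitals the model has never seen, one option is a grouped split. The sketch below uses scikit-learn's `GroupShuffleSplit` on hypothetical provider IDs; it is an alternative to the split above, not what the article's code does.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical provider IDs: six hospitals with several DRG rows each.
provider_ids = np.array([101, 101, 101, 202, 202, 303,
                         303, 303, 404, 505, 505, 606])
row_index = np.arange(len(provider_ids))

splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(splitter.split(row_index, groups=provider_ids))

train_hospitals = set(provider_ids[train_idx].tolist())
test_hospitals = set(provider_ids[test_idx].tolist())

print("Train hospitals:", sorted(train_hospitals))
print("Test hospitals :", sorted(test_hospitals))
```

Applied to the article's data, the groups would be `df['ProviderID']`, and the resulting indices would select rows via `X.iloc[train_idx]` and `X.iloc[test_idx]`. For such held-out hospitals the ProviderID dummies are all zero, so predictions rest on state, DRG and the numeric features only.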

7. Performance on the hold-out set

pred = pipe.predict(X_test)
print(f"Chosen α (L2)      : {pipe.named_steps['model'].alpha_}")
print(f"R² on hold‑out set : {r2_score(y_test, pred):.3f}")
print(f"MAE on hold‑out    : ${mean_absolute_error(y_test, pred):,.0f}")
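The chosen α is only informative if it does not sit at the edge of the candidate list; an edge value suggests the grid should be widened. A small sketch of that check on synthetic data follows. The log-spaced grid and the variable names are assumptions for illustration, not the article's settings.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

# Synthetic regression problem (not the CMS data).
X_demo = rng.normal(size=(120, 8))
true_coef = np.array([5.0, -3.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=2.0, size=120)

alpha_grid = np.logspace(-3, 3, 13)   # 0.001 ... 1000, log-spaced
search = RidgeCV(alphas=alpha_grid, cv=5).fit(X_demo, y_demo)

# True when the best alpha is the smallest or largest candidate.
at_edge = bool(search.alpha_ in (alpha_grid[0], alpha_grid[-1]))
print(f"chosen alpha = {search.alpha_:.4g}; at grid edge: {at_edge}")
```

For the article's pipeline the same test would compare `pipe.named_steps['model'].alpha_` with the first and last entries of `alphas`.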

8. Which factors push costs?

Coefficients remain dollar figures. For example, a weight of +$3 950 on DRG_870‑Septicemia w MCC would quantify that DRG’s extra cost, while −$2 200 on State_OK would indicate that payments in Oklahoma sit below the model’s intercept, other features held equal.

# reconstruct full feature list
ohe   = pipe.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])

coefs = pd.Series(pipe.named_steps['model'].coef_,
                  index=feature_names).sort_values()

print("\nHighest positive coefficients:")
print(coefs.tail(10))
print("\nMost negative coefficients:")
print(coefs.head(10))

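Because numerics were z‑scaled, each numeric coefficient is the USD change for a one‑σ increase in that metric. The one‑hot encoder drops no category, so there is no reference level: each categorical coefficient is a dollar shift relative to the model intercept, shrunk towards zero by the L2 penalty.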
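A per-σ coefficient can be turned back into dollars per raw unit by dividing it by the scaler's fitted standard deviation. The sketch below shows the arithmetic on synthetic data in which each extra day is constructed to cost $500; the numbers and variable names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: payment rises by $500 per day of stay, plus noise.
los_days = rng.uniform(1, 30, size=200)
payment = 1000 + 500 * los_days + rng.normal(0, 50, 200)

demo = Pipeline([('scale', StandardScaler()), ('model', Ridge(alpha=1.0))])
demo.fit(los_days.reshape(-1, 1), payment)

usd_per_sigma = demo.named_steps['model'].coef_[0]   # USD per one-sigma change
sigma_days = demo.named_steps['scale'].scale_[0]     # fitted std dev in days
usd_per_day = usd_per_sigma / sigma_days             # USD per extra day

print(f"USD per sigma: {usd_per_sigma:,.0f}")
print(f"Sigma in days: {sigma_days:.2f}")
print(f"USD per day  : {usd_per_day:,.0f}")
```

In the article's pipeline the fitted scaler is reachable as `pipe.named_steps['prep'].named_transformers_['num']`, and its `scale_` array follows the order of `num_cols`.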

9. Persist for production use

joblib.dump(pipe, "ridge_treatment_cost_model.pkl")
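Before shipping the file, it is worth confirming that a reloaded model reproduces the original predictions. A minimal round-trip sketch follows, using a small synthetic Ridge model and a temporary directory; the file name is a placeholder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Small synthetic model standing in for the full pipeline.
X_small = rng.normal(size=(50, 3))
y_small = X_small @ np.array([1.0, 2.0, 3.0])
model = Ridge(alpha=1.0).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")   # placeholder file name
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded estimator should give identical predictions.
same = bool(np.allclose(model.predict(X_small), reloaded.predict(X_small)))
print("Predictions match after reload:", same)
```

Pickled scikit-learn estimators are not guaranteed to load across library versions, so the loading environment should pin the same scikit-learn version used for training.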

Summary

With a compact Ridge‑regression pipeline, we:

Converted publicly available CMS charge sheets into a predictive costing tool;
Achieved a typical error (MAE) in the low thousands of dollars—good enough for budgeting before the patient is discharged;
Retained complete transparency—every DRG, hospital or statewide factor shows its exact dollar effect;
Produced a single .pkl file that any hospital’s admission system can load to predict costs in real‑time.

Use this Ridge baseline as your sanity check; every fancy gradient‑boosted, multi‑task, or deep‑tabular model must beat its MAE and still explain costs as clearly to clinicians and CFOs.
