Medical Treatment Cost Prediction with Ridge Regression in ML
Hospitals in the United States send payers (insurers, Medicare, patients) an itemised bill at the end of every inpatient stay. Knowing before a patient is discharged how big that bill is likely to be lets finance teams:
- flag cases that may breach a DRG outlier threshold,
- negotiate bundled‑payment exceptions early, and
- give patients realistic price estimates that satisfy transparency rules.
Using the public “Hospital Inpatient Charges” data released by CMS (Centers for Medicare & Medicaid Services), we will train a Ridge‑regression model that predicts a stay’s average total payment (USD) from information already available on admission:
| Predictor family | Examples in the dataset |
| --- | --- |
| Hospital profile | Provider ID, State, Urban / Rural indicator |
| Clinical DRG | DRG code and short description |
| Case mix | Number of discharges, Mean length of stay |
| Payer mix | Medicare‑covered % (implicit in CMS data) |
Ridge regression (linear + L2 penalty) keeps the model transparent—every coefficient is a dollar amount—while damping unstable weights that often appear when DRG codes and hospital dummies overlap.
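To make the "damping" effect concrete, here is a minimal sketch of Ridge in closed form on synthetic data with two nearly identical predictors. The helper `ridge_closed_form` and all numbers are illustrative assumptions, not part of the CMS workflow; the intercept is omitted for brevity.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data: x2 is almost a copy of x1, mimicking overlapping dummies.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=200)

w_small = ridge_closed_form(X, y, alpha=1e-8)   # almost ordinary least squares
w_ridge = ridge_closed_form(X, y, alpha=10.0)   # noticeable L2 penalty

print("alpha≈0 :", w_small)
print("alpha=10:", w_ridge)
```

With almost no penalty the two weights can swing far apart to fit noise; with the penalty they settle on sharing the effect roughly equally, which is the behaviour the tutorial relies on for correlated hospital and DRG indicators.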
Libraries Required
- pandas # table wrangling
- numpy # numerical helpers
- matplotlib.pyplot # optional diagnostic plots
- scikit‑learn # preprocessing, RidgeCV, metrics
- joblib # save & reload the finished pipeline
Dataset Link
Hospital Inpatient Discharges Dataset
Step by Step Code Implementation
1. Import Libraries
Hospital & DRG dummies correlate (some states never perform certain DRGs). Ridge’s L2 penalty shrinks unstable coefficients, giving better generalisation and cleaner stories for finance teams.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the CMS dataset
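Total Discharges (~1–2 000) and Average Length of Stay (1–30 days) sit on very different scales. The L2 penalty acts on coefficient size, so an unscaled small‑range feature needs a larger coefficient and is shrunk harder; standardising both features lets Ridge penalise them evenly.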
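As a small illustration of what standardisation does to these two columns, the sketch below applies `StandardScaler` to a handful of made-up rows (the values are hypothetical, not taken from the CMS file).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: column 0 = total discharges, column 1 = mean length of stay (days)
X = np.array([[11, 2.1], [68, 4.9], [240, 6.3], [1500, 12.0]], dtype=float)

scaler = StandardScaler()
Z = scaler.fit_transform(X)   # subtract column mean, divide by column std

print("column means :", scaler.mean_)
print("column stds  :", scaler.scale_)
print("scaled means :", Z.mean(axis=0).round(6))
print("scaled stds  :", Z.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so the penalty no longer depends on the units each feature happened to be recorded in.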
# File downloaded from the dataset link above
df = pd.read_csv("Hospital_Inpatient_Discharges_Dataset.csv")
print(df.shape)
print(df.head())
Key raw fields (names in the 2016 file)
| Column | Sample value |
| --- | --- |
| Provider Id | 330154 |
| Provider State | NY |
| DRG Definition | “207 ‑ Respiratory system diagnosis w/o MCC” |
| Total Discharges | 68 |
| Average Length of Stay | 4.9 |
| Average Total Payments | ← target (USD) |
3. Minimal cleaning
Provider IDs and DRG descriptions are categorical with no inherent ordering; one‑hotting turns each into its intercept shift.
# keep rows with all essential columns
core = ['Average Total Payments', 'Provider Id', 'Provider State',
'DRG Definition', 'Total Discharges', 'Average Length of Stay']
df = df.dropna(subset=core).copy()
# rename for convenience
df = df.rename(columns={
'Average Total Payments': 'TotalPayUSD',
'Provider Id': 'ProviderID',
'Provider State': 'State',
'DRG Definition': 'DRG'
})
4. Separate numeric & categorical features
num_cols = ['Total Discharges', 'Average Length of Stay']
cat_cols = ['ProviderID', 'State', 'DRG']
target = 'TotalPayUSD'
X = df[num_cols + cat_cols]
y = df[target]
5. Pre‑processing & Ridge pipeline
Cross‑validated Ridge (RidgeCV) scores each candidate α with five‑fold cross‑validation and keeps the one that best balances bias (under‑fit) and variance (over‑fit), so no separate grid‑search code is needed.
pre = ColumnTransformer([
('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
('num', StandardScaler(), num_cols)
])
alphas = [0.1, 1, 10, 50, 100, 250] # candidate L2 strengths
ridge = RidgeCV(alphas=alphas, cv=5)
pipe = Pipeline([
('prep', pre),
('model', ridge)
])
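The α selection can be tried in isolation on synthetic data. The sketch below fits `RidgeCV` with the same candidate list and then spells out roughly the same search by hand with `cross_val_score`; the data and the true weights are made up, and the manual loop is only an approximation of what `RidgeCV` does internally.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (assumed weights, unit-variance noise)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
true_w = np.array([4.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_w + rng.normal(scale=1.0, size=300)

alphas = [0.1, 1, 10, 50, 100, 250]

# One call: RidgeCV tries every alpha with 5-fold CV and refits with the best one.
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("RidgeCV picked alpha =", model.alpha_)

# The same idea written out: mean 5-fold R² for each candidate alpha.
manual_scores = {a: cross_val_score(Ridge(alpha=a), X, y, cv=5).mean() for a in alphas}
best_manual = max(manual_scores, key=manual_scores.get)
print("Manual loop picked alpha =", best_manual)
```

If the chosen α sits at either end of the candidate list, it is usually worth extending the list in that direction and refitting.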
6. Train‑test split & fitting
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
7. Performance on the hold‑out set
pred = pipe.predict(X_test)
print(f"Chosen α (L2) : {pipe.named_steps['model'].alpha_}")
print(f"R² on hold‑out set : {r2_score(y_test, pred):.3f}")
print(f"MAE on hold‑out : ${mean_absolute_error(y_test, pred):,.0f}")
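Note that the random split above holds out rows, so the same ProviderID can appear in both the training and the test set. If the goal is to estimate accuracy for hospitals the model has never seen, one option is a group‑aware split. The sketch below uses `GroupShuffleSplit` on a toy frame; the data are invented, and swapping this in for `train_test_split` is a suggestion rather than part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: five hospitals, two DRG rows each (hypothetical values)
df = pd.DataFrame({
    "ProviderID": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "DRG": ["207", "870"] * 5,
    "TotalPayUSD": np.linspace(5000, 14000, 10),
})

# Hold out about 20 % of the hospitals, not 20 % of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["ProviderID"]))

train_ids = set(df.iloc[train_idx]["ProviderID"])
test_ids = set(df.iloc[test_idx]["ProviderID"])
print("train hospitals:", sorted(train_ids))
print("test hospitals :", sorted(test_ids))
print("overlap        :", train_ids & test_ids)
```

With such a split every test hospital's one‑hot ProviderID column is all zeros at prediction time, so the score reflects how well State, DRG, and the numeric features carry over to a new provider.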
8. Which factors push costs?
Coefficients remain dollar figures. For illustration, a weight of +$3 950 on DRG_870‑Septicemia w MCC would quantify that DRG's extra cost, while −$2 200 on State_OK would indicate that Oklahoma stays are paid less than the model's baseline (the intercept).
# reconstruct full feature list
ohe = pipe.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])
coefs = pd.Series(pipe.named_steps['model'].coef_,
index=feature_names).sort_values()
print("\nHighest positive coefficients:")
print(coefs.tail(10))
print("\nMost negative coefficients:")
print(coefs.head(10))
Because the numeric features were z‑scaled, each numeric coefficient is the USD change for a one‑σ increase in that metric. The categorical flags are dollar shifts relative to the model intercept: OneHotEncoder keeps every category here, so no single reference category is dropped.
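A per‑σ coefficient can be converted back to natural units by dividing by the standard deviation the scaler learned. The sketch below shows this on synthetic data in which each extra day is assumed to add about $900; the numbers are invented purely to check the arithmetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: payment rises by roughly $900 per extra day of stay (assumption)
rng = np.random.default_rng(0)
los = rng.uniform(1, 30, size=500)                       # mean length of stay, days
pay = 2000 + 900 * los + rng.normal(scale=300, size=500)

demo = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
demo.fit(los.reshape(-1, 1), pay)

per_sigma = demo.named_steps["ridge"].coef_[0]           # USD per one-σ increase
sigma = demo.named_steps["standardscaler"].scale_[0]     # σ of length of stay, in days
per_day = per_sigma / sigma                              # USD per extra day

print(f"USD per σ   : {per_sigma:,.0f}")
print(f"σ (days)    : {sigma:.2f}")
print(f"USD per day : {per_day:,.0f}")
```

In the tutorial's pipeline the same division would use `pipe.named_steps['prep'].named_transformers_['num'].scale_`, matched to the numeric coefficients in the order of `num_cols`.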
9. Persist for production use
joblib.dump(pipe, "ridge_treatment_cost_model.pkl")
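Reloading works the same way in reverse. The sketch below saves a toy pipeline to a temporary directory, loads it with `joblib.load`, and scores one new case; the training rows, the file name `toy_model.pkl`, and the new case are all hypothetical stand‑ins for the real model and data.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in for the fitted pipeline (invented rows and payments)
train = pd.DataFrame({
    "Total Discharges": [20, 68, 150, 40],
    "Average Length of Stay": [3.1, 4.9, 6.0, 2.5],
    "State": ["NY", "OK", "NY", "OK"],
    "DRG": ["207", "870", "870", "207"],
})
y = np.array([9000.0, 21000.0, 24000.0, 7000.0])

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["State", "DRG"]),
    ("num", StandardScaler(), ["Total Discharges", "Average Length of Stay"]),
])
toy_pipe = Pipeline([("prep", pre), ("model", Ridge(alpha=1.0))]).fit(train, y)

# Save and reload; the preprocessing travels with the model in one file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(toy_pipe, path)
    loaded = joblib.load(path)

# A new case must use the same column names as the training frame.
new_case = pd.DataFrame({
    "Total Discharges": [55],
    "Average Length of Stay": [5.2],
    "State": ["TX"],          # unseen state: encoded as zeros, no error
    "DRG": ["870"],
})
pred_loaded = loaded.predict(new_case)
pred_orig = toy_pipe.predict(new_case)
print(f"Predicted payment: ${pred_loaded[0]:,.0f}")
```

Pickled pipelines are generally only safe to load with the same scikit‑learn version that wrote them, so it helps to record that version next to the file.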
Summary
With a compact Ridge‑regression pipeline, we:
- Converted publicly available CMS charge sheets into a predictive costing tool;
- Achieved a typical error (MAE) in the low thousands of dollars—good enough for budgeting before the patient is discharged;
- Retained complete transparency—every DRG, hospital or statewide factor shows its exact dollar effect;
- Produced a single .pkl file that any hospital’s admission system can load to predict costs in real time.
Use this Ridge baseline as your sanity check; every fancy gradient‑boosted, multi‑task, or deep‑tabular model must beat its MAE and still explain costs as clearly to clinicians and CFOs.