Clinic Operation Cost Prediction with Ridge Regression in ML
FREE Online Courses: Enroll Now, Thank us Later!
Outpatient clinics record dozens of operational signals every day—scheduled visits, walk‑ins, hours worked by nurses and physicians, diagnostic tests performed, and square‑foot hours of exam‑room usage. Finance managers need a model that turns those signals into an up‑front estimate of daily operating cost (USD), so they can:
- Spot budget overruns before month‑end,
- Justify staffing changes on high‑load days, and
- Evaluate cost‑saving projects such as automated check‑in kiosks.
Because many workload metrics move together (for example, physician hours and diagnostic‑lab hours rise in tandem), we will use Ridge regression—a linear model with an L2 penalty—to stabilise coefficients while preserving a fully interpretable dollar‑per‑unit story.
Libraries Required
- pandas # tidy data handling
- numpy # numerical helpers
- matplotlib.pyplot # quick diagnostic plots (optional)
- scikit‑learn # preprocessing, RidgeCV, metrics
- joblib # persist the fitted pipeline
Dataset Link
Hospital Cost Dataset (cleaned)
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.preprocessing import OneHotEncoder, StandardScaler from sklearn.compose import ColumnTransformer from sklearn.pipeline import Pipeline from sklearn.linear_model import RidgeCV from sklearn.metrics import r2_score, mean_absolute_error import joblib
2. Load the dataset
df = pd.read_csv("clinic_operating_costs.csv") # rename after download
print(df.head())
3. Basic cleaning & selection
Brings numeric predictors (visits, staff hours, utilisation) to zero‑mean / unit‑variance so Ridge’s penalty treats them evenly.
# retain outpatient/clinic records only df = df[df['care_type'] == 'Clinic'].copy() # drop rows with missing cost target df = df.dropna(subset=['total_cost_usd'])
4. Feature groups
Converts weekday, month, location and speciality into binary flags without implying numerical order; dropping the first level avoids perfect collinearity
num_cols = ['visits_scheduled', # scheduled appointments
'visits_walkin', # unscheduled arrivals
'hours_physician',
'hours_nurse',
'hours_admin',
'lab_tests',
'room_util_pct'] # exam‑room utilisation %
cat_cols = ['weekday', # Mon … Sun
'month', # 1 … 12
'clinic_location', # urban / suburban / rural
'specialty'] # family / derm / cardio …
target = 'total_cost_usd'
X = df[num_cols + cat_cols]
y = df[target]
5. Pre‑processing and Ridge pipeline
Runs a five‑fold cross‑validation over candidate α values, selecting the one that minimises validation error, thus automating bias–variance trade‑off.
preprocess = ColumnTransformer([
('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),
('num', StandardScaler(), num_cols)
])
ridge = RidgeCV(alphas=[0.1, 1, 10, 50, 100], cv=5)
model = Pipeline([
('prep', preprocess),
('ridge', ridge)
])
6. Train‑test split & model fitting
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True)
model.fit(X_train, y_train)
7. Evaluation
pred = model.predict(X_test)
print(f"Optimal α (L2) : {model.named_steps['ridge'].alpha_}")
print(f"Hold‑out R² : {r2_score(y_test, pred):.3f}")
print(f"Hold‑out MAE : ${mean_absolute_error(y_test, pred):,.0f}")
8. Cost‑driver inspection
Coefficients stay in dollars: a +$1,250 coefficient on hours_nurse (per σ) shows how sensitive cost is to nursing labour, while a −$900 weight on weekday_Sat quantifies Saturday savings.
# rebuild feature list
ohe = model.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])
coefs = pd.Series(model.named_steps['ridge'].coef_,
index=feature_names).sort_values()
print("\nGreatest cost reducers:")
print(coefs.head(8))
print("\nGreatest cost drivers:")
print(coefs.tail(8))
Numeric coefficients are measured in USD per one‑standard‑deviation increase; dummy‑variable coefficients are dollar shifts vs. the reference level.
9. Model persistence
Staff‑hour buckets are correlated (physician and nurse hours rise together). Ridge’s L2 term shrinks unstable coefficients, improving generalisation while maintaining a linear, CFO‑friendly model.
joblib.dump(model, "ridge_clinic_cost_model.pkl")
Summary
In roughly 120 lines of code, we turned a public cost dataset into an explainable clinic‑operation cost predictor:
- Early warning: finance can flag high‑cost days before payroll and purchase orders hit.
- Actionable insights: every staffing or workload knob has a quantified dollar impact.
- Robust baseline: any future tree ensemble or neural network must beat this Ridge model’s MAE and provide just as straightforward a cost narrative for healthcare executives.