Hospital Staffing Cost Prediction with Ridge Regression in ML
Every hospital must translate tomorrow’s patient census into a labour budget for unit nurses, respiratory therapists, clerks, porters, and technical staff. If the staffing budget overshoots, operating margin erodes; if it undershoots, quality and safety scores suffer.
We will create a Ridge‑regression model that predicts a unit’s daily staffing cost (USD) from routinely captured workload and mix‑of‑care indicators:
| Predictor family | Examples taken from the dataset |
| Activity volume | Adjusted patient days, ER visits |
| Hours worked | Nursing, clinical support, admin, tech services |
| Hospital profile | Teaching vs non‑teaching flag, urban vs rural |
| Calendar cues | Day of week, fiscal quarter |
Ridge keeps the relationship linear—every coefficient is immediately interpretable in dollars—while its L2 penalty damps the instability that arises when many hour buckets move in lock‑step.
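The stabilising effect of the L2 penalty can be illustrated with a small experiment. The sketch below is not part of the hospital dataset: it builds two made-up, nearly identical "hours" columns, refits ordinary least squares and Ridge on bootstrap resamples, and compares how much the coefficients move. All variable names, the noise levels, and `alpha=1.0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two synthetic predictors that move almost in lock-step (illustrative only).
nursing = rng.normal(size=n)
admin = nursing + 0.01 * rng.normal(size=n)
X = np.column_stack([nursing, admin])
y = 100 * nursing + 100 * admin + rng.normal(scale=10, size=n)


def coef_spread(model, n_boot=30):
    """Refit on bootstrap resamples; return the std of each coefficient."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    return np.std(coefs, axis=0)


ols_spread = coef_spread(LinearRegression())
ridge_spread = coef_spread(Ridge(alpha=1.0))

print("OLS coefficient std across resamples  :", ols_spread.round(2))
print("Ridge coefficient std across resamples:", ridge_spread.round(2))
```

On data like this the OLS coefficients typically vary far more between resamples than the Ridge ones, which is the instability the penalty is meant to damp.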
Libraries Required
- pandas # read/reshape CSV
- numpy # numeric helpers
- matplotlib.pyplot # optional quick plots
- scikit‑learn # preprocessing, RidgeCV, metrics
- joblib # save the finished pipeline
Dataset Link
Step by Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load and peek
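Load the CSV and look at the first few rows. The numeric columns are measured on different scales, so they are standardised in step 4: bringing the variables onto a comparable variance lets the Ridge penalty treat nursing hours the same as tech hours, rather than weighting by raw magnitude.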
DATA_PATH = Path("hospital_staffing.csv")
df = pd.read_csv(DATA_PATH)
print(df.head())
Typical columns (abridged)
| column | description |
| Cost_USD | target – total paid hours × wage |
| Adj_Patient_Days | workload denominator |
| Hours_Nursing | productive hours – registered nurses |
| Hours_ClinicalSupport | respiratory, radiology, lab, etc. |
| Hours_Admin | unit clerks, schedulers |
| Hours_Tech | biomedical, IT |
| Teaching_Hospital | 0 / 1 |
| Urban_Rural | Urban / Rural |
| DayOfWeek | Mon … Sun |
| Quarter | Q1 … Q4 |
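If the CSV is not at hand, the rest of the walkthrough can still be exercised on a stand-in frame. The sketch below generates purely synthetic rows with the column names from the table above. Every number in it (patient volumes, hours ratios, hourly wages, noise levels) is invented for illustration, and the helper name `make_synthetic_staffing` is hypothetical; results obtained on it say nothing about a real hospital.

```python
import numpy as np
import pandas as pd


def make_synthetic_staffing(n_rows=365, seed=0):
    """Return a made-up frame with the article's column names (not real data)."""
    rng = np.random.default_rng(seed)
    days = np.array(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

    # Invented workload and hours; the hours columns are deliberately correlated.
    patient_days = rng.normal(120, 15, n_rows).clip(min=40)
    nursing = 6.0 * patient_days + rng.normal(0, 20, n_rows)
    support = 2.0 * patient_days + rng.normal(0, 10, n_rows)
    admin = 0.5 * patient_days + rng.normal(0, 5, n_rows)
    tech = 0.3 * patient_days + rng.normal(0, 4, n_rows)

    # Invented hourly wages (USD) plus noise.
    cost = (45 * nursing + 35 * support + 25 * admin + 40 * tech
            + rng.normal(0, 500, n_rows))

    return pd.DataFrame({
        "Cost_USD": cost,
        "Adj_Patient_Days": patient_days,
        "Hours_Nursing": nursing,
        "Hours_ClinicalSupport": support,
        "Hours_Admin": admin,
        "Hours_Tech": tech,
        "Teaching_Hospital": rng.integers(0, 2, n_rows),
        "Urban_Rural": rng.choice(["Urban", "Rural"], n_rows),
        "DayOfWeek": days[np.arange(n_rows) % 7],
        "Quarter": rng.choice(["Q1", "Q2", "Q3", "Q4"], n_rows),
    })


demo_df = make_synthetic_staffing()
print(demo_df.head())
```

Assigning `df = make_synthetic_staffing()` in place of the `pd.read_csv` call lets the later steps run end to end for testing purposes.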
3. Clean‑up and feature lists
- Categorical flags (day of week, urban/rural) become binary columns; dropping the first level avoids the dummy‑variable trap.
- Once the model is fitted, the coefficients read directly as cost drivers. For example, a coefficient of +$4,500 on Hours_Nursing (per σ) would quantify how sensitive total cost is to extra RN time, while a coefficient of −$1,200 on DayOfWeek_Sat would point to the cost relief of a weekend elective-surgery slowdown.
# column roles
num_cols = ['Adj_Patient_Days', 'Hours_Nursing', 'Hours_ClinicalSupport',
            'Hours_Admin', 'Hours_Tech']
cat_cols = ['Teaching_Hospital', 'Urban_Rural', 'DayOfWeek', 'Quarter']
target = 'Cost_USD'
# drop rows missing the target or any predictor (Ridge cannot fit on NaN)
df = df.dropna(subset=num_cols + cat_cols + [target]).copy()
X = df[num_cols + cat_cols]
y = df[target]
4. Pre‑processing and Ridge model
RidgeCV fits the model at each candidate α and keeps the one with the best 5-fold cross-validated score, so the strength of the penalty is chosen by out-of-sample performance rather than by hand.
preprocess = ColumnTransformer([
    ('cats', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),
    ('nums', StandardScaler(), num_cols)
])
ridge = RidgeCV(alphas=[0.1, 1, 10, 50, 100], cv=5)
pipe = Pipeline([
    ('prep', preprocess),
    ('model', ridge)
])
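A five-value grid can miss the best penalty if the optimum lies outside it. The sketch below, run on synthetic data from `make_regression` rather than the hospital file, shows one way to search a wider log-spaced grid and to flag when the chosen α sits on the edge of the grid. The grid bounds and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression problem, for illustration only.
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=20.0, random_state=0)

# 13 candidate penalties from 0.001 to 1000, evenly spaced on a log scale.
alphas = np.logspace(-3, 3, 13)
search = RidgeCV(alphas=alphas, cv=5).fit(X_demo, y_demo)

# If the winner is the smallest or largest candidate, the grid may be too narrow.
at_edge = search.alpha_ in (alphas[0], alphas[-1])
print(f"Chosen alpha: {search.alpha_:g}  (on grid edge: {at_edge})")
```

When `at_edge` is true, extending the grid in that direction and refitting is a cheap check before trusting the selected value.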
5. Split and fit
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
pipe.fit(X_train, y_train)
6. Hold‑out performance
pred = pipe.predict(X_test)
print(f"Chosen α (L2 penalty) : {pipe.named_steps['model'].alpha_}")
print(f"R² on test set : {r2_score(y_test, pred):.3f}")
print(f"MAE on test set : ${mean_absolute_error(y_test, pred):,.0f}")
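An MAE figure is easier to judge next to a naive reference. As a sketch, the block below compares a Ridge model with a `DummyRegressor` that always predicts the training mean. It uses invented single-feature data (the hours range, wage, and noise are assumptions), so only the pattern of the comparison carries over, not the numbers.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Invented data: cost driven by one hours column plus noise.
hours = rng.normal(700, 60, size=(300, 1))
cost = 45 * hours[:, 0] + rng.normal(0, 500, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    hours, cost, test_size=0.2, random_state=42)

# Baseline: ignore the features and predict the mean training cost.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)

baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
ridge_mae = mean_absolute_error(y_te, model.predict(X_te))

print(f"Baseline MAE: ${baseline_mae:,.0f}")
print(f"Ridge MAE   : ${ridge_mae:,.0f}")
```

With the article's objects, the same comparison would fit the dummy model on `X_train, y_train` and score it on `X_test, y_test` alongside `pipe`.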
7. Cost drivers – coefficient view
# rebuild full feature list
ohe = pipe.named_steps['prep'].named_transformers_['cats']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])
coefs = (pd.Series(pipe.named_steps['model'].coef_, index=feature_names)
         .sort_values())
print("\nCost reducers (most negative):")
print(coefs.head(8))
print("\nCost adders (most positive):")
print(coefs.tail(8))
Because numeric variables were z‑scored, each numeric coefficient is the dollar change for a one‑standard‑deviation increase; each one‑hot flag is the dollar shift versus the reference category.
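A per-σ coefficient can be converted back to dollars per raw unit by dividing by the standard deviation that `StandardScaler` learned. The sketch below demonstrates this on invented data in which cost rises by exactly $50 per hour; the step names `scale` and `model` and all numbers are assumptions for the demo.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Invented data: cost = $50 per hour, plus a little noise.
hours = rng.normal(700, 60, size=(500, 1))
cost = 50 * hours[:, 0] + rng.normal(0, 5, size=500)

demo = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
]).fit(hours, cost)

per_sigma = demo.named_steps["model"].coef_[0]  # dollars per 1 std of hours
sigma = demo.named_steps["scale"].scale_[0]     # std of hours seen in training
per_hour = per_sigma / sigma                    # dollars per single hour

print(f"Per-sigma coefficient: ${per_sigma:,.0f}")
print(f"Per-hour equivalent  : ${per_hour:,.2f}")
```

In the article's pipeline the fitted scaler would be reached through `pipe.named_steps['prep'].named_transformers_['nums']`, whose `scale_` array follows the order of `num_cols`. The recovered per-hour value sits slightly below 50 because the Ridge penalty shrinks coefficients toward zero.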
8. Persist the pipeline
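Hours buckets track each other: if admin hours rise, nursing hours often do too, and under OLS the coefficients swing wildly. Ridge stabilises them without losing linear transparency. Saving the fitted pipeline keeps the preprocessing and the chosen α together with those coefficients.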
joblib.dump(pipe, "ridge_hospital_staff_cost.pkl")
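For completeness, here is a sketch of the round trip: dump a fitted pipeline, load it back, and confirm that it predicts identically. It uses a small throwaway pipeline and a temporary directory so it runs on its own; the file name `demo_pipeline.pkl` and the toy data are assumptions.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Toy data and a throwaway pipeline, for illustration only.
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=50)
fitted = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_pipeline.pkl"
    joblib.dump(fitted, path)      # write pipeline (scaler + model) to disk
    reloaded = joblib.load(path)   # read it back into memory

# The reloaded object carries its own preprocessing, so raw inputs go straight in.
same = np.allclose(fitted.predict(X_demo), reloaded.predict(X_demo))
print("Predictions identical after reload:", same)
```

For the article's file, `joblib.load("ridge_hospital_staff_cost.pkl")` would return the full pipeline, which expects a DataFrame with the same predictor columns used in training. Pickled scikit-learn objects are generally safest to load with the same library version that saved them.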
Summary
This Ridge‑regression pipeline turns routine staffing and workload metrics into an instant, line‑item cost forecast for tomorrow’s hospital shift:
- Operational pay‑off: finance can preview overruns before the schedule is locked.
- Actionable insights: every coefficient is a dollar figure tied to a staffing lever or workload flag.
- Benchmark model: any future random‑forest or gradient‑boosted machine must beat this MAE and remain explainable to the chief nursing officer.