Hospital Staffing Cost Prediction with Ridge Regression in ML

FREE Online Courses: Knowledge Awaits – Click for Free Access!

Every hospital must translate tomorrow’s patient census into a labour budget for unit nurses, respiratory therapists, clerks, porters, and technical staff. If the staffing budget overshoots, operating margin erodes; if it undershoots, quality and safety scores suffer.

We will create a Ridge‑regression model that predicts a unit’s daily staffing cost (USD) from routinely captured workload and mix‑of‑care indicators:

Predictor family	Examples taken from the dataset
Activity volume	Adjusted patient days, ER visits
Hours worked	Nursing, clinical support, admin, tech services
Hospital profile	Teaching vs non‑teaching flag, urban vs rural
Calendar cues	Day of week, fiscal quarter

Ridge keeps the relationship linear—every coefficient is immediately interpretable in dollars—while its L2 penalty damps the instability that arises when many hour buckets move in lock‑step.

Libraries Required

pandas # read/reshape CSV
numpy # numeric helpers
matplotlib.pyplot # optional quick plots
scikit‑learn # preprocessing, RidgeCV, metrics
joblib # save the finished pipeline

Dataset Link

Hospital Staffing

Step by Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load and peek

Bring variables on comparable variance so the Ridge penalty treats nursing hours the same as tech hours, rather than weighting by raw magnitude.

DATA_PATH = Path("hospital_staffing.csv")
df = pd.read_csv(DATA_PATH)
print(df.head())

Typical columns (abridged)

column	description
Cost_USD	target – total paid hours × wage
Adj_Patient_Days	workload denominator
Hours_Nursing	productive hours – registered nurses
Hours_ClinicalSupport	respiratory, radiology, lab, etc.
Hours_Admin	unit clerks, schedulers
Hours_Tech	biomedical, IT
Teaching_Hospital	0 / 1
Urban_Rural	Urban / Rural
DayOfWeek	Mon … Sun
Quarter	Q1 … Q4

3. Clean‑up and feature lists

Categorical flags (day of week, urban/rural) become binary columns; dropping the first level avoids the dummy‑variable trap.
A +$4,500 coefficient on Hours_Nursing (per σ) quantifies how sensitive total cost is to extra RN time, while a −$1 200 coefficient on DayOfWeek_Sat underscores the cost relief of a weekend elective‑surgery slowdown.

# drop incomplete rows
df = df.dropna(subset=['Cost_USD']).copy()

num_cols = ['Adj_Patient_Days', 'Hours_Nursing', 'Hours_ClinicalSupport',
            'Hours_Admin', 'Hours_Tech']
cat_cols = ['Teaching_Hospital', 'Urban_Rural', 'DayOfWeek', 'Quarter']
target   = 'Cost_USD'

X = df[num_cols + cat_cols]
y = df[target]

4. Pre‑processing and Ridge model

Cross‑validated Ridge picks the α value that minimises validation error, giving you the simplest model that still generalises well.

preprocess = ColumnTransformer([
        ('cats', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),
        ('nums', StandardScaler(),                                  num_cols)
])

ridge = RidgeCV(alphas=[0.1, 1, 10, 50, 100], cv=5)

pipe = Pipeline([
        ('prep', preprocess),
        ('model', ridge)
])

5. Fit/split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=42)

pipe.fit(X_train, y_train)

6. Hold‑out performance

pred = pipe.predict(X_test)
print(f"Chosen α (L2 penalty) : {pipe.named_steps['model'].alpha_}")
print(f"R² on test set        : {r2_score(y_test, pred):.3f}")
print(f"MAE on test set       : ${mean_absolute_error(y_test, pred):,.0f}")

7. Cost drivers – coefficient view

# rebuild full feature list
ohe = pipe.named_steps['prep'].named_transformers_['cats']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])

coefs = (pd.Series(pipe.named_steps['model'].coef_, index=feature_names)
         .sort_values())

print("\nCost reducers (most negative):")
print(coefs.head(8))

print("\nCost adders (most positive):")
print(coefs.tail(8))

Because numeric variables were z‑scored, each numeric coefficient is the dollar change for a one‑standard‑deviation increase; each one‑hot flag is the dollar shift versus the reference category.

8. Persist the pipeline

Hours buckets track each other—if admin hours rise, nursing often does too; OLS coefficients swing wildly. Ridge stabilises them without losing linear transparency.

joblib.dump(pipe, "ridge_hospital_staff_cost.pkl")

Summary

This Ridge‑regression pipeline turns routine staffing and workload metrics into an instant, line‑item cost forecast for tomorrow’s hospital shift:

Operational pay‑off: finance can preview overruns before the schedule is locked.
Actionable insights: every coefficient is a dollar figure tied to a staffing lever or workload flag.
Benchmark model: any future random‑forest or gradient‑boosted machine must beat this MAE and remain explainable to the chief nursing officer.

Your 15 seconds will encourage us to work even harder
Please share your happy experience on Google | Facebook