Hospital Staffing Cost Prediction with Ridge Regression in ML

FREE Online Courses: Knowledge Awaits – Click for Free Access!

Every hospital must translate tomorrow’s patient census into a labour budget for unit nurses, respiratory therapists, clerks, porters, and technical staff. If the staffing budget overshoots, operating margin erodes; if it undershoots, quality and safety scores suffer.

We will create a Ridge‑regression model that predicts a unit’s daily staffing cost (USD) from routinely captured workload and mix‑of‑care indicators:

Predictor family Examples taken from the dataset
Activity volume Adjusted patient days, ER visits
Hours worked Nursing, clinical support, admin, tech services
Hospital profile Teaching vs non‑teaching flag, urban vs rural
Calendar cues Day of week, fiscal quarter

Ridge keeps the relationship linear—every coefficient is immediately interpretable in dollars—while its L2 penalty damps the instability that arises when many hour buckets move in lock‑step.

Libraries Required

  • pandas # read/reshape CSV
  • numpy # numeric helpers
  • matplotlib.pyplot # optional quick plots
  • scikit‑learn # preprocessing, RidgeCV, metrics
  • joblib # save the finished pipeline

Dataset Link

Hospital Staffing

Step by Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load and peek

Bring variables on comparable variance so the Ridge penalty treats nursing hours the same as tech hours, rather than weighting by raw magnitude.

DATA_PATH = Path("hospital_staffing.csv")
df = pd.read_csv(DATA_PATH)
print(df.head())

Typical columns (abridged)

column description
Cost_USD target – total paid hours × wage
Adj_Patient_Days workload denominator
Hours_Nursing productive hours – registered nurses
Hours_ClinicalSupport respiratory, radiology, lab, etc.
Hours_Admin unit clerks, schedulers
Hours_Tech biomedical, IT
Teaching_Hospital 0 / 1
Urban_Rural Urban / Rural
DayOfWeek Mon … Sun
Quarter Q1 … Q4

3. Clean‑up and feature lists

  • Categorical flags (day of week, urban/rural) become binary columns; dropping the first level avoids the dummy‑variable trap.
  • A +$4,500 coefficient on Hours_Nursing (per σ) quantifies how sensitive total cost is to extra RN time, while a −$1 200 coefficient on DayOfWeek_Sat underscores the cost relief of a weekend elective‑surgery slowdown.
# drop incomplete rows
df = df.dropna(subset=['Cost_USD']).copy()

num_cols = ['Adj_Patient_Days', 'Hours_Nursing', 'Hours_ClinicalSupport',
            'Hours_Admin', 'Hours_Tech']
cat_cols = ['Teaching_Hospital', 'Urban_Rural', 'DayOfWeek', 'Quarter']
target   = 'Cost_USD'

X = df[num_cols + cat_cols]
y = df[target]

4. Pre‑processing and Ridge model

Cross‑validated Ridge picks the α value that minimises validation error, giving you the simplest model that still generalises well.

preprocess = ColumnTransformer([
        ('cats', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),
        ('nums', StandardScaler(),                                  num_cols)
])

ridge = RidgeCV(alphas=[0.1, 1, 10, 50, 100], cv=5)

pipe = Pipeline([
        ('prep', preprocess),
        ('model', ridge)
])

5. Fit/split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=42)

pipe.fit(X_train, y_train)

6. Hold‑out performance

pred = pipe.predict(X_test)
print(f"Chosen α (L2 penalty) : {pipe.named_steps['model'].alpha_}")
print(f"R² on test set        : {r2_score(y_test, pred):.3f}")
print(f"MAE on test set       : ${mean_absolute_error(y_test, pred):,.0f}")

7. Cost drivers – coefficient view

# rebuild full feature list
ohe = pipe.named_steps['prep'].named_transformers_['cats']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])

coefs = (pd.Series(pipe.named_steps['model'].coef_, index=feature_names)
         .sort_values())

print("\nCost reducers (most negative):")
print(coefs.head(8))

print("\nCost adders (most positive):")
print(coefs.tail(8))

Because numeric variables were z‑scored, each numeric coefficient is the dollar change for a one‑standard‑deviation increase; each one‑hot flag is the dollar shift versus the reference category.

8. Persist the pipeline

Hours buckets track each other—if admin hours rise, nursing often does too; OLS coefficients swing wildly. Ridge stabilises them without losing linear transparency.

joblib.dump(pipe, "ridge_hospital_staff_cost.pkl")

Summary

This Ridge‑regression pipeline turns routine staffing and workload metrics into an instant, line‑item cost forecast for tomorrow’s hospital shift:

  • Operational pay‑off: finance can preview overruns before the schedule is locked.
  • Actionable insights: every coefficient is a dollar figure tied to a staffing lever or workload flag.
  • Benchmark model: any future random‑forest or gradient‑boosted machine must beat this MAE and remain explainable to the chief nursing officer.

Your 15 seconds will encourage us to work even harder
Please share your happy experience on Google | Facebook

ProjectGurukul Team

ProjectGurukul Team specializes in creating project-based learning resources for programming, Java, Python, Android, AI, Webdevelopment and machine learning. Our mission is to help learners build practical skills through engaging, hands-on projects. We also offer free major and minor projects with source code for engineering students

Leave a Reply

Your email address will not be published. Required fields are marked *