Clinic Wait Time Prediction using Linear Regression in ML
Outpatient clinics, vaccination centres, and emergency rooms often get slammed with more walk‑ins than they can handle, leading to long, unpredictable queues. Knowing how many minutes a new patient is likely to wait before seeing a clinician helps front‑desk staff manage expectations, smooth the flow, and trigger surge staffing when necessary.
In this mini-project, we build a linear regression baseline that predicts a patient’s expected wait time (in minutes) from information available the instant they take a ticket—arrival timestamp, day of the week, hour of the day, triage level, patient age, and current queue length. A transparent model surfaces the first‑order drivers of delay and sets a factual benchmark before you graduate to queue‑simulation or gradient‑boosted trees.
Libraries Required
- pandas # tabular wrangling
- numpy # numeric helpers
- matplotlib.pyplot # quick scatterplots/sanity checks
- scikit‑learn # preprocessing, model, metrics
- joblib # save the trained pipeline
Dataset Link
Step-by-Step Code Implementation
Why linear regression? For a given queue length and staffing level, each new patient typically adds a roughly constant incremental delay to the overall wait time. A straight‑line model captures this first‑order relationship, is lightning‑fast to train, and produces coefficients that managers can act on immediately.
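Written out, the straight-line assumption looks roughly like the equation below. The symbols are illustrative and not taken from the dataset: $\hat{w}$ is the predicted wait in minutes, $q$ the queue length at arrival, $a$ the patient's age, and $x_j$ the remaining encoded inputs (triage level and calendar indicators).

```latex
\hat{w} \;=\; \beta_0 \;+\; \beta_q\, q \;+\; \beta_a\, a \;+\; \sum_{j} \gamma_j\, x_j
```

Under this form, $\beta_q$ is the "roughly constant incremental delay" per additional person in the queue, holding everything else fixed.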
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
Download the CSV from Kaggle and point to its path:
df = pd.read_csv("er_wait_times.csv") # file name in the dataset
print(df.head()) # peek at columns
Expected key columns
| Column | Description |
| --- | --- |
| arrival_time | timestamp when the patient registered |
| triage_level | integer acuity code (1 = highest priority → 5 = lowest) |
| age | patient age (years) |
| current_queue | number of patients already waiting at arrival |
| wait_minutes | label: actual minutes until the patient is first seen by a provider |
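If you do not have the CSV to hand, a synthetic stand-in with the same five columns lets you run the rest of the walkthrough end to end. This is only a sketch: the function name `make_synthetic_visits`, the distributions, and the relationship between queue length, triage level, and wait time are all invented for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd


def make_synthetic_visits(n_rows=500, seed=0):
    """Build a made-up visit log with the five expected columns."""
    rng = np.random.default_rng(seed)
    start = pd.Timestamp("2024-01-01 08:00")
    # Random arrival offsets (in minutes) spread over four weeks.
    offsets = rng.integers(0, 28 * 24 * 60, size=n_rows)
    arrival = start + pd.to_timedelta(offsets, unit="min")
    triage = rng.integers(1, 6, size=n_rows)   # acuity codes 1..5
    age = rng.integers(1, 95, size=n_rows)     # years
    queue = rng.poisson(6, size=n_rows)        # patients already waiting
    # Invented relationship: longer queues and lower acuity mean longer waits.
    wait = 5 + 4.0 * queue + 6.0 * (triage - 1) + rng.normal(0, 5, size=n_rows)
    return pd.DataFrame({
        "arrival_time": arrival,
        "triage_level": triage,
        "age": age,
        "current_queue": queue,
        "wait_minutes": np.clip(wait, 0, None).round(1),  # no negative waits
    })


demo_df = make_synthetic_visits()
print(demo_df.head())
```

Swap `demo_df` in for `df` to dry-run the later steps, then switch back to the real file before reading anything into the coefficients.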
3. Feature engineering
- Calendar one‑hots (hour, dayofweek) capture predictable surges—Monday mornings or lunchtime peaks—without demanding an explicit holiday calendar feed.
- Queue length is the single most powerful real‑time signal; outliers are clipped at the 99th percentile to prevent one bizarre day from skewing the fit.
# ----- 3.1 Time features -----
df['arrival_time'] = pd.to_datetime(df['arrival_time'])
df['hour'] = df['arrival_time'].dt.hour
df['dayofweek'] = df['arrival_time'].dt.dayofweek  # 0‑Mon … 6‑Sun
# ----- 3.2 Cap extreme queue counts (optional) -----
df['current_queue'] = df['current_queue'].clip(upper=df['current_queue'].quantile(0.99))
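To see what these transformations produce, here is a small worked example on three hand-made rows. The timestamps and queue counts are invented, and the capped values go into a separate `queue_capped` column (a name chosen for this sketch) so the before and after can be compared.

```python
import pandas as pd

# Three made-up arrivals: a Monday morning, a Saturday lunchtime, a Sunday night.
toy = pd.DataFrame({
    "arrival_time": ["2024-01-01 08:15", "2024-01-06 13:40", "2024-01-07 23:05"],
    "current_queue": [3, 5, 40],
})

toy["arrival_time"] = pd.to_datetime(toy["arrival_time"])
toy["hour"] = toy["arrival_time"].dt.hour            # 8, 13, 23
toy["dayofweek"] = toy["arrival_time"].dt.dayofweek  # 0 = Monday ... 6 = Sunday

# Cap the queue at its 99th percentile; only the extreme value is pulled down.
cap = toy["current_queue"].quantile(0.99)
toy["queue_capped"] = toy["current_queue"].clip(upper=cap)

print(toy)
```

With only three rows the 99th percentile sits just below the maximum, so the effect is tiny here; on a full log it trims the handful of most extreme days.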
4. Define predictors & label
Standard scaling puts the numeric predictors on a common scale (zero mean, unit variance), so each numeric coefficient reads as minutes of wait per one-standard-deviation change in that feature, which is handy for stakeholder slides.
num_cols = ['age', 'current_queue']
# hour and dayofweek are one‑hot encoded, as described in step 3
cat_cols = ['triage_level', 'hour', 'dayofweek']
target = 'wait_minutes'
# drop rows still missing critical data
df = df.dropna(subset=num_cols + cat_cols + [target])
X = df[num_cols + cat_cols]
y = df[target]
5. Pre‑processing & model pipeline
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline(steps=[
    ('prep', preprocess),
    ('model', linreg)
])
6. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
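A shuffled split is a reasonable default, but with time-stamped logs it can mix future visits into the training set. If you would rather test on the most recent visits, a chronological split is one alternative. The sketch below uses a ten-row toy log with invented wait times; the names `log`, `train_part`, and `test_part` are placeholders.

```python
import pandas as pd

# Toy log with made-up waits, shuffled to mimic an unsorted export.
log = pd.DataFrame({
    "arrival_time": pd.date_range("2024-01-01", periods=10, freq="D"),
    "wait_minutes": [12, 30, 18, 25, 40, 22, 15, 33, 28, 19],
}).sample(frac=1.0, random_state=0)

# Sort by arrival, then hold out the latest 20% of visits for testing.
log_sorted = log.sort_values("arrival_time")
cutoff = int(len(log_sorted) * 0.8)
train_part = log_sorted.iloc[:cutoff]
test_part = log_sorted.iloc[cutoff:]

print(train_part["arrival_time"].max(), "<", test_part["arrival_time"].min())
```

Scores from a chronological split are often a little worse than shuffled ones, but they are closer to how the model will be used: predicting waits for visits it has not yet seen.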
7. Evaluation
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.1f} minutes")
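An MAE figure is easier to judge next to a naive reference. One common choice is scikit-learn's `DummyRegressor`, which always predicts the training mean. The sketch below runs on invented data with a single queue-length feature, so the numbers it prints are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Invented data: wait grows with queue length, plus noise.
rng = np.random.default_rng(0)
queue = rng.poisson(6, size=400).astype(float)
wait = 5 + 4.0 * queue + rng.normal(0, 5, size=400)

X_demo = queue.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, wait, test_size=0.2, random_state=42)

# Naive reference: always predict the mean wait seen in training.
mean_model = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
line_model = LinearRegression().fit(X_tr, y_tr)

mae_dummy = mean_absolute_error(y_te, mean_model.predict(X_te))
mae_lin = mean_absolute_error(y_te, line_model.predict(X_te))

print(f"MAE, mean-only baseline : {mae_dummy:.1f} minutes")
print(f"MAE, linear regression  : {mae_lin:.1f} minutes")
```

If the regression does not clearly beat the mean-only baseline on your own data, the features are carrying little signal and the coefficients deserve skepticism.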
8. Inspect influential features
# get feature names post‑encoding
ohe_feats = (
    pipe.named_steps['prep']
    .named_transformers_['cat']
    .get_feature_names_out(cat_cols)
)
all_feats = list(ohe_feats) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_, index=all_feats).sort_values()
print("\nFast‑track factors (negative coefficients):")
print(coefs.head(8))
print("\nDelay drivers (positive coefficients):")
print(coefs.tail(8))
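Because the numeric columns pass through `StandardScaler`, their coefficients are in minutes per standard deviation, not minutes per unit. Dividing by the scaler's fitted `scale_` converts them back, for example to minutes per extra person in the queue. The sketch below shows the conversion on invented single-feature data where the true slope is set to 4 minutes per patient; `demo_pipe` is a stand-in, not the pipeline trained above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented data with a known slope of 4 minutes per queued patient.
rng = np.random.default_rng(1)
queue = rng.poisson(6, size=500).astype(float)
wait = 5 + 4.0 * queue + rng.normal(0, 3, size=500)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
demo_pipe.fit(queue.reshape(-1, 1), wait)

# Coefficient as fitted: minutes per one standard deviation of queue length.
coef_per_sd = demo_pipe.named_steps["model"].coef_[0]
# Standard deviation the scaler learned for the queue column.
queue_sd = demo_pipe.named_steps["scale"].scale_[0]
# Back-converted: minutes per one additional patient in the queue.
coef_per_patient = coef_per_sd / queue_sd

print(f"per SD      : {coef_per_sd:.2f} minutes")
print(f"per patient : {coef_per_patient:.2f} minutes")
```

In the full pipeline the same idea applies through `pipe.named_steps['prep'].named_transformers_['num'].scale_`, which holds one standard deviation per entry of `num_cols`.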
9. Persist the pipeline
Pipeline persistence (joblib.dump) freezes both preprocessing and regression weights; tomorrow’s web form can call joblib.load and issue a wait‑time estimate in milliseconds.
joblib.dump(pipe, "clinic_wait_time_linreg.pkl")
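The reload side might look like the sketch below. To stay self-contained it trains a small stand-in pipeline on invented data, saves it to a temporary directory, loads it back, and scores one hypothetical patient. The column split and the `new_patient` values are assumptions for illustration; in practice you would load the file written above and pass the same raw columns used in training.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in training data (invented) with the raw feature columns.
rng = np.random.default_rng(2)
n = 200
train = pd.DataFrame({
    "age": rng.integers(1, 95, size=n),
    "current_queue": rng.poisson(6, size=n),
    "hour": rng.integers(0, 24, size=n),
    "triage_level": rng.integers(1, 6, size=n),
    "dayofweek": rng.integers(0, 7, size=n),
})
wait = 5 + 4.0 * train["current_queue"] + rng.normal(0, 3, size=n)

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["triage_level", "hour", "dayofweek"]),
        ("num", StandardScaler(), ["age", "current_queue"]),
    ])),
    ("model", LinearRegression()),
]).fit(train, wait)

# Save and reload, as a web service would do at start-up.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clinic_wait_time_linreg.pkl")
    joblib.dump(demo_pipe, path)
    loaded = joblib.load(path)

# One hypothetical walk-in: the pipeline handles encoding and scaling itself.
new_patient = pd.DataFrame([{
    "age": 42, "current_queue": 9, "hour": 10,
    "triage_level": 3, "dayofweek": 0,
}])
estimate = float(loaded.predict(new_patient)[0])
print(f"Estimated wait: {estimate:.0f} minutes")
```

Only load pickle files you created yourself or otherwise trust, since unpickling can execute arbitrary code, and keep the scikit-learn version consistent between training and serving.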
Summary
With just 70 lines of Python, we transformed raw arrival logs into an explainable clinic wait-time predictor. The linear model delivers two wins:
- Actionable ETAs for front‑desk staff to manage patient expectations.
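- Transparent coefficients that highlight levers. As an illustration of the kind of reading they support: every extra person in the queue adds about 4 minutes (once the standardized coefficient is converted back to per-patient units), triage level 1 patients go straight to the front of the line, and Monday arrivals between 8 and 10 a.m. add roughly 7 minutes. The exact figures will depend on your data.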
Keep this interpretable baseline as your yardstick; when you explore queuing theory, simulation, or gradient‑boosted forests, you’ll know exactly how much real‑world accuracy the extra complexity buys—and whether it justifies the added operational overhead.