Employee Training Cost Prediction using Bayesian Regression in ML
HR directors and L&D managers need to forecast per-employee training cost before rolling out new programs, using early indicators such as prior experience (years), role complexity (level), number of planned training hours, department, and historical performance score. Training cost grows nonlinearly with training load (e.g., bulk-hour discounts at high volumes, specialized courses at premium rates) and carries uncertainty from course cancellations, schedule shifts, and variable trainer rates. A classic point-estimate model hides that uncertainty, risking budget overruns or under-utilisation. By applying Bayesian Regression, we derive both:
1. A point estimate of the expected training cost per employee.
2. A credible interval quantifying forecast uncertainty, which supports data-driven L&D budgeting and resource allocation.
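To make the two outputs concrete, here is a minimal sketch of how a point estimate and an interval are read off a set of posterior draws. The draws are simulated stand-ins (the mean of 1,200 USD and spread of 150 USD are illustrative assumptions, not values from the dataset), and the interval shown is an equal-tailed one computed from percentiles.

```python
import numpy as np

# Hypothetical stand-in for posterior predictive draws of one employee's
# training cost (USD); a fitted model would supply these samples instead.
rng = np.random.default_rng(0)
cost_draws = rng.normal(loc=1200.0, scale=150.0, size=4000)

# Point estimate: the mean of the draws.
point_estimate = cost_draws.mean()

# 94% equal-tailed credible interval: cut 3% from each tail.
lower, upper = np.percentile(cost_draws, [3, 97])

print(f"Expected cost: ${point_estimate:,.0f}")
print(f"94% interval: ${lower:,.0f} to ${upper:,.0f}")
```

A budget owner can plan around the point estimate while reserving contingency up to the upper bound.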
Libraries Required
import pandas as pd                 # data loading & manipulation
import numpy as np                  # numerical operations
import matplotlib.pyplot as plt     # plotting
import seaborn as sns               # visualization
import pymc3 as pm                  # Bayesian modeling
import arviz as az                  # posterior analysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
Dataset
Employee/HR Dataset (All in One)
Step-by-Step Code Implementation
Data Loading & Preprocessing
- We load YearsExperience, RoleLevel, TrainingHours, and PerformanceScore, one-hot encode Department, and use Training Cost per employee as the target.
- Numeric predictors are z‑scored for MCMC stability.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the employee dataset
df = pd.read_csv("data/EmployeeDataset.csv")

# Select features and target; drop rows with missing values
df = df[['YearsExperience', 'RoleLevel', 'TrainingHours',
         'Department', 'PerformanceScore', 'Training Cost']].dropna()

# One-hot encode the Department categorical variable
# (the dummy columns are appended after the numeric columns)
df = pd.get_dummies(df, columns=['Department'], drop_first=True)

# Define predictors X and target y; cast to float so the dummy
# columns do not turn X into an object or boolean array
X = df.drop(columns='Training Cost').values.astype(float)
y = df['Training Cost'].values.astype(float)  # USD per employee

# Split into train/test (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize numeric predictors for stable MCMC;
# fit the scaler on the training split only
scaler = StandardScaler().fit(X_train[:, :4])  # first 4 columns are numeric
X_train_s = X_train.copy()
X_train_s[:, :4] = scaler.transform(X_train[:, :4])
X_test_s = X_test.copy()
X_test_s[:, :4] = scaler.transform(X_test[:, :4])
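The z-scoring step can be illustrated by hand on a single column. This is a minimal sketch with made-up training-hours values (not taken from the dataset); it shows that the mean and standard deviation are learned from the training split only and then reused on the test split, which is what fitting the scaler on the training rows achieves.

```python
import numpy as np

# Hypothetical training-hours values for a train and a test split.
hours_train = np.array([10.0, 20.0, 30.0, 40.0])
hours_test = np.array([25.0, 50.0])

# Learn the mean and standard deviation from the training split only.
mu = hours_train.mean()
sd = hours_train.std()  # population std (ddof=0), as StandardScaler uses

# Apply the same transformation to both splits.
hours_train_z = (hours_train - mu) / sd
hours_test_z = (hours_test - mu) / sd

# Undo the scaling to recover the original units.
hours_test_back = hours_test_z * sd + mu

print(hours_train_z)
print(hours_test_z)
```

Reusing the training statistics keeps information about the test rows out of the preprocessing step.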
Define & Fit Bayesian Regression Model
Model Specification:
- α ∼ Normal(0, 1,000) is a broad intercept prior;
- βᵢ ∼ Normal(0, 100) for each standardized feature;
- σ ∼ HalfNormal(200) constrains residual noise.
Likelihood: Observed cost ∼ Normal(α + β·X_std, σ).
Inference: We draw 2,000 posterior samples per chain after 1,000 tuning (burn-in) steps, using target_accept=0.9 to lower the risk of divergent transitions during sampling.
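Before fitting, it can help to see what these priors imply about cost. The sketch below is a NumPy-only prior predictive simulation under the specification above; the number of features (4) and the example employee whose standardized features all equal +1 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws, n_features = 5000, 4  # n_features is an illustrative assumption

# Draw parameters from the priors stated in the model specification.
alpha = rng.normal(0.0, 1000.0, size=n_draws)
beta = rng.normal(0.0, 100.0, size=(n_draws, n_features))
sigma = np.abs(rng.normal(0.0, 200.0, size=n_draws))  # HalfNormal(200)

# A hypothetical employee whose standardized features are all +1.
x_std = np.ones(n_features)

# Simulate costs from the likelihood Normal(alpha + beta . x, sigma).
mu = alpha + beta @ x_std
cost_sim = rng.normal(mu, sigma)

low, high = np.percentile(cost_sim, [3, 97])
print(f"94% of prior-simulated costs lie between {low:,.0f} and {high:,.0f} USD")
print(f"Share of negative simulated costs: {(cost_sim < 0).mean():.0%}")
```

Because the intercept prior is centered at zero, about half of the simulated costs come out negative. With enough data the likelihood dominates, but centering α near the mean of the observed costs is one common adjustment.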
import pymc3 as pm

with pm.Model() as train_cost_model:
    # Priors for intercept α and coefficients β
    α = pm.Normal("α", mu=0, sigma=1e3)
    β = pm.Normal("β", mu=0, sigma=100, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=200)

    # Expected training cost: linear predictor
    μ = α + pm.math.dot(X_train_s, β)

    # Likelihood: observed Training Cost
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # Sample the posterior
    trace = pm.sample(
        draws=2000,          # posterior samples kept per chain
        tune=1000,           # tuning (burn-in) steps, discarded
        target_accept=0.9,
        return_inferencedata=True
    )
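After sampling, it is worth checking that the chains agree with each other. The sketch below implements the classic Gelman-Rubin statistic in plain NumPy on simulated chains, purely to show the idea; the helper name `basic_rhat` and the simulated chains are hypothetical. ArviZ's summary table reports an `r_hat` column based on a more robust variant, so treat this as an illustration rather than a replacement.

```python
import numpy as np

def basic_rhat(chains):
    """Classic Gelman-Rubin statistic for an array shaped (n_chains, n_draws)."""
    n_draws = chains.shape[1]
    # Average variance inside each chain.
    within = chains.var(axis=1, ddof=1).mean()
    # Variance of the chain means, scaled by the chain length.
    between = n_draws * chains.mean(axis=1).var(ddof=1)
    # Pooled variance estimate that mixes both sources.
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(2)

# Four simulated chains that explore the same distribution.
good = rng.normal(0.0, 1.0, size=(4, 2000))

# Same chains, but two of them are stuck around a different value.
bad = good + np.array([[0.0], [0.0], [3.0], [3.0]])

rhat_good = basic_rhat(good)
rhat_bad = basic_rhat(bad)
print(f"R-hat, agreeing chains:    {rhat_good:.3f}")
print(f"R-hat, disagreeing chains: {rhat_bad:.3f}")
```

Values close to 1 suggest the chains have mixed; values well above 1 suggest the sampler needs more tuning or a reparameterized model.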
Posterior Analysis & Point Predictions
- Posterior Predictive: Sampling Y_obs gives full predictive distributions for the training observations, from which point forecasts (posterior mean) and 94% Highest Posterior Density intervals can be read.
- Evaluation: Point predictions for held-out employees use the posterior means of α and β; the Mean Absolute Error (MAE) quantifies the average forecast error in USD.
import arviz as az
from sklearn.metrics import mean_absolute_error

# Summarize parameter posteriors (mean, sd, HDI, ESS, r_hat)
print(az.summary(trace, round_to=2))

# Posterior predictive sampling for the training observations
with train_cost_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Posterior means of α and β
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain", "draw"]).values

# Point predictions on the test set
y_pred = α_post + X_test_s.dot(β_post)

# Evaluate Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f} per employee")
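MAE only scores the point forecast. A complementary check is whether the predictive intervals contain the held-out costs about as often as their nominal level claims. The sketch below uses simulated stand-ins for the held-out costs and for the posterior predictive draws (all values are illustrative assumptions), and uses equal-tailed 94% intervals for simplicity.

```python
import numpy as np

rng = np.random.default_rng(3)
n_test, n_draws = 200, 2000

# Hypothetical held-out costs and posterior predictive draws for them,
# shaped (n_draws, n_test); both are simulated from the same process here.
true_mean = rng.uniform(800.0, 1600.0, size=n_test)
y_holdout = rng.normal(true_mean, 120.0)
pred_draws = rng.normal(true_mean, 120.0, size=(n_draws, n_test))

# Point forecast and its mean absolute error.
point_pred = pred_draws.mean(axis=0)
mae_sim = np.abs(y_holdout - point_pred).mean()

# 94% equal-tailed interval per employee, and the share of actual
# costs that fall inside it.
lower_b, upper_b = np.percentile(pred_draws, [3, 97], axis=0)
coverage = ((y_holdout >= lower_b) & (y_holdout <= upper_b)).mean()

print(f"MAE: ${mae_sim:.2f} per employee")
print(f"Empirical coverage of 94% intervals: {coverage:.1%}")
```

Coverage far below 94% would indicate overconfident intervals; coverage far above it would indicate intervals that are wider than they need to be.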
Visualise Predictions & Credible Intervals
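By varying one key driver (TrainingHours) while holding the others at their median values, we plot both the posterior mean cost and its credible band, showing how expected cost scales with hours and the uncertainty around that estimate. Because the model is linear in the standardized predictors, the posterior mean is a straight line in hours. The predictions on the grid are computed directly from the posterior draws of α, β, and σ.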
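import numpy as np
import matplotlib.pyplot as plt
import arviz as az

# Vary TrainingHours (column 2); hold other features at their training medians
hours_grid = np.linspace(X_train_s[:, 2].min(), X_train_s[:, 2].max(), 100)
grid = np.median(X_train_s, axis=0)[None, :].repeat(100, axis=0)
grid[:, 2] = hours_grid

# Flatten posterior draws across chains
α_s = trace.posterior["α"].values.reshape(-1)
β_s = trace.posterior["β"].values.reshape(-1, grid.shape[1])
σ_s = trace.posterior["σ"].values.reshape(-1)

# Keep a random subset of 1,000 posterior draws
rng = np.random.default_rng(42)
idx = rng.choice(α_s.size, size=1000, replace=False)

# Posterior predictive draws on the grid, shape (1000, 100)
μ_grid = α_s[idx, None] + β_s[idx] @ grid.T
preds = rng.normal(μ_grid, σ_s[idx, None])
mean_pred = preds.mean(axis=0)

# az.hdi reads a 3-D array as (chain, draw, point); result has shape (100, 2)
hpd = az.hdi(preds[None, ...], hdi_prob=0.94)

# Back-transform TrainingHours to original units
# (the scaler was fit on the first 4 columns only)
hours_orig = scaler.inverse_transform(grid[:, :4])[:, 2]

plt.figure(figsize=(8, 5))
plt.plot(hours_orig, mean_pred, label="Posterior mean")
plt.fill_between(hours_orig, hpd[:, 0], hpd[:, 1], alpha=0.3,
                 label="94% credible interval")
# X_test holds the unscaled features, so no inverse transform is needed
plt.scatter(
    X_test[:, 2].astype(float), y_test,
    color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Training Hours")
plt.ylabel("Training Cost (USD)")
plt.title("Bayesian Regression: Cost vs. Training Hours")
plt.legend()
plt.tight_layout()
plt.show()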
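The credible band in the plot is a highest-density interval: the narrowest range that holds 94% of the draws. The sketch below shows one simple way to compute such an interval from samples with NumPy and compares it with an equal-tailed interval on right-skewed draws. The helper name `hdi_from_samples` and the simulated costs are hypothetical; in the pipeline itself this job is done by ArviZ.

```python
import numpy as np

def hdi_from_samples(samples, prob=0.94):
    """Narrowest interval containing roughly `prob` of the samples."""
    sorted_s = np.sort(samples)
    n = sorted_s.size
    k = int(np.floor(prob * n))  # index steps the interval must span
    # Width of every candidate window that spans k steps.
    widths = sorted_s[k:] - sorted_s[: n - k]
    start = int(np.argmin(widths))
    return float(sorted_s[start]), float(sorted_s[start + k])

# Hypothetical right-skewed cost draws (USD): a base fee plus a long tail.
rng = np.random.default_rng(4)
skewed_costs = 500.0 + rng.exponential(scale=300.0, size=5000)

hdi_low, hdi_high = hdi_from_samples(skewed_costs, prob=0.94)
et_low, et_high = np.percentile(skewed_costs, [3, 97])

print(f"94% HDI:          {hdi_low:,.0f} to {hdi_high:,.0f}")
print(f"94% equal-tailed: {et_low:,.0f} to {et_high:,.0f}")
```

For symmetric distributions the two intervals nearly coincide; for skewed ones the highest-density interval shifts toward the bulk of the draws and is narrower.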
Summary
This Bayesian Regression pipeline for Employee Training Cost Prediction provides:
1. Point estimates of per-employee training expense from early program parameters, with accuracy checked by MAE on held-out employees.
2. Credible intervals quantifying uncertainty from schedule changes, cancellation risks, and trainer-rate variability.
3. Actionable insights: L&D managers can budget training programs with explicit cost projections and credible bounds, which helps optimise resource allocation and reduce financial risk.