Employee Training Cost Prediction using Bayesian Regression in ML

HR directors and L&D managers need to forecast the per‑employee training cost—before rolling out new programs—using early indicators such as prior experience (years), role complexity (level), number of planned training hours, department, and historical performance score. Training cost grows nonlinearly with training load (e.g., bulk‑hour discounts at high volumes, specialized courses at premium rates) and carries uncertainty from course cancellations, schedule shifts, and variable trainer rates. A classic point‑estimate model hides that uncertainty, risking budget overruns or under-utilisation. By applying Bayesian Regression, we derive both:

1. A point estimate of the expected training cost per employee.

2. A credible interval quantifying forecast uncertainty—enabling data‑driven L&D budgeting and resource allocation.

Libraries Required

import pandas as pd                              # data loading & manipulation  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # visualization  

import pymc3 as pm                               # Bayesian modeling (legacy PyMC3; releases from v4 on are imported as `pymc`)
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error  

Dataset

Employee/HR Dataset (All in One)

Step-by-Step Code Implementation

Data Loading & Preprocessing

  • We load YearsExperience, RoleLevel, TrainingHours, and PerformanceScore as numeric predictors, one-hot encode Department, and use Training Cost per employee as the target.
  • Numeric predictors are z‑scored for MCMC stability.
import pandas as pd

# Load the employee dataset
df = pd.read_csv("data/EmployeeDataset.csv")

# Select features and target; drop missing
df = df[['YearsExperience','RoleLevel','TrainingHours',
         'Department','PerformanceScore','Training Cost']].dropna()

# One‑hot encode the Department categorical variable
df = pd.get_dummies(df, columns=['Department'], drop_first=True)

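# Define predictors X and target y
# Cast to float: boolean dummy columns would otherwise turn X into an object array
X = df.drop(columns='Training Cost').to_numpy(dtype=float)
y = df['Training Cost'].to_numpy(dtype=float)  # USD per employee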

# Split into train/test (80% train / 20% test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize numeric predictors for stable MCMC
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train[:, :4])  # first 4 columns are numeric
X_train_s = X_train.copy()
X_train_s[:, :4] = scaler.transform(X_train[:, :4])
X_test_s  = X_test.copy()
X_test_s[:, :4]  = scaler.transform(X_test[:, :4])
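To make the standardization step concrete, here is a minimal NumPy sketch of what the scaler does, assuming the numeric columns come first and the dummy columns follow. The function name `zscore_with_train_stats` and the toy arrays are illustrative only and are not part of the pipeline above.

```python
import numpy as np

def zscore_with_train_stats(train, test, n_numeric):
    """Z-score the first `n_numeric` columns using training-set statistics only."""
    train_s = np.array(train, dtype=float)    # np.array copies, so inputs stay untouched
    test_s = np.array(test, dtype=float)
    mean = train_s[:, :n_numeric].mean(axis=0)
    std = train_s[:, :n_numeric].std(axis=0)  # population std (ddof=0), as StandardScaler uses
    train_s[:, :n_numeric] = (train_s[:, :n_numeric] - mean) / std
    test_s[:, :n_numeric] = (test_s[:, :n_numeric] - mean) / std
    return train_s, test_s, mean, std

# Toy data: two numeric columns (years, hours) plus one 0/1 department dummy
toy_train = np.array([[1.0, 10.0, 0.0],
                      [3.0, 20.0, 1.0],
                      [5.0, 40.0, 0.0],
                      [7.0, 50.0, 1.0]])
toy_test = np.array([[4.0, 30.0, 1.0]])

toy_train_s, toy_test_s, toy_mean, toy_std = zscore_with_train_stats(
    toy_train, toy_test, n_numeric=2
)
print(toy_train_s)
print(toy_test_s)
```

The test rows are scaled with the training mean and standard deviation, so no information from the held-out set leaks into preprocessing, and the dummy column is left as 0/1.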

Define & Fit Bayesian Regression Model

Model Specification:

  • α ∼ Normal(0, 1 000) is a broad intercept prior;
  • βᵢ ∼ Normal(0, 100) for each feature (the standardized numeric predictors and the Department dummies);
  • σ ∼ HalfNormal(200) constrains residual noise.

Likelihood: Observed cost ∼ Normal(α + β·X_std, σ).
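As an intuition check for the specification above, the sketch below computes the posterior in closed form for a simplified version of the model. It assumes the noise scale σ is known (the tutorial's model instead gives σ a HalfNormal prior and samples it), and it uses synthetic data. The function name `conjugate_posterior` and all values in the demo are hypothetical.

```python
import numpy as np

def conjugate_posterior(design, y, sigma, prior_sd):
    """Posterior of w for y ~ Normal(design @ w, sigma) with w_j ~ Normal(0, prior_sd[j]).

    Assumes sigma is known, which makes the posterior Gaussian with a closed form.
    """
    prior_precision = np.diag(1.0 / np.asarray(prior_sd, dtype=float) ** 2)
    precision = design.T @ design / sigma**2 + prior_precision
    cov = np.linalg.inv(precision)
    mean = cov @ design.T @ y / sigma**2
    return mean, cov

# Synthetic data: intercept plus two standardized predictors
rng = np.random.default_rng(0)
n = 500
X_std = rng.normal(size=(n, 2))
design = np.column_stack([np.ones(n), X_std])   # first column carries the intercept α
true_w = np.array([1200.0, 150.0, -40.0])       # made-up α, β1, β2
sigma = 100.0
y_sim = design @ true_w + rng.normal(scale=sigma, size=n)

# Prior scales mirror the specification: 1000 for α, 100 for each β
post_mean, post_cov = conjugate_posterior(design, y_sim, sigma, prior_sd=[1000.0, 100.0, 100.0])
post_sd = np.sqrt(np.diag(post_cov))
print("posterior mean:", post_mean)
print("posterior sd:  ", post_sd)
```

With priors this broad and a few hundred rows, the posterior mean sits very close to the least-squares solution. Once σ is unknown, no such closed form is available for this prior, which is why sampling is used.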

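Inference: We draw 2,000 posterior samples per chain after 1,000 tuning (burn-in) steps, which are discarded. Setting target_accept=0.9 (above the default of 0.8) makes the NUTS sampler take smaller steps, which helps reduce divergent transitions.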

import pymc3 as pm

with pm.Model() as train_cost_model:
    # Priors for intercept α and coefficients β
    α = pm.Normal("α", mu=0, sigma=1e3)
    β = pm.Normal("β", mu=0, sigma=100, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=200)

    # Expected training cost linear predictor
    μ = α + pm.math.dot(X_train_s, β)

    # Likelihood: observed Training Cost
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # Sample the posterior
    trace = pm.sample(
        draws=2000,       # posterior draws kept per chain
        tune=1000,        # tuning (burn-in) steps per chain, discarded
        target_accept=0.9,
        return_inferencedata=True
    )
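Before trusting the trace, it is worth checking that the chains agree with each other. `az.summary` reports this in its `r_hat` column. The sketch below implements the classic (non-split) Gelman-Rubin statistic in plain NumPy to show the idea. ArviZ's default is a rank-normalized split variant, so its numbers will differ slightly. The function name and the simulated chains are illustrative assumptions.

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Classic (non-split) R-hat for one parameter, draws shaped (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()           # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)  # B: scaled variance of chain means
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(var_hat / within))

rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 2000))                           # four chains, same distribution
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])       # one chain sits elsewhere

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print(f"well-mixed chains: R-hat = {rhat_mixed:.3f}")
print(f"one stuck chain:   R-hat = {rhat_stuck:.3f}")
```

Values close to 1 indicate that between-chain and within-chain variance agree. Values clearly above 1 suggest the chains have not converged to the same distribution, so more tuning or a reparameterized model may be needed.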

Posterior Analysis & Point Predictions

  • Posterior Predictive: Sampling Y_obs gives full predictive distributions, from which we can take point forecasts (posterior mean) and 94% highest-density intervals.
  • Evaluation: Mean Absolute Error (MAE) on held‑out employees quantifies the average forecast error in USD.
import arviz as az
from sklearn.metrics import mean_absolute_error

# Summarize parameter posteriors
az.summary(trace, round_to=2)

# Posterior predictive sampling (draws for the training rows the model was fit on)
with train_cost_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Posterior means of α and β
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values

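# Point predictions on the test set. The mean is linear in α and β, so plugging in
# their posterior means gives the posterior mean of α + X·β.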
y_pred = α_post + X_test_s.dot(β_post)

# Evaluate Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f} per employee")
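MAE only scores the point forecast. To check whether the predictive intervals are also trustworthy, one option is to measure how often held-out costs fall inside them. The sketch below is a hedged illustration: it uses equal-tailed quantile intervals instead of HDIs for simplicity, and synthetic draws stand in for posterior predictive draws on the test set, which the code above does not generate. The function name `interval_coverage` is hypothetical.

```python
import numpy as np

def interval_coverage(y_true, pred_draws, prob=0.94):
    """Fraction of observations inside the central `prob` predictive interval.

    pred_draws has shape (n_draws, n_observations).
    """
    tail = (1.0 - prob) / 2.0
    lower = np.quantile(pred_draws, tail, axis=0)
    upper = np.quantile(pred_draws, 1.0 - tail, axis=0)
    inside = (y_true >= lower) & (y_true <= upper)
    return float(inside.mean())

# Synthetic stand-in: predictive draws and outcomes share the same mean and noise scale
rng = np.random.default_rng(3)
n_draws, n_obs = 1000, 1000
true_mean = rng.uniform(800.0, 2000.0, size=n_obs)
noise_sd = 150.0
demo_draws = true_mean + rng.normal(scale=noise_sd, size=(n_draws, n_obs))
demo_y = true_mean + rng.normal(scale=noise_sd, size=n_obs)

coverage = interval_coverage(demo_y, demo_draws, prob=0.94)
print(f"Empirical coverage of the 94% interval: {coverage:.1%}")
```

For a well-calibrated model the empirical coverage should land near the nominal 94%. Much lower coverage means the intervals are too narrow for budgeting purposes.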

Visualise Predictions & Credible Intervals

By varying one key driver (TrainingHours) while holding the others at their median values, we plot both the posterior mean cost curve and its 94% predictive band. This shows how expected cost scales with hours and how much uncertainty surrounds that estimate. Because the model is linear in its predictors, the mean curve is a straight line.

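import numpy as np
import matplotlib.pyplot as plt
import arviz as az

# Vary TrainingHours (column 2); hold other features at their training medians
Xs = np.asarray(X_train_s, dtype=float)
hours_grid = np.linspace(Xs[:,2].min(), Xs[:,2].max(), 100)
grid = np.median(Xs, axis=0)[None,:].repeat(100, axis=0)
grid[:,2] = hours_grid

# The model was built on X_train_s, so sampling Y_obs would only reproduce
# predictions for the training rows. Predict on the grid from the posterior draws.
post    = trace.posterior
α_draws = post["α"].values.reshape(-1)                  # (n_draws,)
β_draws = post["β"].values.reshape(-1, grid.shape[1])   # (n_draws, n_features)
σ_draws = post["σ"].values.reshape(-1)                  # (n_draws,)

rng    = np.random.default_rng(42)
μ_grid = α_draws[:, None] + β_draws @ grid.T            # (n_draws, 100)
preds  = rng.normal(μ_grid, σ_draws[:, None])           # add observation noise

mean_pred = preds.mean(axis=0)
hpd       = az.hdi(preds[None, :, :], hdi_prob=0.94)    # input read as (chain, draw, grid point)

# Back-transform TrainingHours to its original units
hours_orig = scaler.inverse_transform(grid[:, :4])[:,2]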

plt.figure(figsize=(8,5))
plt.plot(hours_orig, mean_pred, label="Posterior mean")
plt.fill_between(hours_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    X_test[:,2].astype(float), y_test,  # unscaled TrainingHours; the scaler only covers 4 columns
    color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Training Hours")
plt.ylabel("Training Cost (USD)")
plt.title("Bayesian Regression: Cost vs. Training Hours")
plt.legend()
plt.tight_layout()
plt.show()

Summary

This Bayesian Regression pipeline for Employee Training Cost Prediction provides:

1. Point estimates of per-employee training expense from early program parameters, with accuracy measured by MAE on held-out employees.

2. Credible intervals quantifying uncertainty from schedule changes, cancellation risks, and trainer‐rate variability.

3. Actionable insights: L&D managers can budget training programs with explicit cost projections and confidence bounds—optimising resource allocation and minimising financial risk.
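One way to turn predictive draws into a budget figure is to sum the per-employee costs within each draw and then read off a quantile of the totals. The sketch below assumes an array of draws shaped (n_draws, n_employees). The function name `budget_from_draws`, the 90% level, and the simulated draws are hypothetical choices for illustration.

```python
import numpy as np

def budget_from_draws(cost_draws, quantile=0.9):
    """Summarize total cost from per-employee predictive draws.

    cost_draws has shape (n_draws, n_employees); each row is one joint draw.
    Returns (expected total, total at the requested quantile).
    """
    totals = np.asarray(cost_draws, dtype=float).sum(axis=1)  # total cost per draw
    return float(totals.mean()), float(np.quantile(totals, quantile))

# Simulated stand-in for predictive draws: 50 employees, about $1,200 each
rng = np.random.default_rng(7)
demo_draws = rng.normal(loc=1200.0, scale=200.0, size=(4000, 50))

expected_total, budget_p90 = budget_from_draws(demo_draws, quantile=0.9)
print(f"Expected total: ${expected_total:,.0f}")
print(f"Budget covering 90% of draws: ${budget_p90:,.0f}")
```

Summing within each draw matters with real posterior draws: parameter uncertainty is shared across employees, so their costs are correlated, and adding up per-employee interval bounds would misstate the spread of the total. The simulated draws here are independent and only demonstrate the mechanics.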

ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.
