Employee Training Cost Prediction with Ridge Regression in ML

Corporate learning teams are under growing pressure to prove the financial efficiency of every training programme. If a mandatory course balloons in cost while delivering modest skill gains, finance will pull the plug; if an up‑skilling bootcamp is cheap but later drives promotions, HR wants to scale it out.

Our goal is to build a Ridge‑regression model that predicts an employee’s training cost (USD) for the coming performance cycle from attributes already stored in the HRIS:

job level, department, and location
tenure and current annual salary
historical training‑hours tally
performance‑rating band
whether the employee is earmarked as “high potential”

Ridge keeps the relationship linear—every input has a dollar coefficient—while its L2 penalty prevents unstable weights when several HR variables overlap (e.g., seniority, salary, and job level).
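The stabilising effect of the L2 penalty can be seen on a small synthetic example. The sketch below is an illustration only: the two "HR" columns are randomly generated stand-ins (one is almost a copy of the other), and the alpha value of 10 is an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two nearly identical predictors, standing in for overlapping HR variables.
salary = rng.normal(size=n)
job_level = salary + rng.normal(scale=0.01, size=n)   # almost a copy of salary
X = np.column_stack([salary, job_level])

# Synthetic cost signal that really depends on salary only.
y = 300 * salary + rng.normal(scale=50, size=n)

ols = LinearRegression().fit(X, y)      # no penalty
ridge = Ridge(alpha=10.0).fit(X, y)     # L2 penalty, alpha chosen arbitrarily

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print("OLS coefficients  :", ols.coef_, "| norm:", round(ols_norm, 1))
print("Ridge coefficients:", ridge.coef_, "| norm:", round(ridge_norm, 1))
```

With near-duplicate columns, unpenalised least squares is free to split the effect between them in an erratic way, whereas Ridge tends to share it more evenly and never produces a larger coefficient norm than the unpenalised fit.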

Libraries Required

pandas # data wrangling
numpy # numeric helpers
matplotlib.pyplot # quick EDA (optional)
scikit‑learn # preprocessing, RidgeCV, metrics
joblib # save the trained model

Dataset Link

Employee/HR Dataset

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the Kaggle data

df = pd.read_csv("Employee.csv")        # file from the Kaggle zip
print(df[['Employee ID', 'Training Cost']].head())

The Employee/HR Dataset (All‑in‑One) includes columns such as:

field	sample values
Training Cost	1 350 USD (target)
Department	Sales / HR / IT
Job Level	Senior / Mid / Entry
Location	New York / Berlin …
Tenure_Yrs	3.4
Salary_USD	72 000
Training Hours	28
Performance Band	Excellent / Good / Needs Improvement
High_Potential	0 / 1
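Before modelling, it may be worth confirming that the loaded file actually contains the fields listed above. The following is a minimal sketch: `missing_columns` is a hypothetical helper written for this illustration, and the one-row `toy` frame is made-up data standing in for `df`.

```python
import pandas as pd

# Column names taken from the field list above.
EXPECTED = ['Training Cost', 'Department', 'Job Level', 'Location',
            'Tenure_Yrs', 'Salary_USD', 'Training Hours',
            'Performance Band', 'High_Potential']

def missing_columns(frame, expected=EXPECTED):
    """Return the expected column names that are absent from `frame`."""
    return [col for col in expected if col not in frame.columns]

# Made-up one-row frame; in the tutorial you would pass `df` instead.
toy = pd.DataFrame({'Training Cost': [1350],
                    'Department': ['Sales'],
                    'Salary_USD': [72000]})

gaps = missing_columns(toy)
print("Missing columns:", gaps)
```

An empty list means every expected field is present; otherwise the names returned are the ones to rename or source before continuing.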

3. Set up numeric and categorical feature groups

The numeric columns are standardised inside the pipeline built in step 4. Standardising puts salary (tens of thousands) and training hours (tens) on a comparable scale, so Ridge’s penalty treats them evenly.

num_cols = ['Tenure_Yrs', 'Salary_USD', 'Training Hours']
cat_cols = ['Department', 'Job Level', 'Location',
            'Performance Band', 'High_Potential']
target   = 'Training Cost'

X = df[num_cols + cat_cols]
y = df[target]
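To see why scaling matters for the penalty, the sketch below standardises two columns with very different magnitudes. The five rows are invented values used only for illustration; they are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented example rows: salary in USD, training hours per cycle.
toy = pd.DataFrame({
    'Salary_USD':     [48_000, 60_000, 72_000, 95_000, 120_000],
    'Training Hours': [12, 20, 28, 35, 40],
})

raw_std = toy.std(ddof=0)                      # spread before scaling
scaled = StandardScaler().fit_transform(toy)   # z-score each column
scaled_std = scaled.std(axis=0)                # spread after scaling

print("Std before scaling:")
print(raw_std)
print("Std after scaling :", scaled_std)
```

Before scaling, the salary column varies thousands of times more than the hours column, so an equal-sized penalty on both coefficients would bite very unevenly. After scaling, both columns have zero mean and unit standard deviation.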

4. Build the preprocessing + Ridge pipeline

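RidgeCV cross-validates over the listed α values and keeps the one with the best validation score, so the bias-variance trade-off is tuned without manual effort.
OneHotEncoder converts the categorical HR factors (department, job level, location, performance band, high-potential flag) into binary flags; dropping the first level avoids perfect multicollinearity and makes that level the reference category. Combining drop='first' with handle_unknown='ignore' requires scikit-learn 1.0 or later, and a category unseen during training is then encoded as all zeros, i.e. exactly like the reference level.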

preproc = ColumnTransformer([
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                                  num_cols)
])

ridge = RidgeCV(alphas=[0.1, 1, 10, 50, 100], cv=5)

model = Pipeline([
        ('prep',  preproc),
        ('ridge', ridge)
])
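If the five-value grid feels coarse, one option is a wider logarithmic grid plus a check that the selected α does not sit on the edge of the grid. The sketch below runs on randomly generated data; the grid bounds (0.01 to 1000) and the synthetic coefficients are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(42)

# Synthetic stand-in for the preprocessed feature matrix.
X = rng.normal(size=(120, 5))
true_w = np.array([400.0, 0.0, -150.0, 0.0, 60.0])
y = 1500 + X @ true_w + rng.normal(scale=100, size=120)

# 11 candidate penalties, evenly spaced on a log scale.
alphas = np.logspace(-2, 3, 11)

search = RidgeCV(alphas=alphas, cv=5).fit(X, y)
best_alpha = search.alpha_

# If the winner is the smallest or largest candidate, the grid may be too narrow.
at_edge = bool(np.isclose(best_alpha, alphas[0]) or np.isclose(best_alpha, alphas[-1]))

print(f"Selected alpha: {best_alpha:.3g}")
print("At edge of grid, consider widening:", at_edge)
```

When `at_edge` is true, extending the grid in that direction and refitting is a cheap way to make sure the search was not cut short.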

5. Train/test split and model fitting

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

model.fit(X_train, y_train)

6. Evaluate hold‑out performance

pred = model.predict(X_test)

print(f"Chosen α (L2)  : {model.named_steps['ridge'].alpha_}")
print(f"Test‑set R²    : {r2_score(y_test, pred):.3f}")
print(f"Test‑set MAE   : ${mean_absolute_error(y_test, pred):,.0f}")
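For readers who want to see what the two metrics measure, here is a small worked example that computes them by hand and compares the result with scikit-learn. The three cost values and predictions are invented for the illustration.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Invented actual and predicted training costs (USD).
y_true = np.array([1000.0, 1500.0, 2000.0])
y_pred = np.array([1100.0, 1400.0, 2100.0])

# MAE: average absolute miss, in dollars.
mae_manual = np.mean(np.abs(y_true - y_pred))

# R²: 1 minus (squared error of the model / squared error of predicting the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE by hand: ${mae_manual:,.0f} | sklearn: ${mean_absolute_error(y_true, y_pred):,.0f}")
print(f"R² by hand : {r2_manual:.3f} | sklearn: {r2_score(y_true, y_pred):.3f}")
```

Every prediction here misses by $100, so the MAE is $100; the residual sum of squares is 30,000 against a total of 500,000, which gives R² = 0.94.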

7. Understand the dollar drivers

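As an illustration of how to read the output (the figures here are hypothetical, not results from this dataset): a coefficient of +$420 on Training Hours would mean that a one‑standard‑deviation increase in hours raises predicted cost by about $420, and a weight of −$350 on Department_HR would mean that predicted training cost for HR employees is about $350 lower than for the reference department, other inputs held equal.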

# combine one‑hot names with numeric names
ohe = model.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])

coefs = pd.Series(model.named_steps['ridge'].coef_, index=feature_names)\
         .sort_values()

print("\nCost reducers (most negative weights):")
print(coefs.head(6))

print("\nCost adders (most positive weights):")
print(coefs.tail(6))

Because all numeric predictors were z‑scored, each numeric coefficient reads as USD change for a one‑standard‑deviation increase; one‑hot flags show the dollar offset relative to the reference category.
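A per-σ coefficient can be converted back to a per-unit figure by dividing it by the standard deviation that the scaler learned. The sketch below shows the arithmetic on synthetic, noise-free data in which cost rises by exactly $15 per training hour; the data, the step names, and the near-zero alpha are assumptions chosen so that the answer is easy to check.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 40 employees, cost = 200 + 15 * hours, no noise.
hours = np.arange(10, 50, dtype=float).reshape(-1, 1)
cost = 200 + 15 * hours.ravel()

toy = Pipeline([
    ('scale', StandardScaler()),
    ('ridge', Ridge(alpha=1e-6)),   # near-zero penalty so shrinkage is negligible
]).fit(hours, cost)

per_sigma = toy.named_steps['ridge'].coef_[0]   # USD per one σ of hours
sigma = toy.named_steps['scale'].scale_[0]      # σ of hours, in hours
per_hour = per_sigma / sigma                    # USD per single hour

print(f"per σ: ${per_sigma:,.2f} | σ = {sigma:.2f} h | per hour: ${per_hour:,.2f}")
```

In the tutorial's pipeline the same division would use `model.named_steps['prep'].named_transformers_['num'].scale_`, whose entries follow the order of `num_cols`.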

8. Persist the model for HR dashboards

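Saving the whole pipeline keeps the encoder, the scaler, and the Ridge weights together, so a dashboard can score raw HRIS rows with a single predict call. As a reminder of why Ridge suits this data: salary, tenure, and job level are partially collinear, and shrinking coefficients toward zero yields more stable, more generalisable estimates while keeping the model linear.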

joblib.dump(model, "ridge_employee_training_cost.pkl")
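On the dashboard side, the saved file is loaded once and then used to score new rows. The sketch below is self-contained, so it trains a deliberately tiny stand-in pipeline on six invented rows, writes it to a temporary file, and reloads it; the file name, the rows, and the reduced two-column feature set are all assumptions for the illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Six invented rows standing in for the real training data.
train = pd.DataFrame({
    'Department':    ['Sales', 'HR', 'IT', 'Sales', 'IT', 'HR'],
    'Salary_USD':    [72_000, 55_000, 90_000, 64_000, 105_000, 58_000],
    'Training Cost': [1_350, 900, 1_800, 1_200, 2_100, 950],
})

pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Department']),
        ('num', StandardScaler(), ['Salary_USD']),
    ])),
    ('ridge', Ridge(alpha=1.0)),
]).fit(train[['Department', 'Salary_USD']], train['Training Cost'])

# Save to a temporary location, then reload as a dashboard would.
path = os.path.join(tempfile.mkdtemp(), 'toy_training_cost.pkl')
joblib.dump(pipe, path)
loaded = joblib.load(path)

# Score one new (invented) employee with raw, unprocessed fields.
new_hire = pd.DataFrame({'Department': ['IT'], 'Salary_USD': [80_000]})
original_pred = pipe.predict(new_hire)
reloaded_pred = loaded.predict(new_hire)

print(f"Predicted training cost: ${reloaded_pred[0]:,.0f}")
```

Two practical cautions: pickled models should be loaded with the same scikit-learn version that saved them, and only from sources you trust, because unpickling can execute arbitrary code.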

Summary

This workflow turns raw HRIS tables into an explainable employee‑training cost forecaster:

Early‑stage budgeting: Finance can project next‑quarter L&D spend before enrolments close.
Transparent levers: every coefficient is a dollar impact tied to a policy knob (hours, salary band, high‑potential flag).
Sturdy baseline: any future gradient‑boosted or causal model must beat this Ridge MAE and remain interpretable for HR leadership.
