Employee Training Cost Prediction with Ridge Regression in ML
Corporate learning teams are under growing pressure to prove the financial efficiency of every training programme. If a mandatory course balloons in cost while delivering modest skill gains, finance will pull the plug; if an up‑skilling bootcamp is cheap but later drives promotions, HR wants to scale it out.
Our goal is to build a Ridge‑regression model that predicts an employee’s training cost (USD) for the coming performance cycle from attributes already stored in the HRIS:
- job level, department, and location
- tenure and current annual salary
- historical training‑hours tally
- performance‑rating band
- whether the employee is earmarked as “high potential”
Ridge keeps the relationship linear—every input has a dollar coefficient—while its L2 penalty prevents unstable weights when several HR variables overlap (e.g., seniority, salary, and job level).
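To make the stability claim concrete, here is a minimal sketch on synthetic data (not the HR dataset). It assumes two nearly identical standardised features, labelled `seniority` and `salary` for illustration, and an invented cost rule. It compares ordinary least squares with Ridge at an arbitrarily chosen α of 1.0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Synthetic, made-up data: salary is almost a copy of seniority.
seniority = rng.normal(size=n)
salary = seniority + rng.normal(scale=0.01, size=n)
X = np.column_stack([seniority, salary])

# Invented cost rule: cost depends on the shared signal, plus dollar noise.
y = 1000 + 300 * seniority + rng.normal(scale=50, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients  :", ols.coef_.round(1))
print("Ridge coefficients:", ridge.coef_.round(1))
```

With features this correlated, least squares can split the effect between the two columns almost arbitrarily, while the L2 penalty pulls both weights toward a similar, moderate value.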
Libraries Required
- pandas # data wrangling
- numpy # numeric helpers
- matplotlib.pyplot # quick EDA (optional)
- scikit‑learn # preprocessing, RidgeCV, metrics
- joblib # save the trained model
Dataset Link
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the Kaggle data
df = pd.read_csv("Employee.csv") # file from the Kaggle zip
print(df[['Employee ID', 'Training Cost']].head())
The Employee/HR Dataset (All‑in‑One) includes columns such as:
| Field | Sample values |
| --- | --- |
| Training Cost | 1,350 USD (target) |
| Department | Sales / HR / IT |
| Job Level | Senior / Mid / Entry |
| Location | New York / Berlin … |
| Tenure_Yrs | 3.4 |
| Salary_USD | 72,000 |
| Training Hours | 28 |
| Performance Band | Excellent / Good / Needs Improvement |
| High_Potential | 0 / 1 |
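If the CSV is not to hand, a stand-in frame with the same column names lets the later steps run end to end. The sketch below is an assumption-heavy placeholder: the helper name `make_synthetic_hr`, the value ranges, and the cost formula are all invented for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

def make_synthetic_hr(n_rows=500, seed=0):
    """Return a made-up DataFrame with the column names used in this article."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Employee ID": np.arange(1, n_rows + 1),
        "Department": rng.choice(["Sales", "HR", "IT"], size=n_rows),
        "Job Level": rng.choice(["Entry", "Mid", "Senior"], size=n_rows),
        "Location": rng.choice(["New York", "Berlin"], size=n_rows),
        "Tenure_Yrs": rng.uniform(0, 15, size=n_rows).round(1),
        "Salary_USD": rng.normal(72_000, 15_000, size=n_rows).round(-2),
        "Training Hours": rng.integers(5, 60, size=n_rows),
        "Performance Band": rng.choice(
            ["Excellent", "Good", "Needs Improvement"], size=n_rows),
        "High_Potential": rng.integers(0, 2, size=n_rows),
    })
    # Invented cost rule, purely so the target correlates with the inputs.
    df["Training Cost"] = (
        400
        + 30 * df["Training Hours"]
        + 0.005 * df["Salary_USD"]
        + 150 * df["High_Potential"]
        + rng.normal(0, 100, size=n_rows)
    ).round(0)
    return df

df_demo = make_synthetic_hr()
print(df_demo[["Employee ID", "Training Cost"]].head())
```

Any metrics obtained on this frame only reflect the invented formula, so they are useful for checking that the code runs, not for drawing HR conclusions.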
3. Set up numeric and categorical feature groups
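Numeric and categorical columns are listed separately because the pipeline treats them differently. The numeric ones are standardised with StandardScaler, which puts salary (tens of thousands) and training hours (tens) on a comparable scale so that Ridge’s penalty treats them evenly.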
num_cols = ['Tenure_Yrs', 'Salary_USD', 'Training Hours']
cat_cols = ['Department', 'Job Level', 'Location',
            'Performance Band', 'High_Potential']
target = 'Training Cost'
X = df[num_cols + cat_cols]
y = df[target]
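The effect of standardising can be seen on two made-up columns whose scales differ by orders of magnitude. This is a small sketch with invented values, not the dataset's figures.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Made-up values on very different scales.
salary = rng.normal(72_000, 15_000, size=1_000)   # tens of thousands
hours = rng.normal(28, 8, size=1_000)             # tens
X_num = np.column_stack([salary, hours])

print("Raw std    :", X_num.std(axis=0).round(1))

# After scaling, each column has mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X_num)
print("Scaled std :", X_scaled.std(axis=0).round(3))
print("Scaled mean:", X_scaled.mean(axis=0).round(3))
```

Without this step, the same penalty would bear much harder on the hours coefficient than on the salary coefficient, simply because a one-dollar change in salary is a far smaller step than a one-hour change in training.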
4. Build the preprocessing + Ridge pipeline
- RidgeCV searches the listed α values with 5-fold cross-validation and keeps the one with the best validation score, so the bias-variance trade-off is tuned without manual trial and error.
- OneHotEncoder converts categorical HR factors (department, job level, location, and so on) into binary flags; dropping the first level of each avoids perfect multicollinearity with the intercept.
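# Combining drop='first' with handle_unknown='ignore' needs scikit-learn >= 1.0;
# an unseen category is then encoded as all zeros, like the reference level.
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
ridge = RidgeCV(alphas=[0.1, 1, 10, 50, 100], cv=5)
model = Pipeline([
    ('prep', preproc),
    ('ridge', ridge)
])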
5. Train/test split and model fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
model.fit(X_train, y_train)
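Calling fit is also when the α search happens. As a rough illustration of what that search involves, the sketch below scores each candidate α by hand with 5-fold cross-validation on made-up data and prints the result next to RidgeCV's own choice. The data, the variable names, and the idea that the two selections should usually agree are assumptions of this sketch, not statements about the HR dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Made-up regression problem with two correlated features.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
X_demo = np.column_stack([x1, x2])
y_demo = 1000 + 200 * x1 + 100 * x2 + rng.normal(scale=80, size=n)

alphas = [0.1, 1, 10, 50, 100]

# Manual search: mean 5-fold R² for each candidate α.
mean_r2 = {a: cross_val_score(Ridge(alpha=a), X_demo, y_demo, cv=5).mean()
           for a in alphas}
best_manual = max(mean_r2, key=mean_r2.get)

# RidgeCV runs a comparable search internally when cv is set.
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_demo, y_demo)

for a, score in mean_r2.items():
    print(f"alpha={a:>5}: mean CV R² = {score:.4f}")
print("Manual best alpha :", best_manual)
print("RidgeCV alpha_    :", ridge_cv.alpha_)
```

If the chosen α sits at either end of the grid, it is usually worth widening the grid in that direction and refitting.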
6. Evaluate hold‑out performance
pred = model.predict(X_test)
print(f"Chosen α (L2) : {model.named_steps['ridge'].alpha_}")
print(f"Test‑set R² : {r2_score(y_test, pred):.3f}")
print(f"Test‑set MAE : ${mean_absolute_error(y_test, pred):,.0f}")
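For readers who want to see what the two metrics measure, the sketch below recomputes them by hand on four made-up cost values. The numbers are invented for the arithmetic only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted costs in USD.
y_true = np.array([1200.0, 1500.0, 900.0, 2000.0])
y_hat = np.array([1100.0, 1600.0, 1000.0, 1900.0])

# MAE: average absolute dollar error.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE: ${mae_manual:,.0f}")   # MAE: $100
print(f"R² : {r2_manual:.3f}")      # R² : 0.939

# The hand calculations match scikit-learn's functions.
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

MAE is in dollars and is the easier figure to discuss with finance; R² is unitless and shows how much of the spread in cost the model explains.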
7. Understand the dollar drivers
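The sorted coefficients show which inputs push predicted cost up or down. The figures here are illustrative, not results from the dataset: a +$420 coefficient on Training Hours would mean that a one-standard-deviation increase in hours raises predicted cost by about $420, and a −$350 weight on Department_HR would mean HR programmes tend to be cheaper than the reference department. A flag such as Department_HR only appears if HR is not itself the reference, which is the first category in sorted order and is dropped by the encoder.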
# combine one‑hot names with numeric names
ohe = model.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])
coefs = pd.Series(
    model.named_steps['ridge'].coef_, index=feature_names
).sort_values()
print("\nCost reducers (most negative weights):")
print(coefs.head(6))
print("\nCost adders (most positive weights):")
print(coefs.tail(6))
Because all numeric predictors were z‑scored, each numeric coefficient reads as USD change for a one‑standard‑deviation increase; one‑hot flags show the dollar offset relative to the reference category.
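A per-standard-deviation coefficient can be turned back into dollars per raw unit by dividing it by the standard deviation the scaler learned. The sketch below shows the arithmetic on made-up data in which each training hour costs $50 by construction; the step names `scale` and `ridge` and the fixed α of 1.0 are choices made for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Made-up data: each training hour truly costs $50.
hours = rng.normal(30, 8, size=500)
cost = 500 + 50 * hours + rng.normal(scale=20, size=500)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
]).fit(hours.reshape(-1, 1), cost)

coef_per_sigma = pipe.named_steps["ridge"].coef_[0]
sigma_hours = pipe.named_steps["scale"].scale_[0]

# Dividing by the feature's standard deviation gives dollars per raw unit.
coef_per_hour = coef_per_sigma / sigma_hours

print(f"USD per 1 SD of hours : {coef_per_sigma:,.0f}")
print(f"SD of hours           : {sigma_hours:.1f}")
print(f"USD per single hour   : {coef_per_hour:.1f}")
```

The recovered per-hour figure lands slightly below the true slope because Ridge shrinks coefficients; the gap grows with larger α.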
8. Persist the model for HR dashboards
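Save the whole pipeline rather than the Ridge estimator alone: the file then carries the encoder, the scaler, and the fitted coefficients, so a dashboard can pass raw HRIS rows straight to predict. The model itself stays linear; because salary, tenure, and job level are partially collinear, Ridge shrinks their coefficients toward zero, which yields more stable, generalisable estimates.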
joblib.dump(model, "ridge_employee_training_cost.pkl")
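On the consuming side, a dashboard would reload the file and call predict on new rows. The sketch below is a self-contained round trip on a tiny made-up frame with only two of the article's feature columns; the file name `ridge_demo.pkl`, the temporary folder, and the fixed α are assumptions made so the example runs without the real model.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up training frame using a subset of the article's column names.
train = pd.DataFrame({
    "Department": ["Sales", "HR", "IT", "HR", "Sales", "IT"],
    "Training Hours": [20, 10, 40, 15, 30, 35],
    "Training Cost": [1200, 700, 2100, 900, 1600, 1900],
})

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), ["Department"]),
        ("num", StandardScaler(), ["Training Hours"]),
    ])),
    ("ridge", Ridge(alpha=1.0)),
]).fit(train[["Department", "Training Hours"]], train["Training Cost"])

# Save to a temporary folder, then reload as a dashboard process might.
path = os.path.join(tempfile.mkdtemp(), "ridge_demo.pkl")
joblib.dump(pipe, path)
loaded = joblib.load(path)

# New rows must use the same raw column names as the training data.
new_rows = pd.DataFrame({"Department": ["IT"], "Training Hours": [25]})
print(loaded.predict(new_rows))
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.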
Summary
This workflow turns raw HRIS tables into an explainable employee‑training cost forecaster:
- Early‑stage budgeting: Finance can project next‑quarter L&D spend before enrolments close.
- Transparent levers: Every coefficient is a dollar impact tied to a policy knob (hours, salary band, high-potential flag).
- Sturdy baseline: any future gradient‑boosted or causal model must beat this Ridge MAE and remain interpretable for HR leadership.