Course Engagement Cost Prediction using Bayesian Regression in ML
Online learning providers incur variable operational costs (bandwidth, CPU, instructor time) for each active learner in a course. These engagement costs depend on features such as video watch time, number of quiz attempts, forum posts, and assignment submissions. Engagement behaviour is nonlinear (for example, each extra hour of video has a lower marginal cost, while heavy forum activity can spike moderation cost) and uncertain because user behaviour varies. We need a model that produces both a point estimate of per-learner engagement cost and a credible interval quantifying the uncertainty, so platform operators can budget, price, and provision resources with a clear view of the risk.
Libraries Required
import pandas as pd                  # data I/O
import numpy as np                   # numerical ops
import matplotlib.pyplot as plt      # plotting
import seaborn as sns                # visualization
import pymc3 as pm                   # Bayesian modeling
import arviz as az                   # posterior analysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
Dataset
Online Course Student Engagement Metrics
Step-by-Step Code Implementation
Load Data & Inspect
import pandas as pd
# Load engagement metrics
df = pd.read_csv("data/online_course_engagement_metrics.csv")
# Preview features and target
df[['video_watch_time','quiz_attempts','forum_posts',
    'assignment_submissions','engagement_cost']].head()
Preprocessing & Train/Test Split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Define predictors and target
X = df[['video_watch_time','quiz_attempts','forum_posts','assignment_submissions']].values
y = df['engagement_cost'].values # in USD
# Split into train/test (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Standardize features for stable MCMC sampling
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
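To make the scaling step concrete, the sketch below reproduces in plain NumPy what StandardScaler does when it is fit on the training split only, and shows how a coefficient learned on the standardised scale could be converted back to original units. The toy arrays, the coefficient values, and the names toy_train, toy_test, and standardize are illustrative assumptions, not part of the dataset or the original code.

```python
import numpy as np

# Toy feature matrix (hypothetical values, not taken from the dataset)
toy_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
toy_test = np.array([[2.0, 5.0]])

# Statistics come from the training split only, so no test information leaks in
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)  # population std (ddof=0), matching StandardScaler

def standardize(X):
    """Centre and scale X with the training-set mean and standard deviation."""
    return (X - train_mean) / train_std

toy_train_s = standardize(toy_train)
toy_test_s = standardize(toy_test)

# Hypothetical intercept and coefficients on the standardised scale
alpha_s, beta_s = 10.0, np.array([0.8, 0.3])

# Equivalent intercept and coefficients in original feature units
beta_orig = beta_s / train_std
alpha_orig = alpha_s - (beta_orig * train_mean).sum()

# Both parameterisations give the same prediction
pred_from_scaled = alpha_s + toy_test_s @ beta_s
pred_from_orig = alpha_orig + toy_test @ beta_orig
```

Dividing a standardised coefficient by the training standard deviation gives a "cost per original unit" figure, which is usually easier to communicate than a per-standard-deviation effect.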
Define & Fit the Bayesian Regression Model
Priors:
- α ∼ Normal(0, 10) represents baseline cost uncertainty.
- β ∼ Normal(0, 5) allows moderate coefficient variation.
- σ ∼ HalfNormal(5) enforces positive residual noise.
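Model: We posit engagement_cost ∼ Normal(α + β·X, σ), where X holds the standardised features. The mean is linear in X, so the nonlinear effects described earlier are captured only approximately unless transformed features are added.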
Sampling: We draw 2,000 posterior samples per chain (after 1,000 tuning steps per chain) with target_accept=0.9, which makes the sampler take smaller steps and reduces the risk of divergences.
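Before fitting, it can help to see what range of costs the priors listed above imply on their own. The following is a minimal prior predictive sketch in NumPy rather than PyMC3; the single learner x and its roughly standard-normal features are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_features = 5000, 4  # four predictors, as in the feature list above

# Draw parameters from the stated priors
alpha = rng.normal(0.0, 10.0, size=n_draws)               # α ~ Normal(0, 10)
beta = rng.normal(0.0, 5.0, size=(n_draws, n_features))   # β ~ Normal(0, 5)
sigma = np.abs(rng.normal(0.0, 5.0, size=n_draws))        # σ ~ HalfNormal(5)

# One hypothetical learner with standardised features (assumed roughly N(0, 1))
x = rng.normal(size=n_features)

# Simulated cost for that learner under each prior draw
mu = alpha + beta @ x
prior_cost = rng.normal(mu, sigma)

# Central 94% range of the simulated costs
lo, hi = np.percentile(prior_cost, [3, 97])
print(f"Central 94% of prior-predictive costs: {lo:.1f} to {hi:.1f} USD")
```

Because every prior here is centred at zero, about half of the simulated costs come out negative. If negative costs are implausible for the platform, that is a hint that a different intercept prior may be worth considering.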
import pymc3 as pm
with pm.Model() as cost_model:
    # Priors: intercept α and coefficients β
    α = pm.Normal("α", mu=0, sigma=10)
    β = pm.Normal("β", mu=0, sigma=5, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=5)
    # Expected engagement cost
    μ = α + pm.math.dot(X_train_s, β)
    # Likelihood: observed engagement_cost
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)
    # Sample from the posterior
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
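One common way to check whether sampling was stable is the R-hat statistic, which az.summary reports in its r_hat column. The sketch below is a simplified Gelman-Rubin computation on simulated chains, meant only to build intuition; ArviZ uses a more refined rank-normalised split version, and the function name gelman_rubin and the simulated chains are assumptions for illustration.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic Gelman-Rubin R-hat for an array of shape (n_chains, n_draws)."""
    n_chains, n_draws = chains.shape
    within = chains.var(axis=1, ddof=1).mean()            # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)   # B: scaled variance of chain means
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(var_hat / within))

rng = np.random.default_rng(1)

# Four simulated chains that all explore the same distribution
mixed = rng.normal(0.0, 1.0, size=(4, 2000))

# The same chains, but with two of them shifted to a different region
stuck = mixed + np.array([[0.0], [0.0], [2.0], [2.0]])

rhat_mixed = gelman_rubin(mixed)   # close to 1 when chains agree
rhat_stuck = gelman_rubin(stuck)   # clearly above 1 when chains disagree
```

Values close to 1 suggest the chains agree with each other; values noticeably above 1 suggest running longer or reparameterising the model.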
Posterior Analysis & Point Predictions
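- Sampling Y_obs yields full posterior predictive distributions for the training observations; for the test set we compute point forecasts from the posterior means of α and β. The ArviZ summary reports 94% Highest Density Intervals by default.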
- Mean Absolute Error quantifies point‐forecast accuracy on held‑out data.
import arviz as az
from sklearn.metrics import mean_absolute_error
# Summarize posterior distributions
az.summary(trace, round_to=2)
# Posterior predictive sampling
with cost_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Posterior means of parameters
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
# Point predictions on test set
y_pred = α_post + X_test_s.dot(β_post)
# Evaluate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
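The 94% highest density interval mentioned above is the narrowest interval that contains 94% of the samples. The sketch below shows one simple way this could be computed from raw draws and compares it with an equal-tailed percentile interval. The helper hdi_1d and the gamma-distributed draws are hypothetical and chosen only to show the effect of skew; in practice az.hdi does this job.

```python
import numpy as np

def hdi_1d(samples, prob=0.94):
    """Narrowest interval containing `prob` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    k = int(np.floor(prob * n))     # index distance spanned by each candidate interval
    widths = x[k:] - x[: n - k]     # width of every candidate interval
    i = int(np.argmin(widths))      # start index of the narrowest candidate
    return x[i], x[i + k]

rng = np.random.default_rng(2)

# Hypothetical right-skewed predictive draws for one learner's cost (USD)
draws = rng.gamma(shape=2.0, scale=3.0, size=20000)

hdi_lo, hdi_hi = hdi_1d(draws)
eq_lo, eq_hi = np.percentile(draws, [3, 97])   # equal-tailed interval for comparison

# Fraction of draws that fall inside the highest density interval
coverage = np.mean((draws >= hdi_lo) & (draws <= hdi_hi))
```

For a skewed distribution the highest density interval is shorter than the equal-tailed one and sits closer to the mode, which is why the two can give different budget bounds.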
Visualise Predictions & Credible Intervals
By sweeping one key feature (video_watch_time) while holding the others at their training-set medians, we visualise how engagement cost scales with watch time and how uncertain that relationship is, so budgeters can see both the expected cost and the risk around it.
import numpy as np
import matplotlib.pyplot as plt
# Vary video_watch_time; hold others at median
grid_time = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid = np.median(X_train_s, axis=0)[None,:].repeat(100, axis=0)
grid[:,0] = grid_time
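# Y_obs is bound to X_train_s, so predict on the grid directly from posterior draws
α_s = trace.posterior["α"].values.reshape(-1)                    # (n_samples,)
β_s = trace.posterior["β"].values.reshape(-1, grid.shape[1])     # (n_samples, n_features)
σ_s = trace.posterior["σ"].values.reshape(-1)                    # (n_samples,)
μ_grid = α_s[:, None] + β_s @ grid.T                             # (n_samples, 100)
preds = np.random.default_rng(42).normal(μ_grid, σ_s[:, None])   # add observation noise
mean_pred = preds.mean(axis=0)
hpd = az.hdi(preds, hdi_prob=0.94)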
# Back‑transform video_watch_time to original scale
time_orig = scaler.inverse_transform(grid)[:,0]
plt.figure(figsize=(8,5))
plt.plot(time_orig, mean_pred, label="Posterior mean")
plt.fill_between(time_orig, hpd[:,0], hpd[:,1], alpha=0.3,
label="94% credible interval")
plt.scatter(
scaler.inverse_transform(X_test_s)[:,0],
y_test, color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Video Watch Time (hours)")
plt.ylabel("Engagement Cost (USD)")
plt.title("Bayesian Regression: Engagement Cost vs. Watch Time")
plt.legend()
plt.tight_layout()
plt.show()
Summary
Using Bayesian Regression for Course Engagement Cost Prediction provides:
1. Point estimates of per‑learner engagement cost from usage metrics.
2. Credible intervals that quantify forecasting uncertainty, which is crucial for operational budgeting.
3. Actionable insights: platform operators can provision compute and staffing with explicit cost bounds, optimise course design for cost efficiency, and price courses to cover expected variable costs under uncertainty.
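As a rough illustration of the budgeting use case, per-learner predictive draws can be summed within each draw to get a distribution of total cohort cost, from which a budget can be read off at a chosen quantile. The sketch below uses simulated draws as a stand-in for real model output; the cohort size, the cost figures, and the 95% budget level are all assumptions. With real posterior predictive draws, each row shares the same sampled α, β, and σ, which is why the sum should be taken per draw rather than by adding per-learner interval bounds.

```python
import numpy as np

rng = np.random.default_rng(3)
n_draws, n_learners = 4000, 250

# Hypothetical predictive draws of per-learner cost (USD), standing in for
# real model output of shape (n_draws, n_learners)
per_learner = rng.normal(loc=12.0, scale=3.0, size=(n_draws, n_learners))

# Total cohort cost under each draw
total_cost = per_learner.sum(axis=1)

expected_total = total_cost.mean()
budget_95 = np.percentile(total_cost, 95)        # exceeded in about 5% of draws
price_floor = budget_95 / n_learners             # per-learner price covering that budget

print(f"Expected total: ${expected_total:,.0f}; 95% budget: ${budget_95:,.0f}")
```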