Course Engagement Cost Prediction using Bayesian Regression in ML


Online learning providers incur variable operational costs, such as bandwidth, CPU, and instructor time, for each active learner in a course. These engagement costs depend on features such as video watch time, number of quiz attempts, forum posts, and assignment submissions. Engagement behaviour is nonlinear (for example, each additional hour of video has a lower marginal cost, while heavy forum activity can spike moderation cost) and uncertain because user behaviour varies. We need a model that produces both a point estimate of per-learner engagement cost and a credible interval quantifying its uncertainty, so platform operators can budget, price, and provision resources with confidence.

Libraries Required

import pandas as pd                              # data I/O  
import numpy as np                               # numerical ops  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # visualization  

import pymc3 as pm                               # Bayesian modeling  
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error

Dataset

Online Course Student Engagement Metrics

Step-by-Step Code Implementation

Load Data & Inspect

import pandas as pd

# Load engagement metrics
df = pd.read_csv("data/online_course_engagement_metrics.csv")

# Preview features and target
df[['video_watch_time','quiz_attempts','forum_posts',
    'assignment_submissions','engagement_cost']].head()
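
Before modelling, it can help to confirm that the expected columns are present and to decide what to do with missing values. The helper below is a minimal sketch, not part of the original pipeline: the function name check_engagement_frame is hypothetical, dropping incomplete rows is an assumption (imputation may suit your data better), and the toy frame is made up purely for illustration.

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "video_watch_time", "quiz_attempts", "forum_posts",
    "assignment_submissions", "engagement_cost",
]

def check_engagement_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Raise if a required column is absent; drop rows with missing values."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns: {missing}")
    # Assumption: incomplete rows are simply discarded
    return df.dropna(subset=REQUIRED_COLUMNS).reset_index(drop=True)

# Tiny made-up frame for illustration only (not real course data)
toy = pd.DataFrame({
    "video_watch_time": [1.5, 3.0, None],
    "quiz_attempts": [2, 5, 1],
    "forum_posts": [0, 4, 2],
    "assignment_submissions": [1, 3, 2],
    "engagement_cost": [0.8, 2.1, 1.0],
})
clean = check_engagement_frame(toy)
print(len(clean), "complete rows kept")
```

On the real data the same call would be clean = check_engagement_frame(df), run before the predictors and target are extracted.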

Preprocessing & Train/Test Split

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define predictors and target
X = df[['video_watch_time','quiz_attempts','forum_posts','assignment_submissions']].values
y = df['engagement_cost'].values  # in USD

# Split into train/test (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features for stable MCMC sampling
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)
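
To make the standardization step concrete, here is a NumPy-only sketch of what fitting on the training split and transforming both splits amounts to. The arrays are made-up stand-ins for X_train and X_test, and the names ending in _demo are hypothetical. The point it illustrates is that the test rows are scaled with training statistics, so their mean is close to, but not exactly, zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up feature matrices standing in for X_train / X_test
X_train_demo = rng.gamma(shape=2.0, scale=3.0, size=(200, 4))
X_test_demo = rng.gamma(shape=2.0, scale=3.0, size=(50, 4))

# Statistics come from the training rows only
mean_ = X_train_demo.mean(axis=0)
scale_ = X_train_demo.std(axis=0)   # population std (ddof=0), as StandardScaler uses

# The same shift and scale are applied to both splits
X_train_demo_s = (X_train_demo - mean_) / scale_
X_test_demo_s = (X_test_demo - mean_) / scale_

print("train mean:", X_train_demo_s.mean(axis=0).round(6))
print("train std: ", X_train_demo_s.std(axis=0).round(6))
print("test mean: ", X_test_demo_s.mean(axis=0).round(3))
```

Fitting the scaler on the training split alone keeps information about the test rows out of the model inputs.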

Define & Fit the Bayesian Regression Model

Priors:

α ∼ Normal(0, 10) represents baseline cost uncertainty.
β ∼ Normal(0, 5) allows moderate coefficient variation.
σ ∼ HalfNormal(5) enforces positive residual noise.

Model: We posit engagement_cost ∼ Normal(α + β·X, σ).

Sampling: We draw 2,000 posterior samples per chain (after 1,000 tuning steps per chain) with target_accept=0.9, which makes the sampler take smaller steps and reduces divergences.

import pymc3 as pm

with pm.Model() as cost_model:
    # Priors: intercept α and coefficients β
    α = pm.Normal("α", mu=0, sigma=10)
    β = pm.Normal("β", mu=0, sigma=5, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=5)

    # Expected engagement cost
    μ = α + pm.math.dot(X_train_s, β)

    # Likelihood: observed engagement_cost
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)

    # Sample from the posterior
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
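
One way to see what the priors above imply is a prior predictive simulation: draw parameters from the priors and simulate costs before looking at any data. The sketch below does this with NumPy only, as a rough stand-in for running the check inside the model. The learner vector x (one standard deviation above average on every feature) is a hypothetical example, and the draw count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws, n_features = 5000, 4

# Draw parameters from the stated priors
alpha = rng.normal(0.0, 10.0, size=n_draws)               # α ~ Normal(0, 10)
beta = rng.normal(0.0, 5.0, size=(n_draws, n_features))   # β ~ Normal(0, 5)
sigma = np.abs(rng.normal(0.0, 5.0, size=n_draws))        # σ ~ HalfNormal(5)

# Hypothetical learner: +1 standard deviation on every standardized feature
x = np.ones(n_features)

# Simulate costs from Normal(α + β·x, σ)
mu = alpha + beta @ x
cost = rng.normal(mu, sigma)

lo, hi = np.percentile(cost, [3, 97])
print(f"94% prior predictive range: {lo:.1f} to {hi:.1f} USD")
print(f"Share of negative simulated costs: {(cost < 0).mean():.2f}")
```

Because these priors are symmetric around zero, a large share of simulated costs is negative. Whether that matters depends on how strongly the data dominate the priors; if it does, a positive-only likelihood or a log-transformed target are options to consider.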

Posterior Analysis & Point Predictions

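az.summary reports posterior means, 94% Highest Density Intervals, and convergence diagnostics for each parameter. Sampling Y_obs yields posterior predictive draws for the training observations, which is useful as a model check.
Test-set point predictions use the posterior means of α and β, and Mean Absolute Error quantifies point-forecast accuracy on held-out data.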

import arviz as az
from sklearn.metrics import mean_absolute_error

# Summarize posterior distributions
az.summary(trace, round_to=2)

# Posterior predictive sampling for the training rows (model check)
with cost_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Posterior means of parameters
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values

# Point predictions on test set
y_pred = α_post + X_test_s.dot(β_post)

# Evaluate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
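
MAE only scores the point forecast. To check the uncertainty as well, one option is to build a predictive interval for every held-out learner from the posterior draws and count how often the observed cost falls inside it. The sketch below is written under stated assumptions: predictive_interval is a hypothetical helper, it uses equal-tailed percentile intervals rather than HDIs, and the posterior draws and data are synthetic stand-ins so the example runs on its own.

```python
import numpy as np

def predictive_interval(alpha_s, beta_s, sigma_s, X, prob=0.94, seed=0):
    """Equal-tailed predictive interval for each row of X from posterior draws."""
    rng = np.random.default_rng(seed)
    mu = alpha_s[:, None] + beta_s @ X.T        # (draws, n_rows)
    y_sim = rng.normal(mu, sigma_s[:, None])    # add observation noise
    tail = (1.0 - prob) / 2.0 * 100.0
    lower, upper = np.percentile(y_sim, [tail, 100.0 - tail], axis=0)
    return lower, upper

# Synthetic stand-ins (made up): data from a known linear model, and
# tightly concentrated "posterior draws" around the true parameters
rng = np.random.default_rng(7)
true_beta = np.array([2.0, 1.0, 0.5, 1.5])
X_demo = rng.normal(size=(300, 4))
y_demo = 5.0 + X_demo @ true_beta + rng.normal(0.0, 1.0, size=300)

alpha_s = rng.normal(5.0, 0.05, size=2000)
beta_s = rng.normal(true_beta, 0.05, size=(2000, 4))
sigma_s = np.abs(rng.normal(1.0, 0.03, size=2000))

lower, upper = predictive_interval(alpha_s, beta_s, sigma_s, X_demo)
coverage = np.mean((y_demo >= lower) & (y_demo <= upper))
print(f"Empirical coverage of the 94% interval: {coverage:.2f}")
```

With the fitted model, the draws would come from the trace, for example trace.posterior["α"].values.reshape(-1) and trace.posterior["β"].values.reshape(-1, X_test_s.shape[1]), with X_test_s and y_test in place of the demo arrays. Coverage far below 94% would suggest the intervals are too narrow for budgeting.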

Visualise Predictions & Credible Intervals

By sweeping one key feature (video_watch_time) while holding the others at their training medians, we can see how predicted engagement cost changes with watch time and how wide the predictive uncertainty is, so budgeters see both expected cost and risk.

import numpy as np
import matplotlib.pyplot as plt

# Vary video_watch_time; hold others at median
grid_time = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid = np.median(X_train_s, axis=0)[None,:].repeat(100, axis=0)
grid[:,0] = grid_time

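# The model was fit on X_train_s, so sampling Y_obs would return predictions
# for the training rows, not the grid. Simulate on the grid from the draws instead.
post = trace.posterior
α_s = post["α"].values.reshape(-1)                  # (draws,)
β_s = post["β"].values.reshape(-1, grid.shape[1])   # (draws, features)
σ_s = post["σ"].values.reshape(-1)                  # (draws,)

# Predictive draws at each grid point: Normal(α + grid·β, σ)
rng = np.random.default_rng(42)
μ_grid = α_s[:, None] + β_s @ grid.T                # (draws, 100)
preds  = rng.normal(μ_grid, σ_s[:, None])

mean_pred = preds.mean(axis=0)
hpd       = az.hdi(preds[None, ...], hdi_prob=0.94)  # chain axis added; shape (100, 2)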

# Back‑transform video_watch_time to original scale
time_orig = scaler.inverse_transform(grid)[:,0]

plt.figure(figsize=(8,5))
plt.plot(time_orig, mean_pred, label="Posterior mean")
plt.fill_between(time_orig, hpd[:,0], hpd[:,1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    scaler.inverse_transform(X_test_s)[:,0],
    y_test, color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Video Watch Time (hours)")
plt.ylabel("Engagement Cost (USD)")
plt.title("Bayesian Regression: Engagement Cost vs. Watch Time")
plt.legend()
plt.tight_layout()
plt.show()
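
The plot puts watch time back on its original scale, but the coefficients in the summary are still per standard deviation of each feature. A small conversion gives slopes per original unit, which are easier to read as cost per hour, per quiz attempt, and so on. This is a sketch only: to_original_units is a hypothetical helper, and the numbers in the check are made up.

```python
import numpy as np

def to_original_units(alpha_s, beta_s, mean_, scale_):
    """Convert an intercept and slopes fitted on standardized features to raw units.

    Uses y = α + Σ β_j (x_j - m_j) / s_j
           = (α - Σ β_j m_j / s_j) + Σ (β_j / s_j) x_j
    """
    beta_orig = beta_s / scale_
    alpha_orig = alpha_s - (beta_s * mean_ / scale_).sum(axis=-1)
    return alpha_orig, beta_orig

# Made-up numbers to check the algebra on a two-feature example
alpha_std = 3.0
beta_std = np.array([2.0, -1.0])
mean_ = np.array([10.0, 4.0])
scale_ = np.array([5.0, 2.0])

alpha_raw, beta_raw = to_original_units(alpha_std, beta_std, mean_, scale_)

# Both parameterizations should give the same prediction for the same learner
x_raw = np.array([12.0, 3.0])
pred_std = alpha_std + beta_std @ ((x_raw - mean_) / scale_)
pred_raw = alpha_raw + beta_raw @ x_raw
print(alpha_raw, beta_raw, pred_std, pred_raw)
```

With the fitted objects, mean_ and scale_ would be scaler.mean_ and scaler.scale_, and passing the flattened posterior draws instead of single values yields a full posterior for each per-unit cost.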

Summary

Using Bayesian Regression for Course Engagement Cost Prediction provides:

1. Point estimates of per‑learner engagement cost from early usage metrics.

2. Credible intervals that quantify forecasting uncertainty—crucial for operational budgeting.

3. Actionable insights: platform operators can provision compute and staffing with full awareness of cost bounds, optimise course design for cost efficiency, and price courses to cover expected variable costs under uncertainty.
