Course Engagement Prediction using Linear Regression in ML
Online-learning providers live by one metric: learner engagement—the percentage of enrolled students who consistently return, watch videos, submit quizzes, and post on forums. Accurately predicting each learner’s next‑week engagement score (e.g., minutes‑watched or a 0‑to‑100 composite) lets course teams trigger timely nudges, schedule live sessions, and personalise content.
Here we build a linear‑regression baseline that forecasts a learner’s weekly engagement score from data known up to the previous week: cumulative video minutes, quiz accuracy, forum activity, days since signup, and demographic flags. A transparent model pinpoints the first-order drivers of engagement and sets a factual benchmark before experimenting with time-series nets or uplift modelling.
Libraries Required
- pandas # data wrangling
- numpy # numerical helpers
- matplotlib.pyplot # sanity‑check plots
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained pipeline
Dataset Link
Predict Online Course Engagement
Step-by-Step Code Implementation
Why linear regression? Engagement often increases nearly linearly with cumulative study minutes and quiz mastery, while it declines with learner ageing (days since signup). A straight‑line model quantifies each lever, producing coefficients that managers can discuss in plain English.
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
Download and unzip online_course_engagement.csv from Kaggle (see the Dataset Link section above), then load:
df = pd.read_csv("online_course_engagement.csv")
print(df.head())
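Before going further, it can help to confirm that the loaded file actually contains the columns the later steps expect. The following is a minimal sketch, not part of the original walkthrough: the helper name missing_columns and the tiny demo frame are hypothetical, and the column list is simply copied from the code in this article.

```python
import pandas as pd

# Column names the walkthrough relies on (copied from the code in this article).
REQUIRED_COLS = ['video_minutes_cum', 'quiz_accuracy', 'forum_posts',
                 'days_since_signup', 'country', 'engagement_next']

def missing_columns(frame, required=REQUIRED_COLS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Tiny stand-in frame with made-up values; it lacks two expected columns.
demo = pd.DataFrame({
    'video_minutes_cum': [120.0, 45.5],
    'quiz_accuracy': [0.8, 0.6],
    'forum_posts': [3, 0],
    'country': ['IN', 'US'],
})
print(missing_columns(demo))  # ['days_since_signup', 'engagement_next']
```

On the real data you would call missing_columns(df) right after read_csv and rename or derive any columns it reports before continuing.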
3. Feature engineering
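Standard scaling puts the numeric features (video minutes, quiz accuracy, forum counts, days since signup) on a common scale with zero mean and unit variance, so each coefficient reads as engagement points per 1 σ change in that feature, which makes a tidy story for dashboards. The scaling itself is applied inside the pipeline in step 5; this step only cleans the rows and adds a flag.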
# drop rows missing critical predictors or label
core_cols = ['video_minutes_cum', 'quiz_accuracy',
             'forum_posts', 'days_since_signup',
             'country', 'engagement_next']
df = df.dropna(subset=core_cols)
# flag heavy forum users
df['forum_heavy'] = (df['forum_posts'] > df['forum_posts'].median()).astype(int)
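To make the scaling remark above concrete, here is a small sketch of what StandardScaler does to two columns on very different scales. The numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: minutes watched and quiz accuracy live on very different scales.
raw = np.array([[600.0, 0.90],
                [120.0, 0.55],
                [300.0, 0.70],
                [ 60.0, 0.40]])

scaler = StandardScaler()
scaled = scaler.fit_transform(raw)

print(scaled.mean(axis=0))  # each column is centred at (numerically) zero
print(scaled.std(axis=0))   # each column has unit standard deviation
print(scaler.scale_)        # per-column standard deviations of the raw data
```

The scale_ attribute is worth remembering: it stores each feature's standard deviation, which is what you need later to translate a per-σ coefficient back into raw units.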
4. Define predictors & label
The forum-heavy flag created in step 3 separates lurkers from active discussants, revealing how social learning correlates with persistence.
num_cols = ['video_minutes_cum', 'quiz_accuracy',
            'forum_posts', 'days_since_signup']
cat_cols = ['country', 'forum_heavy'] # forum_heavy treated as categorical flag
target = 'engagement_next'
X = df[num_cols + cat_cols]
y = df[target]
5. Pre‑processing & model pipeline
One‑hot country codes control for regional usage patterns (e.g., mobile‑first vs. desktop) without imposing a false numeric order.
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline([
    ('prep', preproc),
    ('model', linreg)
])
6. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
7. Evaluation
R² and MAE together show how much variance the model captures and the typical absolute error, expressed on the same 0-100 scale that product teams care about.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} points")
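An MAE figure is easier to judge next to a naive reference. One common option, sketched here on synthetic data rather than the course dataset, is scikit-learn's DummyRegressor, which always predicts the training mean. The data-generating numbers below are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): one standardised feature plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 1))
y_demo = 50 + 8 * X_demo[:, 0] + rng.normal(scale=5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Naive reference: always predict the mean of the training labels.
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_model = mean_absolute_error(y_te, model.predict(X_te))
r2_model = r2_score(y_te, model.predict(X_te))

print(f"MAE baseline: {mae_baseline:.2f} points")
print(f"MAE model   : {mae_model:.2f} points")
print(f"R² model    : {r2_model:.3f}")
```

With the article's objects, the same comparison would be DummyRegressor(strategy='mean').fit(X_train, y_train) scored against y_test alongside pipe.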
8. Interpret influential features
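The coefficient table shows growth marketers which behaviours are associated with higher engagement. Because the numeric features are standardised, each numeric coefficient is the expected change in next-week engagement per one-standard-deviation increase in that feature, e.g., "a 1 σ rise in cumulative video minutes goes with roughly 2 more points."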
# recover one‑hot names
ohe_feats = (pipe.named_steps['prep']
                 .named_transformers_['cat']
                 .get_feature_names_out(cat_cols))
all_feats = list(ohe_feats) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_,
                  index=all_feats).sort_values()
print("\nNegative‑impact features (reduce engagement):")
print(coefs.head(5))
print("\nPositive‑impact features (boost engagement):")
print(coefs.tail(5))
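For example, a one-standard-deviation rise in quiz_accuracy might lift next-week engagement by ~4 points, while a one-standard-deviation increase in days_since_signup could shave off ~1 point. To restate an effect in raw units (say, per seven days since signup), divide the coefficient by that feature's standard deviation and multiply by the number of units.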
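Since the numeric coefficients are per standard deviation, quoting an effect "per 30 minutes" or "per seven days" needs one extra division. The sketch below illustrates the conversion on synthetic data in which engagement is constructed to rise 2 points per 30 video minutes; the data and the names demo_pipe, coef_per_sigma, and coef_per_minute are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): engagement rises 2 points per 30 video minutes.
rng = np.random.default_rng(0)
video_minutes = rng.uniform(0, 600, size=400)
engagement = 20 + (2 / 30) * video_minutes + rng.normal(scale=1.0, size=400)

demo_pipe = Pipeline([('scale', StandardScaler()),
                      ('model', LinearRegression())])
demo_pipe.fit(video_minutes.reshape(-1, 1), engagement)

# Coefficient as fitted: points per one standard deviation of video minutes.
coef_per_sigma = demo_pipe.named_steps['model'].coef_[0]
# Standard deviation the scaler learned from the raw feature.
sigma = demo_pipe.named_steps['scale'].scale_[0]
# Back to raw units: points per single minute.
coef_per_minute = coef_per_sigma / sigma

print(f"per 1 sigma ({sigma:.0f} min): {coef_per_sigma:.2f} points")
print(f"per 30 minutes: {30 * coef_per_minute:.2f} points")
```

In the article's pipeline, the matching standard deviations should be available as pipe.named_steps['prep'].named_transformers_['num'].scale_, in the same order as num_cols.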
9. Persist the trained pipeline
Joblib persistence freezes preprocessing and weights together, letting tomorrow's CRM job load the .pkl file, score all current learners, and trigger personalised nudges in minutes.
joblib.dump(pipe, "course_engagement_linreg.pkl")
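The scoring side of that workflow might look like the sketch below. It is self-contained, so it trains a small stand-in pipeline on made-up rows, saves it to a temporary file, reloads it, and scores new learners. The data, the file name demo_linreg.pkl, and the variable names are hypothetical; only the column names come from this article.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows using a subset of the article's column names.
train = pd.DataFrame({
    'video_minutes_cum': [30.0, 120.0, 300.0, 450.0, 600.0, 90.0],
    'quiz_accuracy': [0.4, 0.6, 0.7, 0.8, 0.9, 0.5],
    'country': ['IN', 'US', 'IN', 'BR', 'US', 'BR'],
    'engagement_next': [22.0, 41.0, 58.0, 70.0, 85.0, 33.0],
})
demo_num = ['video_minutes_cum', 'quiz_accuracy']
demo_cat = ['country']

demo_pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), demo_cat),
        ('num', StandardScaler(), demo_num),
    ])),
    ('model', LinearRegression()),
]).fit(train[demo_num + demo_cat], train['engagement_next'])

# Save, reload, and score learners the model has not seen.
path = os.path.join(tempfile.mkdtemp(), 'demo_linreg.pkl')
joblib.dump(demo_pipe, path)
loaded = joblib.load(path)

new_learners = pd.DataFrame({
    'video_minutes_cum': [200.0, 15.0],
    'quiz_accuracy': [0.65, 0.30],
    'country': ['US', 'DE'],  # 'DE' is unseen; handle_unknown='ignore' copes with it
})
scores = loaded.predict(new_learners[demo_num + demo_cat])
print(scores.round(1))
```

Because the whole pipeline is pickled, the scoring job passes raw columns and never has to repeat the encoding or scaling by hand. Loading generally works best with the same scikit-learn version that produced the file.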
Summary
With under a hundred lines of Python, we transformed raw LMS logs into an explainable course‑engagement forecaster. The linear model delivers:
- Instant, interpretable risk scores so instructors can reach out before learners drift away.
- Transparent behavioural levers—quantifying exactly how video time, quiz success, and forum activity tug future engagement up or down.
Keep this baseline on hand; when you pivot to gradient‑boosted trees, sequence models, or causal forests, you’ll know precisely how much extra uplift the sophistication buys.