Course Engagement Prediction using Linear Regression in ML


Online-learning providers live by one metric: learner engagement—the percentage of enrolled students who consistently return, watch videos, submit quizzes, and post on forums. Accurately predicting each learner’s next‑week engagement score (e.g., minutes‑watched or a 0‑to‑100 composite) lets course teams trigger timely nudges, schedule live sessions, and personalise content.

Here we build a linear‑regression baseline that forecasts a learner’s weekly engagement score from data known up to the previous week: cumulative video minutes, quiz accuracy, forum activity, days since signup, and demographic flags. A transparent model pinpoints the first-order drivers of engagement and sets a factual benchmark before experimenting with time-series nets or uplift modelling.

Libraries Required

pandas # data wrangling
numpy # numerical helpers
matplotlib.pyplot # sanity‑check plots
scikit‑learn # preprocessing, model, metrics
joblib # persist the trained pipeline

Dataset Link

Predict Online Course Engagement

Step-by-Step Code Implementation

Why linear regression? Engagement often increases nearly linearly with cumulative study minutes and quiz mastery, while it declines with learner ageing (days since signup). A straight‑line model quantifies each lever, producing coefficients that managers can discuss in plain English.

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the data

Download and unzip online_course_engagement.csv from Kaggle (see the Dataset Link section above), then load:

df = pd.read_csv("online_course_engagement.csv")
print(df.head())
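Before engineering features, it can help to confirm that the file really contains the columns the later steps expect. The sketch below is only an illustration: the helper name check_schema is hypothetical, the tiny frame is made-up data standing in for the CSV, and the column list is simply the one used in this walkthrough's code.

```python
import pandas as pd

# Columns the later steps rely on (taken from this walkthrough's code).
EXPECTED_COLS = ['video_minutes_cum', 'quiz_accuracy', 'forum_posts',
                 'days_since_signup', 'country', 'engagement_next']

def check_schema(frame, expected=EXPECTED_COLS):
    """Return (missing column names, null counts for the expected columns that exist)."""
    missing = [c for c in expected if c not in frame.columns]
    present = [c for c in expected if c in frame.columns]
    null_counts = frame[present].isna().sum().to_dict()
    return missing, null_counts

# Tiny made-up frame standing in for the real CSV; it deliberately
# lacks 'days_since_signup' and has one missing video value.
demo = pd.DataFrame({
    'video_minutes_cum': [120.0, 45.0, None],
    'quiz_accuracy':     [0.8, 0.6, 0.9],
    'forum_posts':       [3, 0, 7],
    'country':           ['IN', 'US', 'IN'],
    'engagement_next':   [72.0, 40.0, 88.0],
})

missing_cols, null_counts = check_schema(demo)
print("Missing columns:", missing_cols)
print("Null counts    :", null_counts)
```

On the real data you would call check_schema(df) right after read_csv and stop if any column is reported missing, since every later step indexes these names directly.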

3. Feature engineering

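Standard scaling, applied later inside the pipeline (step 5), puts video minutes, quiz accuracy, forum counts, and days since signup on equal variance, so coefficients read as points per 1 σ change: a tidy story for dashboards. Before that, we drop incomplete rows and add a forum-activity flag.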

# drop rows missing critical predictors or label
core_cols = ['video_minutes_cum', 'quiz_accuracy',
             'forum_posts', 'days_since_signup',
             'country', 'engagement_next']
df = df.dropna(subset=core_cols)

# flag heavy forum users
df['forum_heavy'] = (df['forum_posts'] > df['forum_posts'].median()).astype(int)

4. Define predictors & label

Forum‑heavy flag separates lurkers from active discussants, revealing how social learning correlates with persistence.

num_cols = ['video_minutes_cum', 'quiz_accuracy',
            'forum_posts', 'days_since_signup']

cat_cols = ['country', 'forum_heavy']   # forum_heavy treated as categorical flag
target   = 'engagement_next'

X = df[num_cols + cat_cols]
y = df[target]

5. Pre‑processing & model pipeline

One‑hot country codes control for regional usage patterns (e.g., mobile‑first vs. desktop) without imposing a false numeric order.

preproc = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline([
        ('prep',  preproc),
        ('model', linreg)
])

6. Train‑test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)

7. Evaluation

R² and MAE together show how much variance the model captures and the typical absolute error, expressed on the same 0‑100 scale that product teams care about.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} points")
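An MAE figure is easier to judge next to a naive reference. One common option, sketched below on synthetic data (the numbers are generated, not drawn from the course dataset), is scikit-learn's DummyRegressor, which always predicts the training mean. If the linear model barely beats it, the features carry little signal.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two features with a known linear effect plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 50 + 5 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(scale=2.0, size=200)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

baseline = DummyRegressor(strategy='mean').fit(Xtr, ytr)   # always predicts mean(ytr)
model = LinearRegression().fit(Xtr, ytr)

mae_baseline = mean_absolute_error(yte, baseline.predict(Xte))
mae_model = mean_absolute_error(yte, model.predict(Xte))
r2_baseline = r2_score(yte, baseline.predict(Xte))

print(f"MAE baseline (mean) : {mae_baseline:.2f}")
print(f"MAE linear model    : {mae_model:.2f}")
print(f"R² baseline         : {r2_baseline:.3f}")   # never above 0 for a constant predictor
```

With the tutorial's objects, the same comparison would fit the dummy model on X_train, y_train and score it on X_test, y_test alongside pipe.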

8. Interpret influential features

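The coefficient table tells growth marketers which behaviours to encourage. Because the numeric features are standardized, each coefficient reads per standard deviation, e.g., “A one‑standard‑deviation increase in cumulative video minutes lifts next‑week engagement by ~2 points.”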

# recover one‑hot names
ohe_feats = pipe.named_steps['prep']\
                .named_transformers_['cat']\
                .get_feature_names_out(cat_cols)

all_feats = list(ohe_feats) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_,
                  index=all_feats).sort_values()

print("\nNegative‑impact features (reduce engagement):")
print(coefs.head(5))

print("\nPositive‑impact features (boost engagement):")
print(coefs.tail(5))

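For example, a one‑standard‑deviation rise in quiz_accuracy might lift next‑week engagement by ~4 points, while a one‑standard‑deviation increase in days_since_signup could shave off ~1 point. To restate an effect in raw units (per minute or per day), divide the coefficient by that feature's training‑set standard deviation.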
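Because the numeric columns pass through StandardScaler, each fitted coefficient is expressed per one standard deviation of its feature, not per minute or per day. The sketch below shows the conversion on synthetic, noise-free data where the true effect is fixed at 2 points per 30 minutes; the data and the name demo_pipe are illustrative assumptions, not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: engagement rises by exactly 2 points per 30 video minutes.
rng = np.random.default_rng(1)
minutes = rng.uniform(0, 600, size=300).reshape(-1, 1)
engagement = 20 + (2 / 30) * minutes.ravel()

demo_pipe = Pipeline([('scale', StandardScaler()),
                      ('model', LinearRegression())])
demo_pipe.fit(minutes, engagement)

coef_per_sigma = demo_pipe.named_steps['model'].coef_[0]    # points per 1 σ of minutes
sigma_minutes = demo_pipe.named_steps['scale'].scale_[0]    # σ learned from training data
coef_per_minute = coef_per_sigma / sigma_minutes            # points per raw minute

print(f"Per 1 σ ({sigma_minutes:.1f} min): {coef_per_sigma:.2f} points")
print(f"Per 30 minutes       : {30 * coef_per_minute:.2f} points")
```

In the tutorial's pipeline the fitted scaler sits inside the ColumnTransformer, so its standard deviations would be read from pipe.named_steps['prep'].named_transformers_['num'].scale_, in the same order as num_cols.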

9. Persist the trained pipeline

Joblib persistence freezes preprocessing and weights, letting tomorrow’s CRM job load the .pkl file, score all current learners, and trigger personalised nudges in minutes.

joblib.dump(pipe, "course_engagement_linreg.pkl")
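The scoring side of that workflow could look roughly like the sketch below. It is a self-contained illustration: it trains a reduced two-feature pipeline on made-up rows, writes it to a temporary file, reloads it, and scores a made-up batch of current learners. The file name demo_linreg.pkl and all values are assumptions for the example.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows with a reduced feature set.
train = pd.DataFrame({
    'quiz_accuracy':   [0.5, 0.7, 0.9, 0.6, 0.8, 0.4],
    'country':         ['IN', 'US', 'IN', 'BR', 'US', 'BR'],
    'engagement_next': [40.0, 60.0, 85.0, 50.0, 70.0, 30.0],
})
features = ['quiz_accuracy', 'country']

small_pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['country']),
        ('num', StandardScaler(), ['quiz_accuracy']),
    ])),
    ('model', LinearRegression()),
])
small_pipe.fit(train[features], train['engagement_next'])

# Round-trip through disk, as a scheduled scoring job would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_linreg.pkl')
    joblib.dump(small_pipe, path)
    loaded = joblib.load(path)

# Made-up batch of current learners; 'DE' was not seen during training.
current = pd.DataFrame({'quiz_accuracy': [0.75, 0.55], 'country': ['US', 'DE']})
current['predicted_engagement'] = loaded.predict(current[features])
print(current)
```

Two practical cautions: joblib files are pickles, so only load files you trust, and load them with the same scikit-learn version that produced them where possible.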

Summary

With under a hundred lines of Python, we transformed raw LMS logs into an explainable course‑engagement forecaster. The linear model delivers:

Instant, interpretable risk scores so instructors can reach out before learners drift away.
Transparent behavioural levers—quantifying exactly how video time, quiz success, and forum activity tug future engagement up or down.

Keep this baseline on hand; when you pivot to gradient‑boosted trees, sequence models, or causal forests, you’ll know precisely how much extra uplift the sophistication buys.
