Student Academic Score Prediction using Bayesian Regression in ML
Academic advisors and curriculum planners need to forecast a student's final exam score before the end of term, using midterm performance, homework completion rate, attendance, and demographic factors. Traditional point-estimate regressions provide a single predicted score but do not quantify the uncertainty that comes from limited or noisy data (e.g., variable homework effort). Bayesian regression yields not only a best estimate of exam performance but also credible intervals that capture that uncertainty, enabling risk-aware interventions (e.g., flagging students whose predicted scores have wide intervals for extra support).
Libraries Required
import pandas as pd  # data loading & handling
import numpy as np  # numerical operations
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # enhanced visualization
import pymc3 as pm  # Bayesian modeling
import arviz as az  # posterior analysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
Dataset
Student Performance Predictions
Step-by-Step Code Implementation
Import Libraries & Load Data
import pandas as pd
# Load dataset; features include 'midterm_score','homework_rate',
# 'attendance_rate','gender','parent_education','final_score'
df = pd.read_csv("data/student_performance_predictions.csv")
df.head()
Preprocessing & Train/Test Split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# One-hot encode categorical features as floats
# (recent pandas versions return boolean dummy columns by default)
df = pd.get_dummies(df, columns=['gender','parent_education'], drop_first=True, dtype=float)
# Define predictors & target
X = df.drop(columns=['final_score'])
y = df['final_score'].values
# Split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Standardize numeric columns for stable sampling
num_cols = ['midterm_score','homework_rate','attendance_rate']
scaler = StandardScaler().fit(X_train[num_cols])
X_train_s = X_train.copy()
X_train_s[num_cols] = scaler.transform(X_train[num_cols])
X_test_s = X_test.copy()
X_test_s[num_cols] = scaler.transform(X_test[num_cols])
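Because the numeric predictors are standardized, each fitted coefficient describes the change in final score per one standard deviation of the feature, not per raw point. The sketch below uses made-up numbers rather than values from the dataset to show how standardized coefficients could be converted back to raw units. The helper name `to_raw_units` is hypothetical.

```python
import numpy as np

def to_raw_units(alpha_std, beta_std, mean, scale):
    """Convert an intercept and coefficients fitted on z-scored features
    back to the scale of the raw features.

    mean and scale are the training-set statistics, e.g. the `mean_` and
    `scale_` attributes of a fitted StandardScaler.
    """
    beta_raw = beta_std / scale
    alpha_raw = alpha_std - np.sum(beta_std * mean / scale)
    return alpha_raw, beta_raw

# Illustrative (made-up) training statistics for midterm, homework, attendance
mean = np.array([62.0, 0.80, 0.90])
scale = np.array([15.0, 0.10, 0.05])
alpha_std, beta_std = 55.0, np.array([9.0, 3.0, 1.5])

alpha_raw, beta_raw = to_raw_units(alpha_std, beta_std, mean, scale)

# Both parameterizations give the same prediction for a raw feature vector
x_raw = np.array([70.0, 0.85, 0.95])
pred_std = alpha_std + beta_std @ ((x_raw - mean) / scale)
pred_raw = alpha_raw + beta_raw @ x_raw
print(round(float(pred_std), 3), round(float(pred_raw), 3))
```

Here a standardized midterm coefficient of 9 with a training standard deviation of 15 corresponds to 0.6 final-score points per raw midterm point.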
Define & Fit Bayesian Regression Model
- Likelihood: Observed final scores are Normal around the linear predictor with noise σ.
- MCMC: We draw 2,000 posterior samples per chain after 1,000 tuning (burn-in) steps, with target_accept=0.9 to reduce the risk of divergent transitions.
- Priors (α, β, σ): Weakly informative Normal priors on α and β, centred on plausible values (e.g., an intercept near an average score of ≈50), and a HalfNormal prior on σ.
- Model definition: Linear predictor using standardised numeric features plus dot‐product for multiple categorical coefficients.
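Written out, the bullets above correspond to the following model. The symbols $z$ (a standardized numeric feature), $d_{ik}$ (the $k$-th dummy column for student $i$), and $K$ (the number of dummy columns) are notation introduced here for illustration. The prior scales are the ones used in the code.

```latex
\begin{aligned}
y_i &\sim \mathcal{N}(\mu_i,\ \sigma^2) \\
\mu_i &= \alpha
  + \beta_{\text{mid}}\, z_{i,\text{mid}}
  + \beta_{\text{hw}}\, z_{i,\text{hw}}
  + \beta_{\text{att}}\, z_{i,\text{att}}
  + \sum_{k=1}^{K} \beta_{\text{cat},k}\, d_{ik} \\
\alpha &\sim \mathcal{N}(50,\ 20^2), \qquad
\beta_{\text{mid}},\ \beta_{\text{hw}},\ \beta_{\text{att}} \sim \mathcal{N}(0,\ 10^2) \\
\beta_{\text{cat},k} &\sim \mathcal{N}(0,\ 5^2), \qquad
\sigma \sim \text{HalfNormal}(10)
\end{aligned}
```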
import pymc3 as pm
with pm.Model() as bayes_model:
    # Priors: intercept α, coefficients β, noise σ
    α = pm.Normal("α", mu=50, sigma=20)
    β_mid = pm.Normal("β_mid", mu=0, sigma=10)
    β_hw = pm.Normal("β_hw", mu=0, sigma=10)
    β_att = pm.Normal("β_att", mu=0, sigma=10)
    # One β per dummy categorical column
    β_cat = pm.Normal("β_cat", mu=0, sigma=5, shape=X_train_s.shape[1] - 3)
    σ = pm.HalfNormal("σ", sigma=10)
    # Dummy columns as a plain float array for the dot product
    X_cat = X_train_s.drop(columns=num_cols).values.astype(float)
    # Expected final score (.values passes NumPy arrays, not pandas Series, to the model)
    mu = (
        α
        + β_mid * X_train_s['midterm_score'].values
        + β_hw * X_train_s['homework_rate'].values
        + β_att * X_train_s['attendance_rate'].values
        + pm.math.dot(X_cat, β_cat)
    )
    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=σ, observed=y_train)
    # Sample posterior
    trace = pm.sample(
        draws=2000, tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
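Setting `target_accept=0.9` does not by itself guarantee convergence, so it is worth checking the chains after sampling. `az.summary` reports an `r_hat` column for this purpose. To make the idea concrete, the sketch below implements the classic Gelman-Rubin statistic in NumPy on synthetic chains. It is a simplified version, not the rank-normalized split variant that ArviZ computes by default, and the function name `gelman_rubin` is hypothetical.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic Gelman-Rubin R-hat for one parameter.

    chains: array of shape (n_chains, n_draws).
    Values near 1.0 suggest the chains agree; values well above 1.0
    suggest they have not mixed.
    """
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    # W: mean of the within-chain variances
    within = chains.var(axis=1, ddof=1).mean()
    # B: variance of the chain means, scaled by the number of draws
    between = n_draws * chains.mean(axis=1).var(ddof=1)
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(0)
# Four synthetic chains sampling the same target
mixed = rng.normal(loc=50.0, scale=2.0, size=(4, 1000))
# The same chains, each shifted by a different offset (poor mixing)
stuck = mixed + np.arange(4).reshape(-1, 1) * 3.0

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(f"well-mixed: {rhat_mixed:.3f}, poorly mixed: {rhat_stuck:.3f}")
```

A common rule of thumb is to investigate any parameter whose R-hat is noticeably above 1.0 before trusting its posterior summary.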
Posterior Analysis & Prediction
- Posterior predictive: Samples from the posterior predictive distribution provide a complete representation of predictive uncertainty.
- Predictions: We use posterior means of the coefficients as point estimates and compute MAE on held-out test data.
import arviz as az
# Posterior summary
az.summary(trace, round_to=2)
# Posterior predictive sampling on training model
with bayes_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Compute posterior mean coefficients
post = trace.posterior
α_post = post['α'].mean().item()
β_mid_p = post['β_mid'].mean().item()
β_hw_p = post['β_hw'].mean().item()
β_att_p = post['β_att'].mean().item()
β_cat_p = post['β_cat'].mean(dim=['chain','draw']).values
# Predict on test set
Xc = X_test_s.drop(columns=num_cols).values
y_pred = (
    α_post
    + β_mid_p * X_test_s['midterm_score']
    + β_hw_p * X_test_s['homework_rate']
    + β_att_p * X_test_s['attendance_rate']
    + Xc.dot(β_cat_p)
)
# Evaluate
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: {mae:.2f} points")
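The MAE above is computed from a single point prediction per student. To keep the uncertainty, the same linear predictor can be evaluated for every posterior draw rather than for the posterior means. The following is a minimal sketch on synthetic draws standing in for `trace.posterior`. The array names, sizes, and values are assumptions for illustration, and the interval is an equal-tailed percentile interval rather than an HDI.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws, n_students, n_features = 4000, 5, 3

# Synthetic stand-ins for flattened posterior draws (chain and draw collapsed)
alpha = rng.normal(55.0, 0.5, size=n_draws)
beta = rng.normal([8.0, 3.0, 1.5], 0.4, size=(n_draws, n_features))
sigma = np.abs(rng.normal(6.0, 0.3, size=n_draws))

# Standardized features for a handful of hypothetical test students
X_new = rng.normal(size=(n_students, n_features))

# Linear predictor for every draw and student: shape (n_draws, n_students)
mu = alpha[:, None] + beta @ X_new.T
# Add observation noise to obtain posterior predictive draws
y_draws = rng.normal(mu, sigma[:, None])

point = y_draws.mean(axis=0)
lower, upper = np.percentile(y_draws, [3, 97], axis=0)
width = upper - lower

for i in range(n_students):
    print(f"student {i}: {point[i]:.1f} [{lower[i]:.1f}, {upper[i]:.1f}]")
```

Because the model has a single shared σ, interval widths will differ mainly for students whose features lie far from the training average, where uncertainty in the coefficients matters most.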
Visualise Posterior Predictive & Credible Intervals
Plotting the posterior mean and 94% credible bands against midterm scores illustrates both central tendency and uncertainty.
# Generate grid of midterm scores (other features fixed at their training medians)
mid_grid = np.linspace(0, 100, 50)
# Build the grid in raw units so that the scaler is applied exactly once
cat_cols = X_train.drop(columns=num_cols).columns
df_grid = pd.DataFrame({
    'midterm_score': mid_grid,
    'homework_rate': X_train['homework_rate'].median(),
    'attendance_rate': X_train['attendance_rate'].median(),
    **{col: float(X_train[col].astype(float).median()) for col in cat_cols}
})
# Standardize grid numeric
df_grid[num_cols] = scaler.transform(df_grid[num_cols])
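# Posterior predictive draws for the grid, computed directly from the posterior
# samples (the model has no pm.Data containers, so pm.set_data cannot swap inputs)
post = trace.posterior
a = post['α'].values.reshape(-1, 1)
b_mid = post['β_mid'].values.reshape(-1, 1)
b_hw = post['β_hw'].values.reshape(-1, 1)
b_att = post['β_att'].values.reshape(-1, 1)
s = post['σ'].values.reshape(-1, 1)
cat_grid = df_grid.drop(columns=num_cols).values.astype(float)
b_cat = post['β_cat'].values.reshape(-1, cat_grid.shape[1])
mu_grid = (
    a
    + b_mid * df_grid['midterm_score'].values
    + b_hw * df_grid['homework_rate'].values
    + b_att * df_grid['attendance_rate'].values
    + b_cat @ cat_grid.T
)  # shape: (n_samples, n_grid_points)
rng = np.random.default_rng(42)
preds = rng.normal(mu_grid, s)  # add observation noise
# Compute mean and 94% credible interval
mean_pred = preds.mean(axis=0)
# Leading axis marks a single "chain" so az.hdi reads the array as (chain, draw, point)
hpd = az.hdi(preds[np.newaxis], hdi_prob=0.94)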
plt.figure(figsize=(8,5))
plt.plot(mid_grid, mean_pred, label="Posterior mean")
plt.fill_between(mid_grid, hpd[:,0], hpd[:,1], alpha=0.3, label="94% CI")
plt.scatter(X_test['midterm_score'], y_test, color='k', alpha=0.5, label="Test data")
plt.xlabel("Midterm Score")
plt.ylabel("Final Exam Score")
plt.title("Bayesian Regression: Final Score vs. Midterm")
plt.legend()
plt.show()
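The band in the plot is a highest density interval (HDI), which `az.hdi` computes as the narrowest interval containing the requested probability mass. As a rough illustration of what that means, here is a NumPy sketch for a single set of samples. `narrowest_interval` is a hypothetical helper, not ArviZ's implementation.

```python
import numpy as np

def narrowest_interval(samples, prob=0.94):
    """Return the shortest interval containing `prob` of the samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(prob * n))           # number of points the interval must cover
    widths = x[k - 1:] - x[:n - k + 1]   # width of every window of k sorted points
    start = int(np.argmin(widths))
    return x[start], x[start + k - 1]

rng = np.random.default_rng(2)
# A right-skewed sample, where the two interval types differ visibly
skewed = rng.exponential(scale=1.0, size=20000)

hdi_lo, hdi_hi = narrowest_interval(skewed, prob=0.94)
eq_lo, eq_hi = np.percentile(skewed, [3, 97])
print(f"HDI: [{hdi_lo:.2f}, {hdi_hi:.2f}]  equal-tailed: [{eq_lo:.2f}, {eq_hi:.2f}]")
```

For a roughly symmetric posterior predictive distribution, as expected from the Normal likelihood used here, the HDI and the equal-tailed interval nearly coincide; they diverge for skewed distributions.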
Summary
This Bayesian Regression workflow delivers:
- Uncertainty-aware predictions of final exam scores, with credible intervals that show how much confidence each forecast deserves.
- Regularisation via priors, mitigating overfitting when data are limited or noisy.
- Actionable insights: educators can identify students whose predicted scores are highly uncertain (wide credible intervals) to target support, and see how midterm exams, homework, attendance, and demographics are associated with outcomes.
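The regularisation point can be made concrete in a simplified setting. If σ is treated as known and each coefficient has a Normal(0, τ) prior, the posterior mode of the coefficients equals the ridge regression solution with penalty λ = σ²/τ². The sketch below uses synthetic data and fixed, assumed values of σ and τ. It is not the full model fitted above, where σ is itself estimated.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 12, 8                      # few students, many predictors
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[0] = 5.0                # only the first predictor matters
sigma, tau = 6.0, 5.0             # assumed noise SD and prior SD
y = X @ true_beta + rng.normal(scale=sigma, size=n)

# Ordinary least squares (no prior)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Posterior mode under beta ~ Normal(0, tau): ridge with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

norm_ols = float(np.linalg.norm(beta_ols))
norm_map = float(np.linalg.norm(beta_map))
print(f"OLS coefficient norm: {norm_ols:.2f}")
print(f"MAP coefficient norm: {norm_map:.2f}")
```

The prior pulls the coefficient vector toward zero, and the pull is strongest when the data are few relative to the number of predictors.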