Student Academic Score Prediction using Bayesian Regression in ML

Academic advisors and curriculum planners need to forecast a student’s final exam score—before the end of term—using midterm performance, homework completion rate, attendance, and demographic factors. Traditional point‐estimate regressions provide a single predicted score but fail to quantify uncertainty from limited or noisy data (e.g., variable homework effort). By applying Bayesian Regression, we obtain not only a best estimate of exam performance but also credible intervals that capture uncertainty, enabling risk‐aware interventions (e.g., identifying students with high variance in predicted outcomes for extra support).

Libraries Required

import pandas as pd                              # data loading & handling  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # enhanced visualization  

import pymc3 as pm                               # Bayesian modeling  
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error

Dataset

Student Performance Predictions

Step-by-Step Code Implementation

Import Libraries & Load Data

import pandas as pd

# Load dataset; features include 'midterm_score','homework_rate',
# 'attendance_rate','gender','parent_education','final_score'
df = pd.read_csv("data/student_performance_predictions.csv")
df.head()

Preprocessing & Train/Test Split

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# One‑hot encode categorical features
df = pd.get_dummies(df, columns=['gender','parent_education'], drop_first=True)

# Define predictors & target
X = df.drop(columns=['final_score'])
y = df['final_score'].values

# Split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize numeric columns for stable sampling
num_cols = ['midterm_score','homework_rate','attendance_rate']
scaler = StandardScaler().fit(X_train[num_cols])
X_train_s = X_train.copy()
X_train_s[num_cols] = scaler.transform(X_train[num_cols])
X_test_s = X_test.copy()
X_test_s[num_cols] = scaler.transform(X_test[num_cols])
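Because the numeric predictors are standardised, any coefficient later estimated on them is expressed in final-score points per one standard deviation of the feature, not per raw point. The sketch below shows the conversion back to raw units. It uses NumPy only; the midterm values and the coefficients (`alpha_std`, `beta_std`) are made-up numbers for illustration, not results from this dataset. It relies on StandardScaler using the population standard deviation (`ddof=0`), which it exposes as `scale_`.

```python
import numpy as np

# Hypothetical raw midterm scores, for illustration only
midterm = np.array([40.0, 55.0, 60.0, 75.0, 90.0])

# StandardScaler centres on the mean and divides by the population SD (ddof=0)
mean, sd = midterm.mean(), midterm.std()

# Made-up coefficients, as if estimated on the standardised feature
alpha_std, beta_std = 50.0, 8.0   # beta_std: points per 1 SD of midterm score

# Equivalent coefficients on the raw scale
beta_raw = beta_std / sd                       # points per raw midterm point
alpha_raw = alpha_std - beta_std * mean / sd   # intercept at midterm = 0

# Both parameterisations give identical predictions
pred_std = alpha_std + beta_std * (midterm - mean) / sd
pred_raw = alpha_raw + beta_raw * midterm
print(f"SD = {sd:.2f}, slope per raw point = {beta_raw:.3f}")
```

The same division by the feature's standard deviation applies to posterior draws of a coefficient, so credible intervals can be reported in raw units as well.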

Define & Fit Bayesian Regression Model

Priors (α, β, σ): Weakly informative Normal priors centred on plausible values (e.g., an intercept near an assumed average score of about 50), with a HalfNormal prior on σ so the noise scale stays positive.
Model definition: The linear predictor combines the standardised numeric features with a dot product between the dummy-coded categorical columns and their coefficients.
Likelihood: Observed final scores are Normal around the linear predictor with noise σ.
MCMC: We draw 2,000 posterior samples per chain after 1,000 tuning (burn-in) iterations, with target_accept=0.9 to reduce the risk of divergent transitions.
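One way to check whether priors like these are plausible is a prior predictive simulation: draw parameters from the priors and see what final scores they imply before any data are used. The following is a minimal NumPy sketch under simplifying assumptions: it includes only the intercept, the midterm coefficient, and the noise term, and it evaluates a hypothetical student whose midterm score is one standard deviation above the mean (`z_mid = 1.0`).

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 5000

# Draw parameters from the priors used in this tutorial
alpha = rng.normal(50, 20, n_draws)          # intercept ~ Normal(50, 20)
beta_mid = rng.normal(0, 10, n_draws)        # midterm slope ~ Normal(0, 10)
sigma = np.abs(rng.normal(0, 10, n_draws))   # HalfNormal(10) is |Normal(0, 10)|

# Simulate final scores for a student 1 SD above the mean midterm (assumed)
z_mid = 1.0
prior_scores = rng.normal(alpha + beta_mid * z_mid, sigma)

# Summarise what the priors imply before seeing any data
lo, hi = np.percentile(prior_scores, [3, 97])
share_out_of_range = np.mean((prior_scores < 0) | (prior_scores > 100))
print(f"94% prior predictive interval: [{lo:.1f}, {hi:.1f}]")
print(f"Share of simulated scores outside 0-100: {share_out_of_range:.1%}")
```

If a large share of simulated scores falls outside the 0 to 100 range, tightening the prior standard deviations is a reasonable adjustment to consider.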

import pymc3 as pm

# Convert predictors to plain float arrays; get_dummies can yield boolean
# columns, which may not convert cleanly when combined with PyMC3 tensors
X_num = X_train_s[num_cols].values.astype(float)
X_cat = X_train_s.drop(columns=num_cols).values.astype(float)

with pm.Model() as bayes_model:
    # Priors: intercept α, coefficients β, noise σ
    α = pm.Normal("α", mu=50, sigma=20)
    β_mid = pm.Normal("β_mid", mu=0, sigma=10)
    β_hw  = pm.Normal("β_hw",  mu=0, sigma=10)
    β_att = pm.Normal("β_att", mu=0, sigma=10)
    # One β per dummy categorical column
    β_cat = pm.Normal("β_cat", mu=0, sigma=5, shape=X_cat.shape[1])
    σ = pm.HalfNormal("σ", sigma=10)

    # Expected final score (columns of X_num follow the order of num_cols)
    mu = (
        α
        + β_mid * X_num[:, 0]
        + β_hw  * X_num[:, 1]
        + β_att * X_num[:, 2]
        + pm.math.dot(X_cat, β_cat)
    )

    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=σ, observed=y_train)

    # Sample posterior (draws and tune are per chain; seed fixed for reproducibility)
    trace = pm.sample(
        draws=2000, tune=1000,
        target_accept=0.9,
        random_seed=42,
        return_inferencedata=True
    )
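Before interpreting the posterior, it is worth checking that the chains agree with each other. ArviZ reports an `r_hat` column in `az.summary` for this purpose. The sketch below is not ArviZ's implementation (which uses a rank-normalised split version); it is the classic Gelman-Rubin formula written in NumPy, applied to synthetic chains, to illustrate what the statistic measures. The function name `gelman_rubin` and the simulated chains are assumptions made for this example.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic (non-split) R-hat for draws shaped (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    between = n_draws * chain_means.var(ddof=1)   # between-chain variance B
    within = chains.var(axis=1, ddof=1).mean()    # within-chain variance W
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(var_hat / within)

rng = np.random.default_rng(1)

# Four synthetic chains sampling the same distribution (well mixed)
mixed = rng.normal(0.0, 1.0, size=(4, 2000))

# Same chains, but two of them shifted, as if stuck in a different region
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(f"R-hat (mixed chains): {rhat_mixed:.3f}")
print(f"R-hat (stuck chains): {rhat_stuck:.3f}")
```

Values close to 1 suggest the chains are sampling the same distribution; values clearly above 1 indicate disagreement and call for longer tuning or a reparameterised model.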

Posterior Analysis & Prediction

Posterior predictive: Samples from the posterior predictive distribution provide a complete representation of predictive uncertainty.
Predictions: We use posterior means of the coefficients as point estimates and compute MAE on held-out test data.

import arviz as az

# Posterior summary
az.summary(trace, round_to=2)

# Posterior predictive sampling on training model
with bayes_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Compute posterior mean coefficients
post = trace.posterior
α_post   = post['α'].mean().item()
β_mid_p  = post['β_mid'].mean().item()
β_hw_p   = post['β_hw'].mean().item()
β_att_p  = post['β_att'].mean().item()
β_cat_p  = post['β_cat'].mean(dim=['chain','draw']).values

# Predict on test set
Xc = X_test_s.drop(columns=num_cols).values.astype(float)  # cast dummy columns to float
y_pred = (
    α_post
    + β_mid_p  * X_test_s['midterm_score']
    + β_hw_p   * X_test_s['homework_rate']
    + β_att_p  * X_test_s['attendance_rate']
    + Xc.dot(β_cat_p)
)

# Evaluate
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: {mae:.2f} points")
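Point predictions from posterior means discard the uncertainty that motivates the Bayesian approach. A per-student predictive interval can be obtained by pushing every posterior draw through the linear predictor and adding observation noise. The sketch below shows the idea with NumPy on synthetic stand-ins for posterior draws; the function name `predictive_intervals`, the simulated draws, and the three example students are all hypothetical. It computes equal-tailed percentile intervals, which are simpler than, and not identical to, the highest-density intervals ArviZ reports.

```python
import numpy as np

def predictive_intervals(alpha, beta, sigma, X, prob=0.94, seed=0):
    """Equal-tailed posterior predictive intervals per student.

    alpha, sigma: arrays of shape (n_draws,)
    beta:         array of shape (n_draws, n_features)
    X:            array of shape (n_students, n_features)
    """
    rng = np.random.default_rng(seed)
    mu = alpha[:, None] + beta @ X.T         # (n_draws, n_students)
    y_sim = rng.normal(mu, sigma[:, None])   # add observation noise per draw
    tail = (1 - prob) / 2 * 100
    lower, upper = np.percentile(y_sim, [tail, 100 - tail], axis=0)
    return y_sim.mean(axis=0), lower, upper

# Synthetic stand-ins for posterior draws (not results from this dataset)
rng = np.random.default_rng(42)
n_draws = 4000
alpha = rng.normal(65, 0.5, n_draws)
beta = rng.normal([8.0, 3.0], 0.4, size=(n_draws, 2))
sigma = np.abs(rng.normal(6, 0.3, n_draws))

# Three hypothetical students described by two standardised features
X_new = np.array([[0.0, 0.0], [1.5, -0.5], [-2.0, 1.0]])

mean, lower, upper = predictive_intervals(alpha, beta, sigma, X_new)
for m, lo, hi in zip(mean, lower, upper):
    print(f"predicted {m:.1f}, 94% interval [{lo:.1f}, {hi:.1f}]")
```

With the fitted model, the stacked draws of α, the β coefficients, and σ from `trace.posterior` would take the place of the synthetic arrays.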

Visualise Posterior Predictive & Credible Intervals

Plotting the posterior mean and 94% credible bands against midterm scores illustrates both central tendency and uncertainty.

# Generate a grid of raw midterm scores, holding other features at their training medians
mid_grid = np.linspace(0, 100, 50)
cat_cols = X_train.drop(columns=num_cols).columns
df_grid = pd.DataFrame({
    'midterm_score': mid_grid,
    'homework_rate': X_train['homework_rate'].median(),
    'attendance_rate': X_train['attendance_rate'].median(),
    **{col: X_train[col].astype(float).median() for col in cat_cols}
})

# Standardize grid numeric columns with the scaler fitted on the training data
df_grid[num_cols] = scaler.transform(df_grid[num_cols])

# The model stores the training data as constants (no pm.Data containers), so
# pm.set_data cannot swap in the grid; simulate from the posterior draws instead
draws = trace.posterior.stack(sample=("chain", "draw"))
mu_grid = (
    draws['α'].values[:, None]
    + draws['β_mid'].values[:, None] * df_grid['midterm_score'].values
    + draws['β_hw'].values[:, None]  * df_grid['homework_rate'].values
    + draws['β_att'].values[:, None] * df_grid['attendance_rate'].values
    + draws['β_cat'].transpose('sample', ...).values @ df_grid[cat_cols].values.T
)
rng = np.random.default_rng(42)
preds = rng.normal(mu_grid, draws['σ'].values[:, None])  # shape (n_samples, 50)

# Compute mean and 94% highest-density interval at each grid point
mean_pred = preds.mean(axis=0)
hpd = np.array([az.hdi(preds[:, j], hdi_prob=0.94) for j in range(preds.shape[1])])

plt.figure(figsize=(8,5))
plt.plot(mid_grid, mean_pred, label="Posterior mean")
plt.fill_between(mid_grid, hpd[:,0], hpd[:,1], alpha=0.3, label="94% CI")
plt.scatter(X_test['midterm_score'], y_test, color='k', alpha=0.5, label="Test data")
plt.xlabel("Midterm Score")
plt.ylabel("Final Exam Score")
plt.title("Bayesian Regression: Final Score vs. Midterm")
plt.legend()
plt.show()

Summary

This Bayesian Regression workflow delivers:

Uncertainty‐aware predictions of final exam scores—with credible intervals that inform confidence in each forecast.
Regularisation via priors, mitigating overfitting when data are limited or noisy.
Actionable insights: educators can identify students with high uncertainty in their predicted scores (wide credible intervals) to target support, and see how midterm scores, homework, attendance, and demographics relate to outcomes.
