Exam Performance Prediction using Bayesian Regression in ML


Academic advisors and institutions need to forecast a student's final exam score before the term ends, using early indicators such as midterm exam score, homework completion rate, attendance percentage, and test-preparation course completion. The relationship between these predictors and final grades is noisy, and the uncertainty varies with student engagement and external factors. Bayesian Regression produces not only a point estimate of the final score but also credible intervals that quantify that uncertainty, which supports targeted interventions for at-risk students and better-informed resource allocation.

Libraries Required

import pandas as pd                              # data loading & handling  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # enhanced visualization  

import pymc3 as pm                               # Bayesian modeling  
import arviz as az                               # posterior analysis  

from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_absolute_error 

Dataset

Students Performance in Exams

Step-by-Step Code Implementation

Import Libraries & Load Data

import pandas as pd

# Load dataset
df = pd.read_csv("data/students-performance-in-exams/StudentsPerformance.csv")

# Rename the score columns (math score stands in for the midterm,
# writing score for the final), then preview the columns used below
df = df.rename(columns={
    'math score':    'midterm_score',
    'reading score': 'reading_score',
    'writing score': 'final_score'
})
df[['midterm_score','reading_score','final_score',
    'gender','race/ethnicity','parental level of education',
    'test preparation course']].head()

Preprocessing & Train/Test Split

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

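# Encode categorical predictors as 0/1 floats
# (recent pandas versions default to bool, which breaks the matrix maths later)
df = pd.get_dummies(df, columns=[
    'gender','race/ethnicity','parental level of education',
    'test preparation course'
], drop_first=True, dtype=float)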

# Define features & target
X = df.drop(columns=['final_score'])
y = df['final_score'].values

# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize numeric columns
num_cols = ['midterm_score','reading_score']
scaler = StandardScaler().fit(X_train[num_cols])
X_train_s = X_train.copy()
X_train_s[num_cols] = scaler.transform(X_train[num_cols])
X_test_s  = X_test.copy()
X_test_s[num_cols] = scaler.transform(X_test[num_cols])
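
Because the two score columns are standardized, any coefficient on them is measured per standard deviation rather than per point. The sketch below uses synthetic data (the values and variable names are illustrative assumptions, not taken from the dataset) to show how a slope fitted on a standardized predictor can be mapped back to original units by dividing by the training standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "midterm" and "final" scores; illustrative values only
midterm = rng.normal(66, 15, size=500)
final = 10 + 0.8 * midterm + rng.normal(0, 5, size=500)

# Standardize with the same formula StandardScaler uses (population std, ddof=0)
mu, sd = midterm.mean(), midterm.std()
midterm_s = (midterm - mu) / sd

# Ordinary least squares on the standardized predictor
slope_std, intercept_std = np.polyfit(midterm_s, final, deg=1)

# Convert back to "points of final score per point of midterm score"
slope_orig = slope_std / sd
intercept_orig = intercept_std - slope_std * mu / sd

print(f"per-SD slope: {slope_std:.2f}, per-point slope: {slope_orig:.2f}")
```

The same conversion applies to posterior draws of β for the standardized columns: dividing each draw by the matching entry of scaler.scale_ gives the effect per score point.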

Define & Fit Bayesian Regression Model

Priors (α, β, σ):

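  • α ~ Normal(50, 20): centred at 50 as a typical score, with a wide standard deviation of 20 so the intercept can shift.
  • β ~ Normal(0, 10) for each predictor: a weakly informative prior centred on no effect that allows sizeable effects of either sign.
  • σ ~ HalfNormal(10): keeps the residual scale positive.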
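
One way to sanity-check these priors before fitting is a prior predictive simulation: draw parameters from the priors alone and look at the scores they imply. The following is a minimal NumPy sketch under stated assumptions: the number of features (3) and the hypothetical student whose standardized features are all +1 are arbitrary illustrative choices, not part of the tutorial's model.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims, n_features = 5000, 3   # n_features is an arbitrary illustrative choice

# Draw parameters from the stated priors
alpha = rng.normal(50, 20, size=n_sims)                # α ~ Normal(50, 20)
beta = rng.normal(0, 10, size=(n_sims, n_features))    # β ~ Normal(0, 10)
sigma = np.abs(rng.normal(0, 10, size=n_sims))         # σ ~ HalfNormal(10)

# One hypothetical student whose standardized features are all +1 SD
x = np.ones(n_features)

# Simulated final scores implied by the priors alone
mu = alpha + beta @ x
y_sim = rng.normal(mu, sigma)

lo, hi = np.percentile(y_sim, [3, 97])
outside = np.mean((y_sim < 0) | (y_sim > 100))
print(f"94% prior predictive interval: [{lo:.0f}, {hi:.0f}]")
print(f"Share of simulated scores outside 0-100: {outside:.0%}")
```

If a large share of simulated scores falls outside the valid 0 to 100 range, that suggests the priors are looser than needed, although with 800 training rows the likelihood will usually dominate them.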

Model: Linear predictor μ = α + β·X_standardized links predictors (midterm, reading, demographics, test‑prep) to final scores.

Likelihood: Observed final_score ∼ Normal(μ,σ).

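Sampling: 2,000 posterior draws per chain after 1,000 tuning steps, with target_accept=0.9 to reduce divergences. These settings support convergence but do not guarantee it, so check the r_hat column in the posterior summary.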
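
Convergence is usually judged by comparing chains with each other. As an illustration of the idea, here is a sketch of the classic Gelman-Rubin R-hat statistic in plain NumPy. Note the assumption: ArviZ reports a more robust rank-normalized split variant, so its numbers will not match this simplified formula exactly, and the helper name classic_rhat is hypothetical.

```python
import numpy as np

def classic_rhat(chains: np.ndarray) -> float:
    """Classic Gelman-Rubin statistic for an array of shape (n_chains, n_draws)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # average within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(1)

# Four chains exploring the same distribution: R-hat should sit near 1
mixed = rng.normal(0.0, 1.0, size=(4, 2000))

# Four chains stuck at different locations: R-hat should be well above 1
stuck = mixed + np.array([[0.0], [1.0], [2.0], [3.0]])

print(f"well-mixed chains: {classic_rhat(mixed):.3f}")
print(f"stuck chains:      {classic_rhat(stuck):.3f}")
```

Values close to 1 indicate that the chains agree; values noticeably above 1 mean the sampler should run longer or the model needs reparameterizing.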

import pymc3 as pm

with pm.Model() as exam_model:
    # Register the design matrix as named data so pm.set_data({"X": ...})
    # can swap in new rows for prediction later
    X_data = pm.Data("X", X_train_s.values.astype(float))

    # Priors
    α = pm.Normal("α", mu=50, sigma=20)
    β = pm.Normal("β", mu=0, sigma=10, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=10)
    
    # Linear predictor
    μ = α + pm.math.dot(X_data, β)
    
    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=y_train)
    
    # MCMC sampling
    trace = pm.sample(
        draws=2000, tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )

Posterior Analysis & Point Predictions

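  • Posterior predictive samples of Y_obs give a full distribution over final scores rather than a single number; the call below draws them for the training students.
  • Plugging the posterior means of α and β into the linear predictor yields point forecasts; MAE measures their average error on held‑out data.

import arviz as az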

# Posterior summary
az.summary(trace, round_to=2)

# Posterior predictive sampling
with exam_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

# Posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values

# Predict on standardized test set (cast to float in case dummies are bool)
y_pred = α_post + X_test_s.values.astype(float).dot(β_post)

# Evaluate MAE
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: {mae:.2f} points")
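
MAE only scores the point forecasts. To check the uncertainty estimates as well, one option is to build a predictive interval for each held-out student and count how often the true score falls inside. The sketch below is self-contained and uses stand-in posterior draws and synthetic test data (all numbers are illustrative assumptions); in the tutorial the draws would come from trace.posterior. It uses equal-tailed percentile intervals, which are simpler than the HDI that ArviZ computes.

```python
import numpy as np

rng = np.random.default_rng(7)
n_draws, n_test, n_features = 4000, 200, 2

# Stand-in posterior draws (illustrative values, not fitted results)
alpha_draws = rng.normal(68.0, 0.3, size=n_draws)
beta_draws = rng.normal([6.0, 8.0], 0.3, size=(n_draws, n_features))
sigma_draws = np.abs(rng.normal(4.0, 0.1, size=n_draws))

# Synthetic standardized test features and outcomes from matching "true" values
X_new = rng.normal(size=(n_test, n_features))
y_new = 68.0 + X_new @ np.array([6.0, 8.0]) + rng.normal(0, 4.0, size=n_test)

# One predictive draw per posterior draw and student: shape (n_draws, n_test)
mu_draws = alpha_draws[:, None] + beta_draws @ X_new.T
y_rep = rng.normal(mu_draws, sigma_draws[:, None])

# Equal-tailed 94% interval per student and its empirical coverage
lower, upper = np.percentile(y_rep, [3, 97], axis=0)
coverage = np.mean((y_new >= lower) & (y_new <= upper))
print(f"Empirical coverage of 94% intervals: {coverage:.1%}")
```

Coverage far below the nominal 94% would suggest the model is overconfident; coverage far above it would suggest the intervals are wider than necessary.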

Visualise Posterior Predictive & Credible Intervals

Plotting the final vs midterm relationship with 94% credible bands shows both central tendency and uncertainty, guiding academic support decisions.
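
The bands in this plot are highest density intervals (HDI): the narrowest interval that contains the requested share of the draws. The following sketch shows the idea for a single unimodal sample in NumPy. It is an illustration only; the helper name hdi_1d is hypothetical, and ArviZ's own implementation may differ in details such as how the number of points in the window is rounded.

```python
import numpy as np

def hdi_1d(samples, prob=0.94):
    """Narrowest interval containing `prob` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(prob * n))            # number of points inside the interval
    widths = x[k - 1:] - x[: n - k + 1]   # width of every window of k points
    i = int(np.argmin(widths))            # start index of the narrowest window
    return float(x[i]), float(x[i + k - 1])

rng = np.random.default_rng(3)

# A skewed sample makes the difference from equal-tailed intervals visible
skewed = rng.exponential(scale=5.0, size=20000)

hdi_lo, hdi_hi = hdi_1d(skewed)
eq_lo, eq_hi = np.percentile(skewed, [3, 97])

print(f"HDI:          [{hdi_lo:.2f}, {hdi_hi:.2f}]")
print(f"Equal-tailed: [{eq_lo:.2f}, {eq_hi:.2f}]")
```

For symmetric distributions the two interval types nearly coincide; for skewed ones the HDI is shorter and shifts toward the region of highest density.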

# Vary midterm_score, hold other features at median
mid_grid = np.linspace(
    X_train_s['midterm_score'].min(),
    X_train_s['midterm_score'].max(), 50
)
median_vals = X_train_s.median()

grid = pd.DataFrame(
    np.repeat(median_vals.values[None,:], 50, axis=0),
    columns=X_train_s.columns
)
grid['midterm_score'] = mid_grid

with exam_model:
    pm.set_data({"X": grid.values})
    ppc_grid = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])

preds = ppc_grid["Y_obs"]
mean_pred = preds.mean(axis=0)
hpd = az.hdi(preds, hdi_prob=0.94)

# Transform midterm back
mid_orig = scaler.inverse_transform(
    np.column_stack([mid_grid, grid['reading_score']])
)[:,0]

plt.figure(figsize=(8,5))
plt.plot(mid_orig, mean_pred, label="Posterior mean")
plt.fill_between(mid_orig, hpd[:,0], hpd[:,1],
                 alpha=0.3, label="94% CI")
plt.scatter(X_test['midterm_score'], y_test,
            color='k', alpha=0.5, label="Test data")
plt.xlabel("Midterm Score")
plt.ylabel("Final Exam Score")
plt.title("Bayesian Regression: Final vs. Midterm Scores")
plt.legend()
plt.show()

Summary

This Bayesian Regression workflow for exam‑score prediction provides:

1. Point forecasts of final exam scores from early academic indicators.

2. Credible intervals quantifying forecast uncertainty—critical for identifying students needing intervention.

3. Actionable insights: educators can use both expected scores and uncertainty ranges to tailor support, allocate tutoring resources, and improve academic outcomes.
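
As a sketch of how the uncertainty can drive interventions, posterior predictive draws can be turned into a per-student probability of scoring below a pass mark. Everything specific here is an assumption for illustration: the stand-in draws, the pass mark of 60, and the 30% risk cutoff are hypothetical and should be replaced with real predictive draws and institution-specific thresholds.

```python
import numpy as np

rng = np.random.default_rng(11)

# Stand-in posterior predictive draws: rows are draws, columns are students
# (illustrative means and spread, not fitted results)
student_means = np.array([45.0, 58.0, 62.0, 75.0, 90.0])
y_rep = rng.normal(student_means, 6.0, size=(4000, student_means.size))

PASS_MARK = 60.0      # hypothetical pass mark
RISK_CUTOFF = 0.30    # hypothetical probability above which a student is flagged

# Share of predictive draws below the pass mark, per student
p_below = (y_rep < PASS_MARK).mean(axis=0)

# Indices of students whose risk exceeds the cutoff
flagged = np.flatnonzero(p_below >= RISK_CUTOFF)

for idx, p in enumerate(p_below):
    marker = "flag" if idx in flagged else "ok"
    print(f"student {idx}: P(score < {PASS_MARK:.0f}) = {p:.2f} [{marker}]")
```

Flagging on a probability rather than on the point forecast alone means a student predicted slightly above the pass mark, but with wide uncertainty, can still be offered support.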

ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.
