School Performance Quantile Prediction in ML


Standard regression models predict average student performance, but educators often need to understand how interventions affect different points in the score distribution, for example raising the 25th percentile or pushing top performers further.

In this project, we will predict quantiles (for example, the 10th, 50th, and 90th percentiles) of students’ math scores from demographic and academic inputs (gender, parental education, test preparation, and reading and writing scores).

By applying quantile regression, we’ll build a separate linear model for each quantile, uncovering how the effects of predictors vary across the performance spectrum. This helps schools tailor support for struggling students while nurturing high achievers.

Libraries Required

import pandas as pd                 # Data loading & manipulation  
import numpy as np                  # Numerical operations  
import statsmodels.formula.api as smf  # Quantile regression  
from sklearn.model_selection import train_test_split  # Train/test split  
from sklearn.metrics import mean_pinball_loss         # Evaluation for quantiles

Dataset

Students Performance in Exams

Step-by-Step Code Implementation

Load & Inspect Data

We load demographic and exam scores for ~1000 students, then inspect types and basic statistics to understand distributions (e.g., math scores range ~0–100).

# Load the “Students Performance in Exams” dataset
# Source: Kaggle
df = pd.read_csv("StudentsPerformance.csv")

# Quick inspection
print(df.head())
print(df.info())
print(df.describe())

Preprocessing

We map gender to binary and one‑hot encode multi‑category fields (race/ethnicity, parental education, lunch program, test prep) to prepare for regression.
We separate predictors X from the target y (math score).

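import re  # Used below to make column names formula-safe

# Map gender to 0/1 and one-hot encode the remaining categorical columns
df['gender'] = df['gender'].map({'female':0, 'male':1})
df = pd.get_dummies(df, 
                    columns=['race/ethnicity','parental level of education',
                             'lunch','test preparation course'],
                    drop_first=True, dtype=int)

# Names such as "reading score" or "race/ethnicity_group B" contain spaces,
# slashes, and apostrophes that break a statsmodels formula, so replace
# those characters with underscores (the target name is left unchanged)
df = df.rename(columns=lambda c: c if c == 'math score'
               else re.sub(r'\W+', '_', c))

# Define predictors and target
features = [c for c in df.columns if c != 'math score']
X = df[features]
y = df['math score']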
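
To see what drop_first=True does in the encoding step, here is a minimal sketch on a tiny made-up table. The toy frame and its values are illustrative assumptions, not rows from the dataset. With two lunch categories, only one indicator column is kept, and the dropped category becomes the baseline the coefficient is measured against.

```python
import pandas as pd

# Hypothetical three-row table, for illustration only
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard"],
    "score": [70, 55, 82],
})

# Categories are sorted alphabetically; drop_first removes the first one
# ("free/reduced"), leaving a single 0/1 column for "standard"
encoded = pd.get_dummies(toy, columns=["lunch"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level per category avoids perfectly collinear columns once the regression adds an intercept.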

Train/Test Split

We hold out 20% of the data to evaluate the quantile model performance on unseen students.

# 80/20 split for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Fit Quantile Regression Models

For each chosen quantile (10th, 50th, 90th percentile):

We build a statsmodels formula relating math score to all predictors.
We fit a QuantReg model at that quantile.
Printing .summary() reveals coefficient estimates and significance for each predictor at that point in the score distribution (e.g., “test preparation” might have a larger effect at the lower quantiles).

# Specify quantiles of interest
quantiles = [0.1, 0.5, 0.9]
models = {}

# Build the formula and training frame once: math score ~ all predictors
formula = "Q('math score') ~ " + " + ".join(features)
train_df = pd.concat([X_train, y_train], axis=1)

for q in quantiles:
    # Fit quantile regression at quantile q
    res = smf.quantreg(formula, train_df).fit(q=q)
    models[q] = res
    print(f"\n=== Quantile {q:.0%} Summary ===")
    print(res.summary())
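
Reading three separate summaries makes it hard to compare effects, so it can help to place the fitted coefficients side by side. The sketch below does this on synthetic data. The demo frame, its column names, and the noise model are assumptions chosen for illustration, not the Kaggle data. The noise grows with the reading score, so the slope should rise from the 10th to the 90th percentile.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, illustrative data: the spread of math_score widens as
# reading_score increases, so the quantile slopes should differ
rng = np.random.default_rng(0)
n = 2000
reading = rng.uniform(20, 100, size=n)
noise = rng.normal(0, 1, size=n) * (0.1 * reading)
demo = pd.DataFrame({
    "reading_score": reading,
    "math_score": 10 + 0.8 * reading + noise,
})

quantiles = [0.1, 0.5, 0.9]
fits = {q: smf.quantreg("math_score ~ reading_score", demo).fit(q=q)
        for q in quantiles}

# One column per quantile, one row per coefficient
coef_table = pd.DataFrame({f"q={q}": res.params for q, res in fits.items()})
print(coef_table.round(3))
```

The same pattern should work with the models dictionary from the loop above, since each fitted result exposes its coefficients through .params.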

Evaluation with Pinball Loss

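We predict on the test set and compute the pinball loss, the appropriate error metric for quantile forecasts.

For a quantile q, it weights under-prediction by q and over-prediction by 1 − q, so it quantifies the average penalty for missing in either direction and gives a single score for each quantile model. Lower values mean a better forecast for that quantile.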

from sklearn.metrics import mean_pinball_loss

for q, res in models.items():
    # Predict on test set
    y_pred = res.predict(X_test)
    # Compute pinball loss for this quantile
    loss = mean_pinball_loss(y_test, y_pred, alpha=q)
    print(f"Quantile {q:.0%} Pinball Loss: {loss:.2f}")
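
To make the metric concrete, the pinball loss can be written by hand in a few lines. This is a minimal sketch: pinball_loss is a hypothetical helper written for illustration, not the scikit-learn implementation, and the scores are made up. Every prediction here is 5 points too low, which costs little at the 10th percentile but a lot at the 90th.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile level q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per point;
    # over-prediction (diff < 0) costs (1 - q) per point
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Made-up scores: each prediction is 5 points below the true value
y_true = np.array([50.0, 60.0, 70.0, 80.0])
y_pred = np.array([45.0, 55.0, 65.0, 75.0])

loss_q10 = pinball_loss(y_true, y_pred, 0.1)
loss_q90 = pinball_loss(y_true, y_pred, 0.9)
print(f"q=0.1: {loss_q10:.2f}, q=0.9: {loss_q90:.2f}")
```

Because the weights differ by quantile, losses are most informative when comparing alternative models at the same quantile.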

Summary

By using quantile regression on student performance data, we have seen how various factors like parental education, test preparation, and demographic variables differentially influence the lower, median, and upper ends of the math score distribution.

Unlike ordinary least squares, quantile regression provides tail‑specific insights, for instance revealing which interventions most effectively boost struggling students versus those who are already high performers.

The resulting set of interpretable linear models equips educators and policymakers with targeted strategies to improve equity and excellence across the full spectrum of student achievement.
