School Performance Quantile Prediction in ML
Standard regression models predict the average student performance, but educators often need to understand how interventions affect different points in the score distribution—e.g., raising the bottom 25th percentile or pushing top performers further.
In this project, we predict quantiles (for example, the 10th, 50th, and 90th percentiles) of students' math scores from demographic and academic inputs (gender, parental education, test preparation, and reading and writing scores).
By applying quantile regression, we build a separate linear model for each quantile, uncovering heterogeneous effects of predictors across the performance spectrum. This helps schools tailor support for struggling students while nurturing high achievers.
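For reference, the per-quantile linear model described above is usually written as the minimizer of the check (pinball) loss. The notation here ($\tau$ for the quantile level, $x_i$ and $y_i$ for a student's predictors and math score, $\beta$ for the coefficients) is introduced for illustration and is not taken from the original text:

```latex
\hat{\beta}_\tau = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^\top \beta\right),
\qquad
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr)
```

At $\tau = 0.5$ the loss reduces to $\tfrac{1}{2}|u|$, so the median model is a least-absolute-deviations fit. For other values of $\tau$, under- and over-predictions are weighted asymmetrically.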
Libraries Required
import pandas as pd                                    # Data loading & manipulation
import numpy as np                                     # Numerical operations
import statsmodels.formula.api as smf                  # Quantile regression
from sklearn.model_selection import train_test_split   # Train/test split
from sklearn.metrics import mean_pinball_loss          # Evaluation for quantiles
Dataset
Step-by-Step Code Implementation
Load & Inspect Data
We load demographic data and exam scores for ~1000 students, then inspect column types and basic statistics to understand the distributions (e.g., math scores range ~0–100).
# Load the “Students Performance in Exams” dataset
# Source: Kaggle
df = pd.read_csv("StudentsPerformance.csv")
# Quick inspection
print(df.head())
print(df.info())
print(df.describe())
Preprocessing
- We map gender to a binary indicator and one-hot encode the remaining categorical fields (race/ethnicity, parental education, lunch program, test prep) to prepare for regression.
- We separate predictors X from the target y (math score).
# Map gender to a binary indicator and one-hot encode the remaining categorical columns
df['gender'] = df['gender'].map({'female': 0, 'male': 1})
df = pd.get_dummies(
    df,
    columns=['race/ethnicity', 'parental level of education',
             'lunch', 'test preparation course'],
    drop_first=True,
    dtype=int,  # 0/1 integers rather than booleans
)
# Define predictors and target
features = [c for c in df.columns if c != 'math score']
X = df[features]
y = df['math score']
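To make the encoding step concrete, here is a minimal sketch on a three-row toy frame. The rows and the name `toy` are hypothetical and only illustrate how `drop_first=True` leaves a single indicator column for a two-level field.

```python
import pandas as pd

# Hypothetical three-row sample; values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
})

# Binary mapping for gender, as in the preprocessing step above
toy["gender"] = toy["gender"].map({"female": 0, "male": 1})

# One-hot encode 'lunch'; drop_first removes the first level (alphabetically),
# which then acts as the reference category in the regression
encoded = pd.get_dummies(toy, columns=["lunch"], drop_first=True, dtype=int)

print(encoded)
```

The dropped level ("free/reduced" here) becomes the baseline, so the coefficient on `lunch_standard` is read as the difference relative to that group.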
Train/Test Split
We hold out 20% of the data to evaluate the quantile model performance on unseen students.
# 80/20 split for evaluation
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
Fit Quantile Regression Models
For each chosen quantile (10th, 50th, 90th percentile):
- We build a statsmodels formula relating math score to all predictors.
- We fit a QuantReg model at that quantile.
- Printing .summary() reveals coefficient estimates and significance for each predictor at that point in the score distribution (e.g., “test preparation” might have a larger effect at the lower quantiles).
# Specify quantiles of interest
quantiles = [0.1, 0.5, 0.9]
models = {}
# Build formula string: math score ~ all predictors.
# Q("...") quotes column names that contain spaces, slashes, or apostrophes.
formula = 'Q("math score") ~ ' + " + ".join(f'Q("{c}")' for c in features)
train_df = pd.concat([X_train, y_train], axis=1)
for q in quantiles:
    # Fit quantile regression at quantile q
    res = smf.quantreg(formula, train_df).fit(q=q)
    models[q] = res
    print(f"\n=== Quantile {q:.0%} Summary ===")
    print(res.summary())
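Reading three separate summaries side by side can be tedious. One option is to collect the fitted coefficients into a single table with one column per quantile. The sketch below uses synthetic data (an assumption made so the example runs on its own) in which the noise grows with the predictor, so the slope should rise from the lower to the upper quantile. The names `demo`, `fits`, and `coef_table` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption for illustration): noise grows with x,
# so the slope should increase from low to high quantiles
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
y = 50 + 2 * x + rng.normal(0, 1 + 0.5 * x, n)
demo = pd.DataFrame({"x": x, "y": y})

# Fit one quantile regression per quantile level
quantiles = [0.1, 0.5, 0.9]
fits = {q: smf.quantreg("y ~ x", demo).fit(q=q) for q in quantiles}

# One column per quantile, one row per coefficient
coef_table = pd.DataFrame({f"q={q}": res.params for q, res in fits.items()})
print(coef_table.round(2))
```

The same pattern should apply to the `models` dictionary built above: replace `fits` with `models` to compare, for example, the test preparation coefficient across the 10th, 50th, and 90th percentiles.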
Evaluation with Pinball Loss
We predict on the test set and compute pinball loss, the appropriate error metric for quantile forecasts.
It quantifies the average penalty for under- and over-prediction at a given quantile level, so each quantile model gets its own score, with lower values indicating better forecasts.
from sklearn.metrics import mean_pinball_loss
for q, res in models.items():
    # Predict on test set
    y_pred = res.predict(X_test)
    # Compute pinball loss for this quantile
    loss = mean_pinball_loss(y_test, y_pred, alpha=q)
    print(f"Quantile {q:.0%} Pinball Loss: {loss:.2f}")
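To show what the metric measures, here is a minimal NumPy sketch of the pinball loss. The function name `pinball_loss` and the scores are hypothetical. It follows the standard definition, which is also what `mean_pinball_loss` computes when `alpha` is set to the quantile level.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q (0 < q < 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit;
    # over-prediction (diff < 0) costs (1 - q) per unit
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Hypothetical scores, for illustration only
y_true = [60, 70, 80]
y_pred = [65, 65, 65]

for q in (0.1, 0.5, 0.9):
    print(q, round(pinball_loss(y_true, y_pred, q), 3))
```

With the same predictions, the loss changes with `q` because a high quantile level penalizes under-prediction more heavily, while a low one penalizes over-prediction more heavily.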
Summary
By using quantile regression on student performance data, we have seen how various factors like parental education, test preparation, and demographic variables differentially influence the lower, median, and upper ends of the math score distribution.
Unlike ordinary least squares, quantile regression provides tail-specific insights, for instance revealing which interventions appear to most effectively boost struggling students versus those who are already high performers.
The resulting set of interpretable linear models equips educators and policymakers with targeted strategies to improve equity and excellence across the full spectrum of student achievement.