Student Dropout Cost Prediction using Quantile Regression in ML
When students leave school prematurely, institutions incur both direct costs (lost tuition, administrative overhead) and indirect costs (diminished long‑term outcomes, reputational impacts). Rather than forecasting only the average cost per dropout, stakeholders need to understand the distribution of costs, from relatively low‑impact cases (25th percentile) to high‑impact cases (75th percentile).
In this project, we’ll predict the 25th, 50th, and 75th percentiles of the institutional cost per student dropout using demographic, academic, and financial features (parental income, prior GPA, attendance rate, scholarship status, and extracurricular participation).
By fitting a separate quantile regression model for each percentile, we’ll show how the drivers of dropout cost differ at the lower, median, and upper parts of the distribution. Administrators can then budget conservatively, plan typical support interventions, and provision for high‑cost scenarios.
Libraries Required
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf  # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts
Dataset
Student’s Dropout and Academic Success Dataset
Step-by-Step Code Implementation
Load & Inspect Data
We import the dataset—containing student demographics, academic metrics, support flags, and the precomputed Cost_per_Dropout (USD)—and inspect its structure to confirm data types and cost distribution.
# Load the student dropout dataset
# Source: Student's Dropout and Academic Success Dataset (Kaggle)
df = pd.read_csv("students_dropout_and_academic_success.csv")
# Inspect structure and key statistics
print(df.head())
df.info()  # info() prints its own report and returns None
print(df[['Cost_per_Dropout']].describe())
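Because the models below target the 25th, 50th, and 75th percentiles, it can help to look at those empirical quantiles of the cost before fitting anything. The sketch below uses a small synthetic stand-in (the demo frame and its lognormal values are hypothetical, not the real dataset) to show the idea.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical, right-skewed values)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"Cost_per_Dropout": rng.lognormal(mean=8.5, sigma=0.5, size=1_000)}
)

# Unconditional quartiles of the cost; the models later estimate
# these same quantiles conditional on the predictors
cost_quartiles = demo["Cost_per_Dropout"].quantile([0.25, 0.50, 0.75])
print(cost_quartiles.round(2))

# A mean above the median is a quick hint of right skew
print("mean:", round(demo["Cost_per_Dropout"].mean(), 2))
```

On the real data, the same quantile call can be applied to df['Cost_per_Dropout'] directly.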
Preprocessing & Feature Engineering
- We map the Scholarship flag to a binary indicator.
- We drop any records missing core predictors or the cost target.
- We select five predictors: Parental_Income, Prior_GPA, Attendance_Rate, Scholarship, and Extracurricular_Participation, and rename the target to Cost for brevity.
# Assume the dataset includes:
# 'Parental_Income', 'Prior_GPA', 'Attendance_Rate',
# 'Scholarship', 'Extracurricular_Participation', and 'Cost_per_Dropout'
# Convert scholarship flag to binary
df['Scholarship'] = df['Scholarship'].map({'Yes':1, 'No':0})
# Drop rows missing any core predictor or the cost target
df = df.dropna(subset=[
    'Parental_Income', 'Prior_GPA', 'Attendance_Rate',
    'Scholarship', 'Extracurricular_Participation',
    'Cost_per_Dropout'
])
# Optionally scale numeric predictors (not required for interpretability)
# Define features and response
features = [
    'Parental_Income', 'Prior_GPA', 'Attendance_Rate',
    'Scholarship', 'Extracurricular_Participation'
]
df.rename(columns={'Cost_per_Dropout':'Cost'}, inplace=True)
data = df[features + ['Cost']]
Train/Test Split
We randomly reserve 20% of the data to evaluate quantile forecasts on unseen students, which lets us check how well the models generalize.
# Hold out 20% for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)
Fit Quantile Regression Models
For each target quantile (25th, 50th, 75th):
- We build a formula string (Cost ~ Parental_Income + Prior_GPA + …).
- We fit a QuantReg model at that percentile on the training set.
- We print the coefficient table, which shows how each predictor’s marginal effect shifts across the lower, median, and upper cost distribution (e.g., Attendance_Rate may reduce the 75th‑percentile cost more than the median).
quantiles = [0.25, 0.50, 0.75]
results = {}
formula = "Cost ~ " + " + ".join(features)
for q in quantiles:
    model = smf.quantreg(formula, train)
    res = model.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])  # coefficient table only
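Reading three separate coefficient tables makes it hard to see how an effect shifts across quantiles. One option is to collect the fitted parameters into a single table with one column per quantile. The sketch below does this on synthetic data; the demo frame, its coefficients, and the names demo_results and coef_table are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical): the spread of costs widens
# as attendance falls, so the attendance effect differs by quantile
rng = np.random.default_rng(42)
n = 3_000
demo = pd.DataFrame({
    "Attendance_Rate": rng.uniform(0.5, 1.0, n),
    "Prior_GPA": rng.uniform(2.0, 4.0, n),
})
noise_scale = 3_000 * (1.5 - demo["Attendance_Rate"])
demo["Cost"] = (
    10_000
    - 4_000 * demo["Attendance_Rate"]
    - 500 * demo["Prior_GPA"]
    + noise_scale * rng.standard_normal(n)
)

# Fit one quantile regression per target quantile
demo_results = {
    q: smf.quantreg("Cost ~ Attendance_Rate + Prior_GPA", demo).fit(q=q)
    for q in (0.25, 0.50, 0.75)
}

# One column per quantile, one row per model term
coef_table = pd.DataFrame(
    {f"q={q:.2f}": res.params for q, res in demo_results.items()}
)
print(coef_table.round(1))
```

In this synthetic setup the Attendance_Rate coefficient should come out more negative at the 75th percentile than at the 25th, which is the kind of pattern the coefficient tables are meant to reveal. With the tutorial's own results dictionary, the same DataFrame construction should work unchanged.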
Evaluation with Pinball Loss
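We predict quantile‑specific costs on the test set and compute the pinball loss for each model. Pinball loss is a proper scoring rule for quantile forecasts: at quantile q it weights under‑predictions by q and over‑predictions by 1 − q. A lower pinball loss indicates a more accurate forecast of that quantile, so it is best compared between models fitted at the same quantile rather than across quantiles.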
for q, res in results.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Cost'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
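Pinball loss summarizes accuracy in a single number, but it does not directly say whether a forecast behaves like the quantile it claims to be. A simple complementary check is empirical coverage: the share of held‑out costs at or below the predicted quantile, which should be close to q. The sketch below uses hypothetical values, and empirical_coverage is an illustrative helper, not a library function.

```python
import numpy as np


def empirical_coverage(y_true, y_pred):
    """Share of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))


# Hypothetical held-out costs and hypothetical 75th-percentile forecasts
y_true = np.array([1200.0, 1800.0, 2500.0, 3100.0, 4000.0, 5200.0, 900.0, 2700.0])
y_pred_q75 = np.array([2000.0, 2100.0, 2400.0, 3500.0, 4500.0, 5000.0, 1500.0, 3000.0])

coverage = empirical_coverage(y_true, y_pred_q75)
print(f"coverage at q=0.75: {coverage:.3f}")  # 6 of 8 values covered -> 0.750
```

Inside the evaluation loop above, the same helper could be called as empirical_coverage(test['Cost'], preds) and compared against q for each model.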
Summary
Quantile regression illuminates the heterogeneous drivers of dropout cost across the cost distribution:
- The 25th‑percentile model captures factors influencing lower‑impact dropout cases (e.g., partial tuition loss).
- The median model forecasts typical dropout cost scenarios for budgeting.
- The 75th‑percentile model highlights risk factors leading to high‑impact, high‑cost student losses (e.g., students on full scholarships).
With distribution‑aware cost forecasts, educational institutions can allocate financial reserves conservatively, plan average intervention budgets, and set aside contingency funds for high‑cost dropout scenarios, which gives them a more complete picture of financial risk than a single average.