Student Dropout Cost Prediction using Quantile Regression in ML


When students leave school prematurely, institutions incur both direct costs (lost tuition, administrative overhead) and indirect costs (diminished long-term outcomes, reputational impacts). Rather than forecasting only the average cost per dropout, stakeholders need to understand the distribution of costs, from relatively low-impact cases (25th percentile) through typical cases (median) to high-impact cases (75th percentile).

In this project, we’ll predict the 25th, 50th, and 75th percentiles of the institutional cost per student dropout using demographic, academic, and financial features (parental income, prior GPA, attendance rate, scholarship status).

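By fitting a separate quantile regression model for each percentile, we'll see how each driver's effect differs across the lower, median, and upper parts of the cost distribution. Administrators can then budget for low-cost cases, plan typical support interventions, and set aside reserves for high-cost cases. Note that the 75th percentile is an upper-quartile estimate rather than a true worst case: by definition, about a quarter of dropouts are expected to cost more.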
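To make the motivation concrete, the sketch below uses synthetic, right-skewed costs (a lognormal sample chosen purely for illustration, not the project's data) to show how the mean can sit well above the median, which is why a single average can mislead.

```python
import numpy as np

# Synthetic right-skewed "costs" (hypothetical values, not from the dataset)
rng = np.random.default_rng(0)
costs = rng.lognormal(mean=8.0, sigma=0.8, size=10_000)

mean_cost = costs.mean()
q25, q50, q75 = np.quantile(costs, [0.25, 0.50, 0.75])

print(f"mean: {mean_cost:,.0f}")
print(f"25th / 50th / 75th percentiles: {q25:,.0f} / {q50:,.0f} / {q75:,.0f}")
```

With skew like this, a budget based on the mean alone says little about either the typical case or the expensive tail.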

Libraries Required

import pandas as pd  
import numpy as np  
import statsmodels.formula.api as smf   # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts

Dataset

Student’s Dropout and Academic Success Dataset

Step-by-Step Code Implementation

Load & Inspect Data

We import the dataset—containing student demographics, academic metrics, support flags, and the precomputed Cost_per_Dropout (USD)—and inspect its structure to confirm data types and cost distribution.

# Load the student dropout dataset
# Source: Student's Dropout and Academic Success Dataset (Kaggle)
df = pd.read_csv("students_dropout_and_academic_success.csv")

# Inspect structure and key statistics
print(df.head())
df.info()  # prints its own summary and returns None, so no print() is needed
print(df[['Cost_per_Dropout']].describe())

Preprocessing & Feature Engineering

We map the Scholarship flag to a binary indicator.
We drop any records missing core predictors or the cost target.
We select five predictors: Parental_Income, Prior_GPA, Attendance_Rate, Scholarship, and Extracurricular_Participation, and rename the target to Cost for brevity.

# Assume the dataset includes:
# 'Parental_Income', 'Prior_GPA', 'Attendance_Rate',
# 'Scholarship', 'Extracurricular_Participation', and 'Cost_per_Dropout'

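# Convert scholarship flag to binary (assumes the raw values are exactly 'Yes'/'No';
# any other value maps to NaN and is removed by the dropna below)
df['Scholarship'] = df['Scholarship'].map({'Yes':1, 'No':0})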

# Drop rows missing any predictor or the cost target
df = df.dropna(subset=[
    'Parental_Income','Prior_GPA','Attendance_Rate',
    'Scholarship','Extracurricular_Participation',
    'Cost_per_Dropout'
])

# Scaling numeric predictors is optional; unscaled coefficients stay in the original units
# Define features and response
features = [
    'Parental_Income','Prior_GPA','Attendance_Rate',
    'Scholarship','Extracurricular_Participation'
]
df.rename(columns={'Cost_per_Dropout':'Cost'}, inplace=True)
data = df[features + ['Cost']]
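Because Series.map returns NaN for any value missing from the mapping dictionary, a differently coded flag (for example 'yes' or 'Y') would be turned into NaN and then silently removed by dropna. A minimal sketch of a guard, using a small made-up frame and a hypothetical helper named encode_flag (neither is part of the original project):

```python
import pandas as pd

def encode_flag(series, mapping):
    """Map a categorical flag to numbers, failing loudly on unexpected labels."""
    # Normalize whitespace and capitalization before mapping
    cleaned = series.str.strip().str.title()
    encoded = cleaned.map(mapping)
    # Labels that were present but not covered by the mapping
    unmapped = cleaned[encoded.isna() & cleaned.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels: {list(unmapped)}")
    return encoded

# Hypothetical raw values, including a genuinely missing entry
demo = pd.DataFrame({"Scholarship": ["Yes", "no ", None, "YES"]})
demo["Scholarship"] = encode_flag(demo["Scholarship"], {"Yes": 1, "No": 0})
print(demo)
```

Genuinely missing entries stay missing and are handled by dropna as before; only unexpected labels raise an error, so a coding mismatch is noticed instead of shrinking the dataset.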

Train/Test Split

We randomly reserve 20% of the data so we can evaluate the quantile forecasts on students the models have not seen, which gives a check on how well they generalize.

# Hold out 20% for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)

Fit Quantile Regression Models

For each target quantile (25th, 50th, 75th):

We build a formula string (Cost ~ Parental_Income + Prior_GPA + …).
We fit a QuantReg model at that percentile on the training set.
We print the coefficient table, which shows how each predictor’s marginal effect shifts across the lower, median, and upper cost distribution (e.g., Attendance_Rate may reduce the 75th‑percentile cost more than the median).

quantiles = [0.25, 0.50, 0.75]
results   = {}
formula   = "Cost ~ " + " + ".join(features)

for q in quantiles:
    model = smf.quantreg(formula, train)
    res   = model.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])   # coefficient table only
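Because each quantile is fitted as a separate model, nothing forces the predicted 25th percentile to stay below the predicted median for every student, so the predictions can cross. The sketch below shows one way to detect this and a simple remedy (sorting each row), assuming the predictions are arranged as columns in ascending quantile order. The numbers and the helper names count_crossings and fix_crossings are made up for illustration.

```python
import numpy as np

def count_crossings(preds):
    """Count rows whose predictions decrease somewhere across the quantile columns."""
    return int((np.diff(preds, axis=1) < 0).any(axis=1).sum())

def fix_crossings(preds):
    """Sort each row so the quantile predictions are monotone (a simple remedy)."""
    return np.sort(preds, axis=1)

# Hypothetical predictions: columns are the 25th, 50th, and 75th percentiles
preds = np.array([
    [1000.0, 1500.0, 2200.0],
    [1800.0, 1700.0, 2500.0],   # 25th above 50th: a crossing
    [ 900.0, 1200.0, 1100.0],   # 50th above 75th: a crossing
])

print("rows with crossings:", count_crossings(preds))
print(fix_crossings(preds))
```

With the models above, such an array could be built by stacking results[q].predict(...) for each q in quantiles, for example with np.column_stack.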

Evaluation with Pinball Loss

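We predict quantile-specific costs on the test set and compute the pinball loss for each model. Pinball loss is a proper scoring rule for quantile forecasts that penalizes under- and over-predictions asymmetrically, with the asymmetry set by the quantile level. A lower pinball loss at a given quantile indicates a better forecast of that quantile; losses at different quantile levels are not directly comparable with each other.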

for q, res in results.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['Cost'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
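For reference, the pinball loss at quantile level q for a true value y and a prediction y_hat is max(q * (y - y_hat), (q - 1) * (y - y_hat)). The sketch below implements that definition in NumPy (the function name pinball_loss is ours, not a library API) to show the asymmetry: at q = 0.75, predicting too low is penalized three times as heavily as predicting too high by the same amount.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss at level q, computed from its definition."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

under = pinball_loss([10.0], [8.0], q=0.75)    # prediction too low by 2
over = pinball_loss([10.0], [12.0], q=0.75)    # prediction too high by 2
print(under, over)
```

This should agree with mean_pinball_loss(y_true, y_pred, alpha=q) from scikit-learn, which is what the evaluation loop uses.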

Summary

Quantile regression illuminates the heterogeneous drivers of dropout cost across the cost distribution:

The 25th‑percentile model captures factors influencing lower‑impact dropout cases (e.g., partial tuition loss).
The median model forecasts typical dropout cost scenarios for budgeting.
The 75th‑percentile model highlights risk factors leading to high‑impact, high‑cost student losses (e.g., students on full scholarships).

With distribution-aware cost forecasts, educational institutions can size baseline reserves, plan budgets for typical interventions, and set aside contingency funds for high-cost dropout cases, which gives a fuller picture of financial risk than a single average.
