Academic Resource Allocation Cost Prediction using Stepwise Regression in ML
Universities and school districts allocate budgets for resources such as teaching staff, lab equipment, and facility maintenance based on student enrollment, instructor headcount, facility ratings, and local socioeconomic indicators. Accurately forecasting the per‑student resource cost allows administrators to plan budgets, optimize staffing, and ensure equitable access. In this project, we will predict Resource_Cost_Per_Student using features from a comprehensive school database, applying stepwise regression to identify the most significant cost drivers and to build a transparent linear model.
Libraries Required
import pandas as pd                                          # Data loading & manipulation
import numpy as np                                           # Numerical operations
import statsmodels.api as sm                                 # Statistical modeling (OLS)
from sklearn.model_selection import train_test_split         # Train/test split
from sklearn.metrics import r2_score, mean_squared_error     # Evaluation metrics
import matplotlib.pyplot as plt                              # Visualization
Dataset
School Database: Comprehensive Educational Data
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We import a comprehensive school dataset capturing enrollment, staffing, facilities, performance, and local income metrics. Initial .info() and .describe() commands verify data completeness and distributions.
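# Block 1: Load dataset
# School Database: Comprehensive Educational Data (Kaggle)
url = "https://www.kaggle.com/datasets/bernardnm/great-school/download"
# Kaggle downloads require a login, so pd.read_csv cannot read this URL directly.
# Download and extract the dataset manually, then point csv_path at the CSV file.
csv_path = "school_data.csv"  # assumed local filename; adjust to the file you downloaded
df = pd.read_csv(csv_path)
# Inspect the first rows and basic info
print(df.head())
df.info()  # prints its report directly and returns None
print(df.describe())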
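Because the later blocks assume specific column names, it can help to confirm they are present before going further. The sketch below runs on a made-up toy frame rather than the real dataset; the helper missing_columns and the constant REQUIRED_COLUMNS are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Column names taken from the assumptions stated in Block 2 (not verified
# against the actual Kaggle file).
REQUIRED_COLUMNS = [
    "Total_Students", "Total_Teachers", "Facilities_Rating",
    "Avg_Test_Score", "Median_Household_Income", "Resource_Cost_Per_Student",
]

def missing_columns(frame, required):
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Total_Students": [500, 320],
    "Total_Teachers": [25, 16],
    "Facilities_Rating": [4.0, None],
})

print("Missing columns:", missing_columns(toy, REQUIRED_COLUMNS))
print(toy.isna().sum())  # count of missing values per column
```

If the first list is not empty for the real data, the column names in the following blocks need to be adapted before the code will run.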
Data Preprocessing
We engineer the key predictor Student_Teacher_Ratio and drop any records missing essential variables, ensuring a clean modelling dataset. Predictors (X) include enrollment counts, staffing, facility and academic quality indicators, and socio‑economic context; the target (y) is per‑student resource cost. We split the data into training and test sets (80/20) for unbiased evaluation.
# Block 2: Feature engineering & cleaning
# Assume the dataset includes columns:
# 'Total_Students', 'Total_Teachers', 'Facilities_Rating', 'Avg_Test_Score', 'Median_Household_Income'
# and we have added a column 'Resource_Cost_Per_Student' (USD)
# Compute student–teacher ratio
df['Student_Teacher_Ratio'] = df['Total_Students'] / df['Total_Teachers']
# Drop rows missing critical values
df = df.dropna(subset=[
    'Total_Students', 'Total_Teachers', 'Facilities_Rating',
    'Avg_Test_Score', 'Median_Household_Income', 'Resource_Cost_Per_Student'
])
# Define predictors and target
X = df[[
    'Total_Students', 'Total_Teachers', 'Student_Teacher_Ratio',
    'Facilities_Rating', 'Avg_Test_Score', 'Median_Household_Income'
]]
y = df['Resource_Cost_Per_Student']
# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
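One edge case in the ratio calculation is a school that reports zero teachers: the division then produces an infinite value, which dropna does not remove and which OLS cannot handle. A minimal sketch of one way to deal with this, on made-up toy data, is to convert infinite ratios to missing values first. Whether such rows should be dropped or corrected in the real dataset is a judgment call not covered here.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the second school reports zero teachers.
toy = pd.DataFrame({
    "Total_Students": [500, 120, 300],
    "Total_Teachers": [25, 0, 15],
})

ratio = toy["Total_Students"] / toy["Total_Teachers"]  # 120 / 0 becomes inf
# Treat infinite ratios as missing so that dropna can remove them.
toy["Student_Teacher_Ratio"] = ratio.replace([np.inf, -np.inf], np.nan)

clean = toy.dropna(subset=["Student_Teacher_Ratio"])
print(clean)
```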
Stepwise Regression Function
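The stepwise_selection function performs a hybrid forward-backward procedure:
- Forward inclusion: adds the excluded variable with the smallest p-value, provided that p < 0.01.
- Backward elimination: removes the included variable with the largest p-value, provided that p > 0.05.
Iteration stops when neither step changes the model, yielding a parsimonious set of significant predictors. Keeping the entry threshold below the removal threshold prevents a variable from being added and dropped repeatedly.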
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    """Return predictor names chosen by forward-backward stepwise selection."""
    # A None default avoids sharing one mutable list across calls
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # Forward step: test adding each excluded predictor
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()  # NaN if nothing is excluded, so the test below fails
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")
        # Backward step: test removing each included predictor
        if included:  # skip when no predictor has been selected yet
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.iloc[1:]  # exclude intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_var = pvals.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
- We fit an Ordinary Least Squares regression on the selected features using statsmodels. The .summary() output provides coefficient estimates (USD impact per unit change), p‑values (statistical significance), R², and diagnostic statistics.
- Predictions on the held‑out test set allow computation of R² (explained variance) and RMSE (prediction error magnitude), quantifying model generalization.
# Block 4: Select features via stepwise regression
selected_features = stepwise_selection(X_train, y_train)
# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set
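# has_constant="add" forces the intercept column even if a feature is constant in the test split
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")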
y_pred = model.predict(X_test_sel)
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
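For readers who want to see what the two test metrics measure, they can be computed by hand. The sketch below uses four made-up cost values rather than model output: R² is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared error, expressed in the same unit as the target (USD).

```python
import numpy as np

# Made-up observed and predicted costs (USD), for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2 = 1.0 - ss_res / ss_tot                      # share of variance explained
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))  # typical error size, in USD

print(f"R^2 = {r2:.3f}, RMSE = {rmse:.1f}")     # R^2 = 0.992, RMSE = 10.0
```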
Residual Diagnostics
A residual plot checks for non‑random patterns or heteroscedasticity, validates core OLS assumptions, and highlights potential model misspecification.
# Block 5: Residual plot to check homoscedasticity
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per Student (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
Summary
Applying stepwise regression to school resource data isolates the statistically significant cost drivers, which may include student-teacher ratio, facilities rating, and local income, while pruning less informative variables. The resulting linear model stays interpretable because it keeps only a few significant predictors, and its predictive performance should be judged by the test-set R² and RMSE rather than assumed. When those metrics are satisfactory and the residual plot shows no clear pattern, the model gives educational administrators a transparent forecasting tool to budget per‑student resources more accurately and allocate funds efficiently.