Academic Resource Allocation Cost Prediction using Stepwise Regression in ML

Universities and school districts budget for resources such as teaching staff, lab equipment, and facility maintenance based on student enrollment, instructor headcount, facility ratings, and local socioeconomic indicators. Accurately forecasting the per‑student resource cost helps administrators plan budgets, optimize staffing, and ensure equitable access. In this project, we predict Resource_Cost_Per_Student from features in a comprehensive school database, applying stepwise regression to identify the most significant cost drivers and to build a transparent linear model.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Statistical modeling (OLS)  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

School Database: Comprehensive Educational Data

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We import a comprehensive school dataset capturing enrollment, staffing, facilities, performance, and local income metrics. Initial .info() and .describe() commands verify data completeness and distributions.

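# Block 1: Load dataset
# School Database: Comprehensive Educational Data (Kaggle):
# https://www.kaggle.com/datasets/bernardnm/great-school/download
# The Kaggle link typically requires a login and returns an archive, so
# download and extract it first. "school_data.csv" is a placeholder file name.
csv_path = "school_data.csv"
df = pd.read_csv(csv_path)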

# Inspect the first rows and basic info
print(df.head())
df.info()  # info() prints its report directly and returns None
print(df.describe())

Data Preprocessing

We engineer the key predictor Student_Teacher_Ratio and drop any records missing essential variables, ensuring a clean modelling dataset. Predictors (X) include enrollment counts, staffing, facility and academic quality indicators, and socio‑economic context; the target (y) is per‑student resource cost. We split the data into training and test sets (80/20) for unbiased evaluation.

# Block 2: Feature engineering & cleaning
# Assume the dataset includes columns:
# 'Total_Students', 'Total_Teachers', 'Facilities_Rating', 'Avg_Test_Score', 'Median_Household_Income'
# and we have added a column 'Resource_Cost_Per_Student' (USD)

# Compute student–teacher ratio (zero teacher counts become NaN instead of inf)
df['Student_Teacher_Ratio'] = df['Total_Students'] / df['Total_Teachers'].replace(0, np.nan)

# Drop rows missing critical values, including an undefined ratio
df = df.dropna(subset=[
    'Total_Students','Total_Teachers','Student_Teacher_Ratio','Facilities_Rating',
    'Avg_Test_Score','Median_Household_Income','Resource_Cost_Per_Student'
])

# Define predictors and target
X = df[[
    'Total_Students','Total_Teachers','Student_Teacher_Ratio',
    'Facilities_Rating','Avg_Test_Score','Median_Household_Income'
]]
y = df['Resource_Cost_Per_Student']

# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
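
Block 2 assumes specific column names, which may not match the downloaded file. A minimal sketch of a check that could run before it is shown here; the helper name missing_columns and the REQUIRED_COLUMNS list are illustrative assumptions, not part of the original project.

```python
# Columns that Block 2 assumes exist (taken from the comments in that block).
REQUIRED_COLUMNS = [
    "Total_Students", "Total_Teachers", "Facilities_Rating",
    "Avg_Test_Score", "Median_Household_Income", "Resource_Cost_Per_Student",
]

def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in their listed order."""
    present = set(available)
    return [name for name in required if name not in present]

# Example with a hypothetical header that lacks the cost column.
example_header = ["Total_Students", "Total_Teachers", "Facilities_Rating",
                  "Avg_Test_Score", "Median_Household_Income"]
missing = missing_columns(example_header)
print(missing)  # ['Resource_Cost_Per_Student']
```

With the real data, calling missing_columns(df.columns) and stopping when the result is non-empty would replace a KeyError with a clearer message.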

Stepwise Regression Function

The stepwise_selection function performs a hybrid forward-backward procedure:

Forward inclusion: adds the excluded variable with the smallest p-value, provided that p < 0.01.
Backward elimination: removes the included variable with the largest p-value, provided that p > 0.05.

Iteration stops when neither step changes the model, yielding a parsimonious set of significant predictors.

# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting features; None means start from an empty model
    included = list(initial_list or [])
    while True:
        changed = False

        # Forward step: test adding each excluded predictor
        excluded = [col for col in X.columns if col not in included]  # keeps column order stable
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")

        # Backward step: test removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included
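
To see the add and drop rules in isolation, the sketch below replays the same loop with looked-up p-values instead of fitted models. Every p-value in PVALUES is invented for illustration, and toy_stepwise and the three variable names are hypothetical. The table mimics a case where enrollment looks significant on its own but loses significance once ratio and income are in the model.

```python
def toy_stepwise(candidates, pvalue_table, threshold_in=0.01, threshold_out=0.05):
    """Trace the add/drop loop using looked-up p-values instead of fitted models.

    pvalue_table maps frozenset(included names) -> {name: p-value in that model}.
    """
    included, trace = [], []
    while True:
        changed = False

        # Forward step: p-value each excluded variable would get if added.
        trial = {
            name: pvalue_table[frozenset(included + [name])][name]
            for name in candidates if name not in included
        }
        if trial:
            best = min(trial, key=trial.get)
            if trial[best] < threshold_in:
                included.append(best)
                trace.append(("add", best))
                changed = True

        # Backward step: drop the least significant variable in the current model.
        if included:
            current = pvalue_table[frozenset(included)]
            worst = max(included, key=current.get)
            if current[worst] > threshold_out:
                included.remove(worst)
                trace.append(("drop", worst))
                changed = True

        if not changed:
            return included, trace

# Made-up p-values for every model the loop can visit (illustration only).
PVALUES = {
    frozenset({"enrollment"}): {"enrollment": 0.001},
    frozenset({"ratio"}): {"ratio": 0.004},
    frozenset({"income"}): {"income": 0.008},
    frozenset({"enrollment", "ratio"}): {"enrollment": 0.030, "ratio": 0.002},
    frozenset({"enrollment", "income"}): {"enrollment": 0.020, "income": 0.009},
    frozenset({"ratio", "income"}): {"ratio": 0.001, "income": 0.003},
    frozenset({"enrollment", "ratio", "income"}):
        {"enrollment": 0.400, "ratio": 0.001, "income": 0.005},
}

selected, trace = toy_stepwise(["enrollment", "ratio", "income"], PVALUES)
print(selected)  # ['ratio', 'income']
print(trace)     # add enrollment, add ratio, add income, drop enrollment
```

The trace shows why the backward step matters: a variable admitted early can be removed later when correlated predictors absorb its effect.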

Model Building & Evaluation

We fit an Ordinary Least Squares regression on the selected features using statsmodels. The .summary() output provides coefficient estimates (USD impact per unit change), p‑values (statistical significance), R², and diagnostic statistics.
Predictions on the held‑out test set allow computation of R² (explained variance) and RMSE (prediction error magnitude), quantifying model generalization.

# Block 4: Select features via stepwise regression
selected_features = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

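# Predict on test set (has_constant="add" forces the intercept column even if a
# test feature happens to be constant, keeping columns aligned with training)
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")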
y_pred = model.predict(X_test_sel)

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
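
As a sanity check on what the two metrics measure, here is a small hand computation in plain Python. The function name r2_and_rmse and the four cost values are made up for illustration; the formulas are the usual R² = 1 - SS_res / SS_tot and RMSE = sqrt(SS_res / n).

```python
import math

def r2_and_rmse(y_true, y_pred):
    """Compute R² and RMSE from their definitions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total variation
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n)

# Hypothetical per-student costs (USD) and predictions that are each off by 10.
y_true_demo = [100.0, 200.0, 300.0, 400.0]
y_pred_demo = [110.0, 190.0, 310.0, 390.0]

r2_demo, rmse_demo = r2_and_rmse(y_true_demo, y_pred_demo)
print(round(r2_demo, 3), rmse_demo)  # 0.992 10.0
```

RMSE is in the same units as the target, so 10.0 here reads as a typical error of 10 USD per student, while R² is unitless.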

Residual Diagnostics

A residual plot checks for non‑random patterns or heteroscedasticity, validates core OLS assumptions, and highlights potential model misspecification.

# Block 5: Residual plot to check homoscedasticity
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per Student (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
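
A rough numeric companion to the plot, sketched under the assumption that a simple heuristic is acceptable: compare the mean squared residual for the higher half of predictions with that for the lower half. This is not a formal heteroscedasticity test, and the helper name spread_ratio and the toy numbers are hypothetical.

```python
def spread_ratio(predicted, residuals):
    """Mean squared residual in the upper half of predictions divided by the lower half.

    Values far from 1 hint that residual spread changes with the predicted value.
    """
    pairs = sorted(zip(predicted, residuals))      # order residuals by prediction
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]

    def mean_square(values):
        return sum(v * v for v in values) / len(values)

    return mean_square(high) / mean_square(low)

# Toy data: residuals are twice as large for the three highest predictions.
pred_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
resid_demo = [1.0, -1.0, 1.0, -2.0, 2.0, -2.0]
ratio_demo = spread_ratio(pred_demo, resid_demo)
print(ratio_demo)  # 4.0
```

On the real model, spread_ratio(list(y_pred), list(residuals)) gives a single number to read alongside the scatter plot.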

Summary

Applying stepwise regression to school resource data identifies which candidate predictors, such as student–teacher ratio, facilities rating, and local income, are statistically significant cost drivers, while pruning less informative variables. The resulting linear model is easy to interpret because it keeps only a few significant predictors, and its test‑set R² and RMSE show how well it generalizes. This gives educational administrators a transparent forecasting tool for budgeting per‑student resources and allocating funds.
