Classroom Resource Cost Prediction using Stepwise Regression in ML

School administrators and district finance officers must budget for classroom resources—textbooks, technology, lab equipment, and furniture—based on factors like enrollment size, grade level mix, facilities age, and neighborhood socio‑economic status. In this project, we will predict the annual resource cost per classroom using features drawn from a comprehensive school database. By applying stepwise regression, we’ll isolate the most significant cost drivers and build an interpretable linear model that balances simplicity with predictive performance—helping districts allocate budgets more effectively.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

School Database: Comprehensive Educational Data

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We import the “Great School” dataset, which contains demographic, academic, and financial variables for a large sample of U.S. schools.

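# Block 1: Load dataset
# School Database: Comprehensive Educational Data (Kaggle)
# https://www.kaggle.com/datasets/bernardnm/great-school
# Kaggle downloads require a login, so pd.read_csv cannot read the download URL
# directly. Download and unzip the dataset first, then point csv_path at the CSV.
csv_path = "great_school.csv"  # assumed local file name; adjust to your download
df = pd.read_csv(csv_path)

print(df.head())
df.info()  # info() prints its report itself and returns None
print(df.describe())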

Feature Engineering & Preprocessing

We calculate Cost_per_Classroom by dividing total expenditure by the number of classrooms, and derive Student_Teacher_Ratio as a potential predictor. We drop any records that are missing key variables to ensure data integrity.

# Block 2: Compute cost per classroom and encode features
# Assume df includes: 'Total_Expenditure_USD', 'Num_Classrooms', 'Total_Students',
# 'Total_Teachers', 'Facilities_Age', 'Pct_Free_Lunch', 'Avg_Test_Score', and
# 'Median_Household_Income'. From these we derive the target:
df['Cost_per_Classroom'] = df['Total_Expenditure_USD'] / df['Num_Classrooms']

# Student–teacher ratio as a potential driver
df['Student_Teacher_Ratio'] = df['Total_Students'] / df['Total_Teachers']

# Zero classrooms or teachers produce ±inf; convert to NaN so dropna removes them
df = df.replace([np.inf, -np.inf], np.nan)

# Drop rows missing any of our key predictors or the target
df = df.dropna(subset=[
    'Cost_per_Classroom','Student_Teacher_Ratio','Facilities_Age',
    'Pct_Free_Lunch','Avg_Test_Score','Median_Household_Income'
])

# Define predictors and target
X = df[[
    'Student_Teacher_Ratio','Facilities_Age','Pct_Free_Lunch',
    'Avg_Test_Score','Median_Household_Income'
]]
y = df['Cost_per_Classroom']

# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Stepwise Regression Function

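The stepwise_selection function alternates two steps until neither changes the model: a forward step that adds the excluded feature with the lowest p-value, provided it is below 0.01, and a backward step that removes the included feature with the highest p-value, provided it is above 0.05. Keeping the entry threshold stricter than the removal threshold prevents a feature from being added and dropped in an endless cycle. The result is a concise set of significant predictors.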

# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Avoid a mutable default argument; start from an empty model if none is given
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False

        # Forward step: fit the current model plus each excluded feature in turn
        excluded = [col for col in X.columns if col not in included]  # stable order
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        if excluded and new_pvals.min() < threshold_in:
            best_var = new_pvals.idxmin()
            best_pval = new_pvals.min()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")

        # Backward step: refit and drop the least significant included feature
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.drop('const', errors='ignore')  # exclude intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_var = pvals.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

We fit an Ordinary Least Squares regression on the selected features using statsmodels. The .summary() output reports coefficient estimates (the change in cost in USD per unit change in a predictor, holding the others fixed), p-values (statistical significance), R², adjusted R², and diagnostic statistics.
Predictions on the held‑out test set allow us to compute R² (variance explained) and RMSE (prediction error), quantifying how well the model generalises to new schools.

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

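# Fit final OLS model
# has_constant='add' forces an intercept column even if a feature happens to be constant
X_train_sel = sm.add_constant(X_train[selected_features], has_constant='add')
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set (same columns, same order as in training)
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')
y_pred = model.predict(X_test_sel)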

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
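
As a sanity check on what these two metrics mean, the sketch below recomputes them by hand on four made-up values (not taken from the school data) and compares the result with scikit-learn. R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean, and RMSE is the square root of the mean squared residual.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Toy values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# Residual and total sums of squares
ss_res = np.sum((y_true - y_hat) ** 2)          # 0.25 + 0.25 + 0 + 1 = 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20

r2_manual = 1.0 - ss_res / ss_tot               # 1 - 1.5 / 20
rmse_manual = np.sqrt(ss_res / len(y_true))     # sqrt(1.5 / 4)

print("Manual R²:", r2_manual, "| sklearn:", r2_score(y_true, y_hat))
print("Manual RMSE:", rmse_manual, "| sklearn:",
      np.sqrt(mean_squared_error(y_true, y_hat)))
```

Here R² comes out at about 0.925 and RMSE at about 0.61. RMSE is in the same units as the target, so for the model above it reads directly as a typical error in USD per classroom.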

Residual Diagnostics

A scatter plot of residuals versus predicted costs helps check for non‑random patterns or heteroscedasticity, either of which would violate core OLS assumptions, and highlights any outliers.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per Classroom (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()

Summary

By applying stepwise regression to school finance data, we isolate the most influential drivers of classroom resource costs from candidates such as student–teacher ratio, facility age, and socioeconomic indicators, while pruning noninformative variables. The resulting linear model balances interpretability (few, significant predictors) against predictive performance, which the test‑set R² and RMSE quantify, giving administrators a transparent tool to forecast budgets and allocate resources where they will have the greatest impact.
