Academic Program Cost Prediction using Stepwise Regression in ML


Universities incur varying per‑student costs across academic programs—driven by factors such as total enrollment, instructional expenditure, support services spending, locale characteristics, and year‑to‑year funding changes. Accurately forecasting expenditure per student enables administrators to budget effectively, benchmark programs, and allocate resources equitably. In this project, we’ll predict the cost per student for U.S. public school districts (as a proxy for program cost) using district‑level finance and enrollment data, applying stepwise regression to isolate the most significant drivers and build an interpretable linear model.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

U.S. Educational Finances

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load a comprehensive district‑level finance dataset covering revenues and expenditures from 1992 to 2016, inspecting key fields and checking for missing values.

# Block 1: Load dataset
# U.S. Educational Finances by school district (1992–2016)
df = pd.read_csv("us_educational_finances.csv")

print(df.head())
print(df.info())
print(df.describe())
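To make the missing-value check explicit, the sketch below counts missing entries per column. The toy frame and its values are hypothetical stand-ins for the real dataset; only the column names are borrowed from this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real finance dataset
toy = pd.DataFrame({
    "Total_Expenditure": [1200.0, 950.0, np.nan, 3100.0],
    "Enrollment": [100.0, np.nan, 80.0, 250.0],
    "Locale": ["Urban", "Rural", "Rural", None],
})

# Number and share of missing entries in each column
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean().round(2)
print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```

Running the same two calls on df shows which columns would lose the most rows once incomplete records are dropped.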

Data Preprocessing

We first drop rows with missing values in the columns we need, then compute the target Cost_per_Student by dividing total expenditures by enrollment. The categorical Locale column is one-hot encoded to capture urban/suburban/rural differences. Predictors are instructional and support services spending, enrollment count, year, and the locale dummies. Finally, we split the data into 80% training and 20% test sets.

Key columns include: Total_Expenditure, Instruction_Expenditure, Support_Services_Expenditure, Enrollment, Locale (urban/suburban/rural), and Year.

# Block 2: Compute cost per student and encode categoricals
df = df.dropna(subset=[
    'Total_Expenditure','Instruction_Expenditure',
    'Support_Services_Expenditure','Enrollment','Locale','Year'
])

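# Define target: cost per student (drop zero-enrollment rows to avoid division by zero)
df = df[df['Enrollment'] > 0].copy()
df['Cost_per_Student'] = df['Total_Expenditure'] / df['Enrollment']

# One-hot encode locale (assuming values like 'Urban', 'Rural', 'Suburban');
# dtype=float keeps the dummies numeric so statsmodels can use them directly
df_enc = pd.get_dummies(df, columns=['Locale'], drop_first=True, dtype=float)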

# Select predictors
X = df_enc[[
    'Instruction_Expenditure', 
    'Support_Services_Expenditure',
    'Enrollment',
    'Year'
] + [col for col in df_enc.columns if col.startswith('Locale_')]]
y = df_enc['Cost_per_Student']

# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
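As a quick check on the two transformations above, here is a minimal sketch on three made-up districts. The figures are hypothetical and only illustrate what the target and the locale dummies look like.

```python
import pandas as pd

# Three hypothetical districts (values invented for illustration)
toy = pd.DataFrame({
    "Total_Expenditure": [1_000_000.0, 2_400_000.0, 450_000.0],
    "Enrollment": [100, 200, 50],
    "Locale": ["Urban", "Suburban", "Rural"],
})

# Target: total spending divided by the number of enrolled students
toy["Cost_per_Student"] = toy["Total_Expenditure"] / toy["Enrollment"]

# drop_first=True drops the alphabetically first level ('Rural'),
# which then acts as the baseline category in the regression
toy_enc = pd.get_dummies(toy, columns=["Locale"], drop_first=True, dtype=float)
print(toy_enc)
```

Because Rural is the dropped level here, the coefficients on Locale_Suburban and Locale_Urban would be read as differences relative to rural districts.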

Stepwise Regression Function

Our stepwise_selection function alternates between two steps:

Forward inclusion: adds the excluded feature with the lowest p-value, provided it is below 0.01.
Backward elimination: removes the included feature with the highest p-value, provided it is above 0.05.

The loop repeats until neither step changes the model, yielding a parsimonious set of significant predictors.

# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Keep threshold_in below threshold_out; otherwise a variable
    # could be added and dropped in an endless cycle.
    included = list(initial_list) if initial_list else []
    while True:
        changed = False

        # Forward step: test each excluded predictor (in column order, for reproducibility)
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            pval = sm.OLS(y, sm.add_constant(X[included + [col]])).fit().pvalues[col]
            new_pvals[col] = pval
        best_pval = new_pvals.min()  # NaN when nothing is left to test
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.4f}")

        # Backward step: test each included predictor
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.drop('const', errors='ignore')  # exclude intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_var = pvals.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:30} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

Model Fitting: We fit an Ordinary Least Squares regression on the selected features. The .summary() output provides coefficient estimates (USD impact per unit change), p-values (significance), R², adjusted R², and diagnostic statistics (AIC, F-statistic), offering a transparent interpretation of cost drivers.
Evaluation: Predictions on the held‑out test set allow calculation of R² (variance explained) and RMSE (root‑mean‑square error), quantifying out‑of‑sample predictive accuracy.

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

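# Fit final OLS model
# has_constant='add' forces an intercept column even if a selected
# feature happens to be constant within this subset
X_train_sel = sm.add_constant(X_train[selected_features], has_constant='add')
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set (same column layout as the training design matrix)
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')
y_pred = model.predict(X_test_sel)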

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
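For readers who want to see what the two metrics compute, here is a small worked example using NumPy only. The four observed and predicted values are made up; the formulas are the standard definitions of R² and RMSE.

```python
import numpy as np

# Hypothetical observed and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

resid = y_true - y_hat                          # [1, 0, -1, 0]
ss_res = np.sum(resid ** 2)                     # residual sum of squares = 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 20

r2 = 1.0 - ss_res / ss_tot                      # share of variance explained
rmse = np.sqrt(np.mean(resid ** 2))             # typical error in the target's units

print(f"R^2 = {r2:.3f}, RMSE = {rmse:.3f}")     # R^2 = 0.900, RMSE = 0.707
```

In this project RMSE is expressed in USD per student, which makes it easier to communicate to administrators than R² alone.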

Residual Diagnostics

A residuals‑vs‑predicted plot checks for non‑random patterns or heteroscedasticity, validating linear regression assumptions and highlighting any outliers or model misspecification.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per Student (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()

Summary

By applying stepwise regression to U.S. district finance data, we isolate the most influential drivers of per-student expenditure (the candidates here are instructional spending, support service costs, enrollment, year, and locale) while pruning non-informative variables. The resulting linear model stays interpretable because it keeps only a few statistically significant predictors, and its predictive performance can be read directly from the test-set R² and RMSE. This gives education administrators and policymakers a transparent tool for forecasting program costs and planning budgets.
