Academic Program Cost Prediction using Stepwise Regression in ML
Universities incur varying per‑student costs across academic programs—driven by factors such as total enrollment, instructional expenditure, support services spending, locale characteristics, and year‑to‑year funding changes. Accurately forecasting expenditure per student enables administrators to budget effectively, benchmark programs, and allocate resources equitably. In this project, we’ll predict the cost per student for U.S. public school districts (as a proxy for program cost) using district‑level finance and enrollment data, applying stepwise regression to isolate the most significant drivers and build an interpretable linear model.
Libraries Required
import pandas as pd                                          # Data loading & manipulation
import numpy as np                                           # Numerical operations
import statsmodels.api as sm                                 # OLS regression
from sklearn.model_selection import train_test_split         # Train/test split
from sklearn.metrics import r2_score, mean_squared_error     # Evaluation metrics
import matplotlib.pyplot as plt                              # Visualization
Dataset
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load a comprehensive district‑level finance dataset covering revenues and expenditures from 1992 to 2016, inspecting key fields and checking for missing values.
# Block 1: Load dataset
# U.S. Educational Finances by school district (1992–2016)
df = pd.read_csv("us_educational_finances.csv")
print(df.head())
print(df.info())
print(df.describe())
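The missing-value check mentioned above can be sketched on a small stand-in table. The values below are made up for illustration, and only the column names are taken from this article; the real file may be laid out differently.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the district finance table (values are invented).
toy = pd.DataFrame({
    "Total_Expenditure": [5_000_000.0, np.nan, 7_200_000.0, 3_100_000.0],
    "Enrollment": [500.0, 800.0, np.nan, 0.0],
    "Locale": ["Urban", "Rural", None, "Suburban"],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Rows with zero enrollment would make cost per student undefined,
# so it is worth counting them before dividing.
zero_enrollment = int((toy["Enrollment"] == 0).sum())
print("Rows with zero enrollment:", zero_enrollment)
```

Running the same two checks on `df` shows which columns need attention before the target is computed.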
Data Preprocessing
We compute the target Cost_per_Student by dividing total expenditures by enrollment. Categorical Locale is one‑hot encoded to capture urban/suburban/rural differences. Key predictors include instructional and support services spending, enrollment count, year, and locale dummies. We split into 80% training and 20% test sets.
Key columns include: Total_Expenditure, Instruction_Expenditure, Support_Services_Expenditure, Enrollment, Locale (e.g., urban, suburban, rural), and Year.
# Block 2: Compute cost per student and encode categoricals
df = df.dropna(subset=[
    'Total_Expenditure', 'Instruction_Expenditure',
    'Support_Services_Expenditure', 'Enrollment', 'Locale', 'Year'
])
# Keep only districts with positive enrollment to avoid division by zero
df = df[df['Enrollment'] > 0].copy()
# Define target: cost per student
df['Cost_per_Student'] = df['Total_Expenditure'] / df['Enrollment']
# One-hot encode locale (assuming values like 'Urban', 'Rural', 'Suburban');
# dtype=float keeps the dummies numeric so statsmodels OLS accepts them
df_enc = pd.get_dummies(df, columns=['Locale'], drop_first=True, dtype=float)
# Select predictors
X = df_enc[[
    'Instruction_Expenditure',
    'Support_Services_Expenditure',
    'Enrollment',
    'Year'
] + [col for col in df_enc.columns if col.startswith('Locale_')]]
y = df_enc['Cost_per_Student']
# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
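To make the encoding step concrete, here is a minimal sketch of what `drop_first=True` does to a locale column. The three locale labels are the ones assumed in the code comment above, and `dtype=float` is an assumption added here so the dummy columns are plainly numeric.

```python
import pandas as pd

# Toy table with one row per locale label (values are invented).
toy = pd.DataFrame({
    "Enrollment": [500, 800, 1200],
    "Locale": ["Urban", "Rural", "Suburban"],
})

# Categories are sorted alphabetically, so drop_first=True removes "Rural".
# "Rural" becomes the baseline that the other locale coefficients are measured against.
encoded = pd.get_dummies(toy, columns=["Locale"], drop_first=True, dtype=float)
print(encoded.columns.tolist())
print(encoded)
```

The rural row ends up with zeros in both remaining dummy columns, which is how the baseline category is represented in the regression.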
Stepwise Regression Function
Our stepwise_selection function alternates between two steps:
- Forward inclusion: adds the excluded feature with the lowest p-value, provided it is below 0.01.
- Backward elimination: removes the included feature with the highest p-value, provided it is above 0.05.
The loop repeats until neither step changes the feature set, yielding a parsimonious set of significant predictors.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Avoid a mutable default argument; start from an empty model if none is given
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # Forward step: test each excluded predictor
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            pval = sm.OLS(y, sm.add_constant(X[included + [col]])).fit().pvalues[col]
            new_pvals[col] = pval
        # Only consider adding when candidates remain (min() of an empty Series is NaN)
        if not new_pvals.empty:
            best_pval = new_pvals.min()
            if best_pval < threshold_in:
                best_var = new_pvals.idxmin()
                included.append(best_var)
                changed = True
                if verbose:
                    print(f"Add  {best_var:30} p-value {best_pval:.4f}")
        # Backward step: test each included predictor
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.iloc[1:]  # exclude intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_var = pvals.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:30} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
- Model Fitting: We fit an Ordinary Least Squares regression on the selected features. The .summary() output provides coefficient estimates (USD impact per unit change), p-values (significance), R², adjusted R², and diagnostic statistics (AIC, F‑statistic), offering a transparent interpretation of cost drivers.
- Evaluation: Predictions on the held‑out test set allow calculation of R² (variance explained) and RMSE (root‑mean‑square error), quantifying out‑of‑sample predictive accuracy.
# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)
# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set
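# has_constant='add' forces the intercept column even if a test-set feature happens to be constant
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')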
y_pred = model.predict(X_test_sel)
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
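For readers who want to see what the two metrics measure, here is a small worked example that computes them by hand with NumPy. The four values are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical true and predicted costs (invented numbers).
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 12.0, 13.0, 16.0])

# Residual sum of squares: 1 + 0 + 1 + 0 = 2
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares around the mean (13): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R² = 1 - SS_res / SS_tot = 1 - 2/20 = 0.9
r2 = 1.0 - ss_res / ss_tot
# RMSE = sqrt(mean squared error) = sqrt(2/4), about 0.7071
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print("R²:", r2)
print("RMSE:", rmse)
```

R² is unitless, while RMSE is in the same units as the target, so in this project it reads as dollars per student.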
Residual Diagnostics
A residuals‑vs‑predicted plot checks for non‑random patterns or heteroscedasticity, validating linear regression assumptions and highlighting any outliers or model misspecification.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per Student (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
Summary
By applying stepwise regression to U.S. district finance data, we isolate the most influential drivers of per‑student expenditure from candidates such as instructional spending, support service costs, enrollment, year, and locale, while pruning non‑informative variables. The resulting linear model is interpretable, with few, statistically significant predictors, and its predictive performance can be judged from the test‑set R² and RMSE reported above. This gives education administrators and policymakers a transparent tool for forecasting program costs and planning budgets.