School Performance Prediction using Stepwise Regression in ML
Schools and teachers increasingly rely on data to identify factors affecting student success. In this project, we aim to develop a predictive model that estimates a student’s exam score (the math score, in the dataset used here) based on demographic, socio‑economic, and academic variables. Using stepwise regression, we will identify the most significant predictors and develop a parsimonious linear model that balances accuracy with interpretability. This can help schools allocate resources, tailor interventions, and ultimately boost student performance.
Libraries Required
import pandas as pd                                         # Data manipulation
import numpy as np                                          # Numerical operations
import statsmodels.api as sm                                # Statistical modeling
from sklearn.model_selection import train_test_split        # Data splitting
from sklearn.metrics import r2_score, mean_squared_error    # Evaluation
import matplotlib.pyplot as plt                             # Visualization
Dataset
Step-by-Step Code Implementation
Data Loading and Initial Inspection
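Kaggle downloads require authentication and are delivered as a ZIP archive, so pd.read_csv cannot read the dataset's download URL directly. Instead, download the dataset from https://www.kaggle.com/datasets/spscientist/students-performance-in-exams, extract the CSV, and load it from a local path. We then inspect its structure and view summary statistics to understand variable distributions.
# Block 1: Load dataset
# Assumes the CSV was downloaded from Kaggle and extracted into the working
# directory; adjust the path and file name to match your download.
csv_path = "StudentsPerformance.csv"
df = pd.read_csv(csv_path)
# Quick look
print(df.head())
df.info()  # prints its report directly; wrapping it in print() also prints "None"
print(df.describe())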
Data Preprocessing
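Missing values, if any, are dropped first to simplify the pipeline; doing this before encoding matters because get_dummies turns a missing category into a row of zeros that a later dropna would not catch. Categorical features (gender, race/ethnicity, parental education, lunch type, test preparation) are then one‑hot encoded, dropping the first level of each so the dummies are not perfectly collinear with the intercept. We split the data into training and testing sets to evaluate generalisation.
# Block 2: Handle missing values (if any) before encoding
df = df.dropna()
# Encode categorical variables; dtype=float keeps the dummies numeric,
# since boolean dummies can leave statsmodels with an object-typed matrix
df_encoded = pd.get_dummies(df, drop_first=True, dtype=float)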
# Define features and target
X = df_encoded.drop("math score", axis=1)
y = df_encoded["math score"]
# Train–test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
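To make the encoding step concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and the frame `toy` is hypothetical; only the column naming style follows the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values (not taken from the real dataset).
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
})

# drop_first=True drops the first category in sorted order ("free/reduced"),
# which becomes the baseline level; dtype=float keeps the dummy numeric.
toy_encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

print(toy_encoded.columns.tolist())
print(toy_encoded["lunch_standard"].tolist())
```

The single remaining dummy, `lunch_standard`, is 1.0 for students with a standard lunch and 0.0 for the baseline group, so its regression coefficient reads as the difference from that baseline.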
Stepwise Regression Function
We define a function that combines forward inclusion (adding the most significant variable under threshold_in) with backward elimination (removing the least significant variable over threshold_out). This loop continues until no variables meet the criteria for addition or removal.
# Block 3: Forward–backward stepwise function
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Keep threshold_in below threshold_out; otherwise a feature could be
    # added and dropped in an endless cycle.
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # forward step: fit one model per excluded feature, record its p-value
        excluded = [col for col in X.columns if col not in included]
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_col]])).fit()
            new_pval[new_col] = model.pvalues[new_col]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print(f"Add {best_feature:30} with p-value {best_pval:.6}")
        # backward step: refit with all included features, drop the weakest
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvalues = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvalues.max()
        if worst_pval > threshold_out:
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            changed = True
            if verbose:
                print(f"Drop {worst_feature:30} with p-value {worst_pval:.6}")
        if not changed:
            break
    return included
Model Building & Evaluation
- Model Fitting: Using the selected features, we fit an Ordinary Least Squares model via statsmodels. We review the detailed regression summary to check coefficients, p-values, and overall fit.
- Evaluation: We predict on the test set and compute R² (proportion of variance explained) and RMSE (root-mean-square error) to quantify predictive accuracy.
# Block 4: Select features
selected_features = stepwise_selection(X_train, y_train)
# Fit final model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
# Summary of the regression
print(model.summary())
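# Predict and evaluate (has_constant="add" forces the intercept column even if
# a selected feature happens to be constant within the test split)
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")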
y_pred = model.predict(X_test_sel)
print("R² on test set:", r2_score(y_test, y_pred))
print("RMSE on test set:", np.sqrt(mean_squared_error(y_test, y_pred)))
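For readers who want to see what the two metrics measure, here is a small hand computation with NumPy. The arrays `y_true` and `y_hat` are made-up numbers, not model output.

```python
import numpy as np

# Hypothetical true values and predictions, chosen so the arithmetic is easy.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# Residual sum of squares: squared prediction errors (1 + 0 + 1 + 0 = 2).
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares around the mean of 6 (9 + 1 + 1 + 9 = 20).
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot             # 1 - 2/20 = 0.9
rmse = np.sqrt(ss_res / len(y_true))   # sqrt(2/4), about 0.707

print(r2, rmse)
```

R² compares the model against always predicting the mean, while RMSE is in the same units as the target, so an RMSE of 5 on this dataset would mean a typical error of about five score points.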
Residual Analysis
A residual plot helps verify assumptions of homoscedasticity (constant variance) and identify potential outliers.
# Block 5: Plot residuals
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle='--')
plt.xlabel("Predicted Score")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()
Summary
By applying stepwise regression to the “Students’ Performance in Exams” dataset, we isolate the predictors most strongly associated with math scores while discarding redundant ones. Candidates include parental education level, completion of test preparation, and lunch type; because reading and writing scores are also among the features, they are likely to rank among the strongest predictors. The exact selection, and the R² and RMSE obtained on held‑out data, depend on the train-test split and the chosen thresholds. The final OLS model balances simplicity and predictive power, giving educators an interpretable tool for targeting interventions and allocating resources where they are likely to have the greatest impact on student outcomes.