School Performance Prediction using Stepwise Regression in ML


Schools and teachers increasingly rely on data to identify the factors that affect student success. In this project, we build a predictive model that estimates a student’s math exam score from demographic, socio‑economic, and academic variables. Using stepwise regression, we identify the most significant predictors and fit a parsimonious linear model that balances accuracy with interpretability. Such a model can help schools allocate resources, tailor interventions, and ultimately improve student performance.

Libraries Required

import pandas as pd               # Data manipulation
import numpy as np                # Numerical operations
import statsmodels.api as sm      # Statistical modeling
from sklearn.model_selection import train_test_split   # Data splitting
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation
import matplotlib.pyplot as plt   # Visualization

Dataset

Students Performance in Exams

Step-by-Step Code Implementation

Data Loading and Initial Inspection

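We load the CSV into pandas, inspect its structure, and view summary statistics to understand variable distributions. Kaggle downloads require a signed-in account, so pd.read_csv cannot fetch the file from the dataset's download link; download it manually first.

# Block 1: Load dataset
# Download the CSV from
# https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
# and save it next to this script. The file name below is an assumption;
# adjust it to match your download.
csv_path = "StudentsPerformance.csv"
df = pd.read_csv(csv_path)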

# Quick look
print(df.head())
df.info()  # prints its report directly; wrapping it in print() would also print "None"
print(df.describe())

Data Preprocessing

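Categorical features (gender, race/ethnicity, parental education, lunch type, test preparation) are one‑hot encoded, with the first level of each dropped as the baseline. Rows with missing values, if any, are dropped to simplify the pipeline. The math score is the target, and all remaining columns, including the reading and writing scores, are candidate predictors. We then split the data into training (80%) and test (20%) sets to evaluate generalisation.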

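# Block 2: Encode categorical variables
# dtype=float keeps the dummies numeric; recent pandas versions default to bool,
# which can make statsmodels reject the design matrix as object-typed
df_encoded = pd.get_dummies(df, drop_first=True, dtype=float)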

# Handle missing values (if any)
df_encoded = df_encoded.dropna()

# Define features and target
X = df_encoded.drop("math score", axis=1)
y = df_encoded["math score"]

# Train–test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
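To make the encoding step concrete, here is a minimal sketch on a toy frame. The rows are invented for illustration; only the column names mirror the dataset. It shows which dummy columns survive drop_first=True and which level becomes the baseline.

```python
import pandas as pd

# Toy rows with made-up values (hypothetical, not taken from the dataset)
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard"],
    "test preparation course": ["none", "completed", "none"],
    "math score": [72, 47, 90],
})

# drop_first=True removes the alphabetically first level of each categorical
# column; that level becomes the baseline absorbed by the intercept.
toy_encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

print(toy_encoded.columns.tolist())
# ['math score', 'lunch_standard', 'test preparation course_none']
```

Here "free/reduced" and "completed" are the dropped baselines, so a coefficient on lunch_standard is read as the difference relative to free/reduced lunch.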

Stepwise Regression Function

We define a function that combines forward inclusion (adding the most significant variable under threshold_in) with backward elimination (removing the least significant variable over threshold_out). This loop continues until no variables meet the criteria for addition or removal.

# Block 3: Forward–backward stepwise function
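def stepwise_selection(X, y, 
                       initial_list=None, 
                       threshold_in=0.01, 
                       threshold_out=0.05, 
                       verbose=True):
    """Forward-backward selection on OLS p-values.

    Keep threshold_in < threshold_out; otherwise a feature can be added
    and dropped again on every pass and the loop may never terminate.
    """
    # None instead of a mutable [] default; always start from a fresh list
    included = list(initial_list) if initial_list is not None else []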
    while True:
        changed = False
        # forward step
        # keep the original column order so runs are reproducible
        excluded = [col for col in X.columns if col not in included]
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_col in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_col]]))).fit()
            new_pval[new_col] = model.pvalues[new_col]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print(f"Add  {best_feature:30} with p-value {best_pval:.6}")

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        pvalues = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvalues.max()
        if worst_pval > threshold_out:
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            changed = True
            if verbose:
                print(f"Drop {worst_feature:30} with p-value {worst_pval:.6}")
        if not changed:
            break
    return included

Model Building & Evaluation

Model Fitting: Using the selected features, we fit an Ordinary Least Squares model via statsmodels. We review the detailed regression summary to check coefficients, p-values, and overall fit.
Evaluation: We predict on the test set and compute R² (proportion of variance explained) and RMSE (root-mean-square error) to quantify predictive accuracy.

# Block 4: Select features
selected_features = stepwise_selection(X_train, y_train)

# Fit final model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()

# Summary of the regression
print(model.summary())

# Predict and evaluate
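# has_constant="add" forces the intercept column even if a selected dummy
# happens to be constant in the test split, which would otherwise be skipped
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")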
y_pred = model.predict(X_test_sel)
print("R² on test set:", r2_score(y_test, y_pred))
print("RMSE on test set:", np.sqrt(mean_squared_error(y_test, y_pred)))
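The two metrics are easy to verify by hand. The sketch below computes them from their definitions with NumPy on four made-up scores (illustrative values, not model output): R² = 1 − SS_res / SS_tot and RMSE = sqrt(SS_res / n).

```python
import numpy as np

# Hypothetical true and predicted scores, chosen for easy arithmetic
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_hat = np.array([62.0, 68.0, 83.0, 87.0])

# Residual sum of squares: 4 + 4 + 9 + 9 = 26
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares around the mean (75): 225 + 25 + 25 + 225 = 500
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot            # 1 - 26/500 = 0.948
rmse = np.sqrt(ss_res / len(y_true))  # sqrt(6.5), about 2.55

print(f"R2 = {r2:.3f}, RMSE = {rmse:.2f}")
```

RMSE is in the same units as the target, so a value of 2.55 here means predictions are off by roughly two and a half score points on average.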

Residual Analysis

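A residual plot helps check the assumption of homoscedasticity (constant error variance) and spot potential outliers. Residuals should scatter evenly around the zero line; a funnel shape or a curve suggests non-constant variance or a missed nonlinear effect.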

# Block 5: Plot residuals
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle='--')
plt.xlabel("Predicted Score")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()

Summary

By applying stepwise regression to the “Students Performance in Exams” dataset, we keep only the predictors of math score that pass the significance thresholds, which may include factors such as parental education level, completion of test preparation, and lunch type, and discard redundant ones. The final OLS model favours simplicity while retaining predictive power, and its R² and RMSE on held‑out data show how well it generalises. This approach offers educators an interpretable tool to target interventions and allocate resources where they will have the greatest impact on student outcomes.
