Student Grade Curve Prediction using Polynomial Regression in ML
Educational institutions often adjust raw exam scores to fit a grade curve, mapping individual performance onto a standard distribution. We want to build a model that predicts a student’s final grade (G3) from their first-period (G1) and second-period (G2) grades, along with key background factors (study time, past failures, family support, and so on). The relationship between these inputs and the final grade may be nonlinear (for instance, improvements from G1 to G2 may have diminishing returns), in which case simple linear regression underfits.
Polynomial Regression helps us capture these curvatures and interactions to accurately forecast final grades. This allows educators to anticipate outcomes and tailor interventions.
Libraries Required
import pandas as pd  # for data handling
import numpy as np  # for numerical operations
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt  # for plotting
import seaborn as sns  # for enhanced visualisation
Dataset
Step-by-Step Code Implementation
1. Load Data & Libraries
import pandas as pd
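# Math course file. Assumption: the file is semicolon-delimited, as in the
# commonly distributed copy; drop sep=";" if your copy uses commas.
df = pd.read_csv("data/student-mat.csv", sep=";")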
# Display relevant columns
df[['G1','G2','G3','studytime','failures','schoolsup','famsup']].head()
2. Exploratory Data Analysis
import seaborn as sns, matplotlib.pyplot as plt
# Correlation heatmap for numeric predictors
sns.heatmap(df[['G1','G2','G3','studytime','failures']].corr(), annot=True, cmap='Blues')
plt.title("Numeric Feature Correlations")
plt.show()
3. Define Features & Target
# Select predictors and target
numeric_features = ['G1','G2','studytime','failures']
categorical_features = ['schoolsup','famsup','paid','higher','internet']
X = df[numeric_features + categorical_features]
y = df['G3']  # final grade
4. Build Pipeline with Polynomial Features
- StandardScaler puts the numeric inputs (G1, G2, studytime, failures) on a common scale, so the Ridge penalty treats their coefficients comparably.
- OneHotEncoder converts the categorical columns (schoolsup, famsup, paid, higher, internet) into binary flags, so PolynomialFeatures can generate interactions between these flags and the numeric scores.
- PolynomialFeatures augments the preprocessed inputs with their squares and pairwise products (and higher-order terms when the degree is above 2), capturing nonlinearities (e.g., the square of G2 can capture diminishing returns).
- Ridge regression (ℓ² penalty) stabilises coefficient estimates in this expanded feature space, reducing the risk of overfitting.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(drop='first'), categorical_features)
])
pipe = Pipeline([
    ("prep", preprocessor),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge())
])
5. Train/Test Split & Hyperparameter Search
GridSearchCV tunes the polynomial degree (1–3) and Ridge α (10⁻³–10³) using 5‑fold CV to minimise RMSE on held‑out folds.
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
param_grid = {
    "poly__degree": [1, 2, 3],
    "ridge__alpha": np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
    pipe, param_grid,
    cv=5, scoring="neg_root_mean_squared_error",
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
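As a rough sketch of what this search costs, the snippet below counts the columns produced at each polynomial degree and the number of cross-validation fits. It assumes each of the five categorical columns is a yes/no field, so that one-hot encoding with drop='first' yields one flag per column; the helper n_poly_terms is introduced here for illustration and is not part of scikit-learn.

```python
from math import comb

def n_poly_terms(n_features: int, degree: int) -> int:
    """Number of polynomial terms up to `degree`, excluding the constant term."""
    return comb(n_features + degree, degree) - 1

# Assumption: 4 scaled numeric columns plus 5 binary flags after one-hot
# encoding with drop='first' (each categorical is taken to be yes/no).
n_inputs = 4 + 5

degrees = [1, 2, 3]
n_alphas = 7   # np.logspace(-3, 3, 7) gives 7 values
n_folds = 5

for d in degrees:
    print(f"degree={d}: {n_poly_terms(n_inputs, d)} columns")

# One fit per (degree, alpha, fold) combination, before the final refit.
cv_fits = len(degrees) * n_alphas * n_folds
print("cross-validation fits:", cv_fits)
```

Under these assumptions the feature count grows from 9 columns at degree 1 to 54 at degree 2 and 219 at degree 3, which is why the Ridge penalty matters more as the degree increases.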
6. Evaluate Model
from sklearn.metrics import mean_squared_error, r2_score
y_pred = gs.predict(X_test)
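# The squared=False argument was deprecated and later removed from
# mean_squared_error, so take the square root of the MSE explicitly.
rmse = mean_squared_error(y_test, y_pred) ** 0.5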
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f} grade points")
print(f"Test R² : {r2:.3f}")
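For readers who want to see what these two metrics measure, here is a small hand computation on made-up grades (not model output). RMSE is the square root of the mean squared residual, and R² compares the residual sum of squares with the variation around the mean.

```python
from math import sqrt

# Made-up grades, purely for illustration.
y_true = [10.0, 12.0, 14.0]
y_hat = [11.0, 12.0, 13.0]

n = len(y_true)
mean_true = sum(y_true) / n

ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))  # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)         # total sum of squares

rmse_manual = sqrt(ss_res / n)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.3f}")  # 0.816
print(f"R^2 : {r2_manual:.3f}")    # 0.750
```

RMSE is expressed in grade points, so it reads directly as a typical prediction error, while R² is unitless and describes the share of variance explained.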
7. Inspect Key Coefficients
# Extract names of polynomial features
poly = gs.best_estimator_.named_steps["poly"]
num_names = numeric_features
cat_names = gs.best_estimator_.named_steps["prep"] \
    .named_transformers_["cat"] \
    .get_feature_names_out(categorical_features)
feature_names = poly.get_feature_names_out(
    input_features=np.hstack([num_names, cat_names])
)
# Retrieve coefficients
coefs = gs.best_estimator_.named_steps["ridge"].coef_
# Show top 10 by absolute value
import pandas as pd
coef_series = pd.Series(coefs, index=feature_names)
coef_series.abs().sort_values(ascending=False).head(10).plot(
    kind="barh", figsize=(8,5)
)
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Influencing Final Grade")
plt.xlabel("Coefficient magnitude")
plt.tight_layout()
plt.show()
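The plot above ranks terms by absolute value, which hides whether a term raises or lowers the predicted grade. One possible way to keep the sign while ordering by magnitude is sketched below; the coefficient values are hypothetical and are not results from the fitted model.

```python
import pandas as pd

# Hypothetical coefficients, not results from the fitted model.
demo_coefs = pd.Series(
    {"G2": 3.1, "G1 G2": -0.8, "G2^2": -0.4, "failures": -0.2, "G1": 0.5}
)

# Order by magnitude, but keep the original signed values.
order = demo_coefs.abs().sort_values(ascending=False).index
top_signed = demo_coefs.reindex(order).head(3)
print(top_signed)
```

The same two lines can be applied to coef_series from the step above. Because polynomial terms are correlated with the features they are built from, individual coefficients are best read as rough indicators of influence.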
Summary
By combining polynomial feature engineering with Ridge regularisation in a unified pipeline, we aim for:
- Accurate prediction of the final course grade (G3) from earlier grades and support factors, judged by the test RMSE and R² reported in step 6.
- Nonlinear modelling of educational progress (capturing curvature in the G1→G2→G3 relationship).
- Interpretable drivers: the largest coefficient terms (e.g., G2² or G1 × studytime) point to candidate levers for academic interventions and resource planning.