Exam Score Improvement Prediction with Polynomial Regression in ML
Educators and academic advisors want to forecast a student’s improvement in exam score (the difference between final and baseline assessments) from early indicators such as baseline score, study time, test preparation participation, parental education level, and other socio‑demographic factors. The relationship between these predictors and score gains is rarely linear: extra study time tends to yield diminishing returns, and the effect of test preparation can depend on baseline aptitude. A plain linear model underfits these dynamics, while a high‑degree polynomial without regularisation overfits. Polynomial Regression on engineered features, combined with Ridge regularisation, captures smooth curvature and interactions while keeping the coefficients stable enough to interpret, which makes the forecasts usable for planning targeted interventions.
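The underfitting claim can be illustrated on synthetic data. The sketch below does not use the student dataset: the logarithmic gain curve, the noise level, and all variable names are assumptions chosen only to mimic diminishing returns from study time. It compares the in-sample RMSE of a straight line and a quadratic fitted with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: gains flatten as weekly study hours increase
hours = rng.uniform(0, 15, size=200)
gain = 8 * np.log1p(hours) + rng.normal(0, 1.0, size=200)

def fit_rmse(degree):
    """Fit a polynomial of the given degree and return its in-sample RMSE."""
    coeffs = np.polyfit(hours, gain, deg=degree)
    residuals = gain - np.polyval(coeffs, hours)
    return float(np.sqrt(np.mean(residuals ** 2)))

rmse_linear = fit_rmse(1)
rmse_quadratic = fit_rmse(2)
print(f"degree 1 RMSE: {rmse_linear:.2f}")
print(f"degree 2 RMSE: {rmse_quadratic:.2f}")
```

In-sample error can only fall as the degree rises, so this shows underfitting only. Judging overfitting needs held-out data, which is what the cross-validated search in step 6 provides.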
Dataset
Step-by-Step Code Implementation
1. Libraries Required
import pandas as pd                # data loading & handling
import numpy as np                 # numerical operations
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # enhanced visualization
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
2. Load Libraries & Data
import pandas as pd
# Adjust path to your environment
df = pd.read_csv("data/StudentsPerformance.csv")
# Preview relevant columns
df.head()[[
'math score','reading score','writing score',
'test preparation course','study time'
]]
3. Feature Engineering & Target Definition
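- Target engineering: define improvement as math score minus the average of the reading and writing scores. The reading/writing average serves here as a proxy for the baseline assessment, so the target measures how far the math score sits above or below it.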
- Feature mapping: convert categorical study‑time bands into approximate hours, and encode socio‑demographic factors with one‑hot encoding.
# Compute baseline as average of reading and writing, and final as math score
df['baseline'] = df[['reading score','writing score']].mean(axis=1)
df['improvement'] = df['math score'] - df['baseline']
# Encode study time into numeric (hours per week)
# assuming categories: '<2', '2–5', '5–10', '>10'
study_map = {'<2':1, '2–5':3.5, '5–10':7.5, '>10':12}
df['study_hours'] = df['study time'].map(study_map)
# Select features and target
X = df[[
'baseline','study_hours',
'test preparation course','gender','parental level of education'
]]
y = df['improvement']
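One thing worth checking after the mapping step: `Series.map` returns NaN for any value missing from the dictionary, so a band spelled with a plain hyphen instead of the en dash used in `study_map` would silently become missing. A minimal check, using a hypothetical toy frame rather than the real file:

```python
import pandas as pd

# Same mapping as above; the keys use an en dash, not a hyphen
study_map = {'<2': 1, '2–5': 3.5, '5–10': 7.5, '>10': 12}

# Hypothetical sample in which one band is typed with a plain hyphen
demo = pd.DataFrame({'study time': ['<2', '2–5', '2-5', '>10']})
demo['study_hours'] = demo['study time'].map(study_map)

# Collect the labels that failed to map so they can be fixed or added
unmapped = demo.loc[demo['study_hours'].isna(), 'study time'].unique().tolist()
print("Unmapped study-time labels:", unmapped)
```

Running the same two lines against `df` before modelling would surface any label mismatch early, since Ridge does not accept NaN inputs.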
4. Exploratory Data Analysis
import seaborn as sns
import matplotlib.pyplot as plt
# Check nonlinear trend: baseline vs improvement
sns.scatterplot(x='baseline', y='improvement', data=df, alpha=0.4)
plt.title("Baseline vs Score Improvement")
plt.xlabel("Baseline Score")
plt.ylabel("Improvement")
plt.show()
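A scatterplot with many overlapping points can hide curvature, so a numeric complement is to compare mean improvement across baseline bins. The sketch below runs on synthetic data: the inverted-U relationship, the bin edges, and the `demo` frame are all assumptions for illustration, not properties of the student dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical data: improvement peaks for mid-range baselines
demo = pd.DataFrame({'baseline': rng.uniform(20, 100, size=500)})
demo['improvement'] = (
    -0.002 * (demo['baseline'] - 60) ** 2 + rng.normal(0, 1.0, size=500)
)

# Bin the baseline and compare average improvement per bin
demo['baseline_bin'] = pd.cut(
    demo['baseline'], bins=[20, 40, 60, 80, 100], include_lowest=True
)
bin_means = demo.groupby('baseline_bin', observed=True)['improvement'].mean()
print(bin_means)
```

If the bin means rise and then fall (or flatten) rather than changing by a constant step, a straight line is unlikely to describe the relationship well.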
5. Build Polynomial Regression Pipeline
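- StandardScaler standardises the numeric inputs to zero mean and unit variance so that Ridge’s ℓ₂ penalty is not dominated by features measured on larger scales.
- PolynomialFeatures augments inputs with squares and interactions (e.g., baseline², baseline×study_hours, study_hours²), capturing diminishing returns and synergy between aptitude and effort. Because it runs after the preprocessor, it also expands the one‑hot encoded columns, which adds interactions between the numeric features and the encoded categories.
- Ridge regression applies ℓ₂ regularisation to control overfitting in the expanded feature space.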
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
# Separate numeric and categorical
num_cols = ['baseline','study_hours']
cat_cols = ['test preparation course','gender','parental level of education']
preprocessor = ColumnTransformer([
('num', StandardScaler(), num_cols),
('cat', OneHotEncoder(drop='first'), cat_cols)
])
pipe = Pipeline([
('prep', preprocessor),
('poly', PolynomialFeatures(include_bias=False)),
('ridge', Ridge(random_state=42))
])
6. Train/Test Split & Hyperparameter Search
GridSearchCV tunes the polynomial degree (1–3) and regularisation strength α (10⁻³…10³) via 5‑fold CV, optimising for lowest RMSE.
from sklearn.model_selection import train_test_split, GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
'poly__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
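Beyond the single best setting, the full grid of cross-validated scores shows how sensitive the model is to degree and α. The sketch below is self-contained and runs on synthetic data; `X_demo`, `y_demo`, `demo_pipe`, and `demo_gs` are hypothetical stand-ins for the objects defined above, and the data-generating formula is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data with curvature and an interaction
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = (
    3 * np.sqrt(X_demo[:, 0])
    + 0.05 * X_demo[:, 0] * X_demo[:, 1]
    + rng.normal(0, 0.5, size=200)
)

demo_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge()),
])
demo_gs = GridSearchCV(
    demo_pipe,
    {'poly__degree': [1, 2, 3], 'ridge__alpha': np.logspace(-3, 3, 7)},
    cv=5,
    scoring='neg_root_mean_squared_error',
)
demo_gs.fit(X_demo, y_demo)

# Scores are negated RMSE, so flip the sign to read them as RMSE
results = pd.DataFrame(demo_gs.cv_results_)
results['degree'] = results['param_poly__degree'].astype(int)
results['alpha'] = results['param_ridge__alpha'].astype(float)
results['cv_rmse'] = -results['mean_test_score']

# Rows: polynomial degree, columns: alpha, cells: mean CV RMSE
summary = results.pivot(index='degree', columns='alpha', values='cv_rmse')
print(summary.round(3))
```

A flat region around the best cell suggests the choice is robust; a sharp minimum at the edge of the grid suggests widening the search range.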
7. Evaluate Model
y_pred = gs.predict(X_test)
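# Take the square root of MSE; the squared=False argument was deprecated
# and later removed in newer scikit-learn versions
rmse = np.sqrt(mean_squared_error(y_test, y_pred))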
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f} points")
print(f"Test R² : {r2:.3f}")
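An RMSE figure is easier to judge next to a trivial reference. One common option is scikit-learn's `DummyRegressor`, which always predicts the training mean. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `holdout_rmse`); it illustrates the comparison and does not report results for the student dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)

# Hypothetical data: a concave relationship plus noise
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = 2 * X_demo[:, 0] - 0.1 * X_demo[:, 0] ** 2 + rng.normal(0, 1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

def holdout_rmse(model):
    """Fit on the training split and return RMSE on the held-out split."""
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

rmse_dummy = holdout_rmse(DummyRegressor(strategy='mean'))
rmse_model = holdout_rmse(
    make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=2, include_bias=False),
        Ridge(alpha=1.0),
    )
)
print(f"Mean-only baseline RMSE: {rmse_dummy:.2f}")
print(f"Polynomial Ridge RMSE  : {rmse_model:.2f}")
```

Applied to the real split, the same comparison would use `X_train`, `X_test`, and `gs.best_estimator_`; the gap between the two numbers indicates how much the features actually explain.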
8. Inspect Key Polynomial Coefficients
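Coefficient inspection shows which nonlinear or interaction terms carry the most weight in the fitted model, which can help guide resource allocation (e.g., a large baseline×study_hours term indicates that the payoff from extra study hours depends on the baseline level). Because the plot below uses absolute values, it shows the strength of each term but not its direction.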
# Retrieve polynomial feature names
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(
input_features=(
num_cols +
gs.best_estimator_.named_steps['prep']
.named_transformers_['cat']
.get_feature_names_out(cat_cols).tolist()
)
)
# Retrieve Ridge coefficients
coefs = gs.best_estimator_.named_steps['ridge'].coef_
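# Rank terms by absolute coefficient size (pandas is already imported above;
# note that .abs() discards the sign of each coefficient)
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)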
# Plot top 10
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Improvement")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
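If the direction of each effect matters, the coefficients can be ranked by magnitude while keeping their signs. A minimal sketch using the `key` argument of `Series.sort_values`; the coefficient values here are made up for illustration and are not output from the fitted model.

```python
import pandas as pd

# Hypothetical coefficients, not results from the student dataset
demo_coefs = pd.Series({
    'baseline': -2.0,
    'study_hours': 1.5,
    'baseline^2': 0.3,
    'baseline study_hours': -0.8,
})

# Sort by absolute value but keep the signed values in the result
ranked = demo_coefs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Plotting `ranked` instead of the absolute values would show negative terms as bars pointing left, which makes it clear whether a term raises or lowers the predicted improvement.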
Summary
By embedding polynomial feature engineering and Ridge regularisation in a concise pipeline, we achieve:
- Nonlinear forecasts of exam score improvement, with accuracy quantified by RMSE and R² on a held‑out test set.
- Controlled complexity, with cross‑validated degree and α limiting overfitting while still capturing key curvatures.
- Interpretable drivers, highlighting the most impactful terms (such as baseline×study_hours or baseline²) to inform targeted academic support strategies.