Exam Score Improvement Prediction with Polynomial Regression in ML


Educators and academic advisors want to forecast a student’s improvement in exam score (the difference between final and baseline assessments) from early indicators such as baseline score, study time, test preparation participation, parental education level, and other socio‑demographic factors. The relationship between these predictors and score gains is unlikely to be linear: extra study time tends to yield diminishing returns, and test preparation may interact with baseline aptitude. A plain linear model can underfit these dynamics, while a high‑degree polynomial without regularisation can overfit. Applying Polynomial Regression to engineered features with Ridge regularisation lets us capture smooth curvature and interactions while keeping the model interpretable enough to guide targeted interventions.

Dataset

Student Performance in Exams

Step-by-Step Code Implementation

1. Libraries Required

import pandas as pd                              # data loading & handling  
import numpy as np                               # numerical operations  

import matplotlib.pyplot as plt                  # plotting  
import seaborn as sns                            # enhanced visualization  

from sklearn.model_selection import train_test_split, GridSearchCV  
from sklearn.preprocessing import StandardScaler, PolynomialFeatures  
from sklearn.linear_model import Ridge  
from sklearn.pipeline import Pipeline  
from sklearn.metrics import mean_squared_error, r2_score

2. Load Libraries & Data

import pandas as pd

# Adjust path to your environment
df = pd.read_csv("data/StudentsPerformance.csv")

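# Preview relevant columns
# Note: 'study time' is assumed to be present in your copy of the file;
# the commonly shared Kaggle version of StudentsPerformance.csv may not include it.
df.head()[[
    'math score','reading score','writing score',
    'test preparation course','study time'
]]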

3. Feature Engineering & Target Definition

  • Target engineering: define improvement as math score minus the average of the reading and writing scores. Because all three scores come from the same row, this is a proxy for gain over a baseline rather than a measured before/after change.
  • Feature mapping: convert categorical study‑time bands into approximate hours, and encode socio‑demographic factors with one‑hot encoding (done later, inside the pipeline).

# Compute baseline as average of reading and writing, and final as math score
df['baseline'] = df[['reading score','writing score']].mean(axis=1)
df['improvement'] = df['math score'] - df['baseline']

# Encode study time into numeric (hours per week)
# assuming categories: '<2', '2–5', '5–10', '>10'
study_map = {'<2':1, '2–5':3.5, '5–10':7.5, '>10':12}
df['study_hours'] = df['study time'].map(study_map)

# Select features and target
X = df[[
    'baseline','study_hours',
    'test preparation course','gender','parental level of education'
]]
y = df['improvement']
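
One thing worth checking after the mapping step: `Series.map` returns NaN for any label that is not a key in the dictionary, and it does so silently. The sketch below uses a hypothetical toy frame (not the real CSV) in which one label is typed with a plain hyphen instead of an en dash, to show how such mismatches can be spotted before they reach the model.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real CSV; the labels are assumptions.
# Note the plain hyphen in '5-10', which does not match the en dash key '5–10'.
toy = pd.DataFrame({"study time": ["<2", "2–5", "5-10", ">10"]})

study_map = {"<2": 1, "2–5": 3.5, "5–10": 7.5, ">10": 12}
toy["study_hours"] = toy["study time"].map(study_map)

# Labels missing from the dictionary become NaN instead of raising an error
unmapped = toy.loc[toy["study_hours"].isna(), "study time"].unique().tolist()
print("Unmapped labels:", unmapped)
```

If the list is not empty, either extend the dictionary or normalise the labels before mapping; Ridge will not accept NaN inputs.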

4. Exploratory Data Analysis

import seaborn as sns
import matplotlib.pyplot as plt

# Check nonlinear trend: baseline vs improvement
sns.scatterplot(x='baseline', y='improvement', data=df, alpha=0.4)
plt.title("Baseline vs Score Improvement")
plt.xlabel("Baseline Score")
plt.ylabel("Improvement")
plt.show()

5. Build Polynomial Regression Pipeline

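  • StandardScaler standardises the numeric inputs to zero mean and unit variance, so Ridge’s ℓ² penalty is not dominated by features measured on larger scales.
  • PolynomialFeatures augments inputs with squares and interactions (e.g., baseline², baseline×study_hours, study_hours²), capturing diminishing returns and synergy between aptitude and effort. Because it follows the whole preprocessing step in the pipeline below, the one‑hot columns are expanded as well, so the feature count grows quickly with the degree.
  • Ridge regression applies ℓ² regularisation to control overfitting in the expanded feature space.

from sklearn.pipeline import Pipeline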
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge

# Separate numeric and categorical
num_cols = ['baseline','study_hours']
cat_cols = ['test preparation course','gender','parental level of education']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(drop='first'), cat_cols)
])

pipe = Pipeline([
    ('prep', preprocessor),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge(random_state=42))
])

6. Train/Test Split & Hyperparameter Search

GridSearchCV tunes the polynomial degree (1–3) and regularisation strength α (10⁻³…10³) via 5‑fold CV, optimising for lowest RMSE.

from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7)
}

gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print("Best parameters:", gs.best_params_)
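
A detail that is easy to miss: scikit-learn always maximises its scoring function, so with `neg_root_mean_squared_error` the stored `best_score_` is the negative of the RMSE. The sketch below runs the same kind of search on synthetic data (a quadratic signal plus noise, an assumption made purely for illustration) and flips the sign to read the cross-validated RMSE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (illustrative assumption): quadratic signal plus Gaussian noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(-2, 2, size=(200, 1))
y_syn = 1.5 * X_syn[:, 0] ** 2 - X_syn[:, 0] + rng.normal(0, 0.5, size=200)

pipe_syn = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])

gs_syn = GridSearchCV(
    pipe_syn,
    {"poly__degree": [1, 2, 3], "ridge__alpha": np.logspace(-3, 3, 7)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
gs_syn.fit(X_syn, y_syn)

# Scores are stored negated, so flip the sign to get the RMSE back
cv_rmse = -gs_syn.best_score_
print("Best parameters:", gs_syn.best_params_)
print(f"Cross-validated RMSE: {cv_rmse:.3f}")
```

The same `-gs.best_score_` expression applies to the search in this step and gives a cross-validated estimate to compare against the test RMSE later.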

7. Evaluate Model

y_pred = gs.predict(X_test)
# np.sqrt is used because the squared=False argument was removed in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

print(f"Test RMSE: {rmse:.2f} points")
print(f"Test R²  : {r2:.3f}")
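
An RMSE is easier to judge next to a trivial reference. One common choice, sketched here with made-up numbers rather than the tutorial's data, is a `DummyRegressor` that always predicts the training mean. The names `y_train_demo` and `y_test_demo` are hypothetical stand-ins for `y_train` and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up targets standing in for y_train / y_test
y_train_demo = np.array([2.0, 4.0, 6.0, 8.0])
y_test_demo = np.array([3.0, 7.0])

# DummyRegressor ignores the features and always predicts the training mean
baseline_model = DummyRegressor(strategy="mean")
baseline_model.fit(np.zeros((4, 1)), y_train_demo)
baseline_pred = baseline_model.predict(np.zeros((2, 1)))

baseline_rmse = np.sqrt(mean_squared_error(y_test_demo, baseline_pred))
print(f"Baseline RMSE: {baseline_rmse:.2f} points")
```

If the tuned pipeline's test RMSE is not clearly below this kind of baseline, the polynomial terms are adding little.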

8. Inspect Key Polynomial Coefficients

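Coefficient inspection shows which nonlinear or interaction terms carry the most weight in predicting improvement, which can help guide resource allocation (e.g., the fitted terms might indicate that more study hours yield greater gains for mid‑range baselines). Ridge shrinks coefficients and the polynomial terms are correlated with one another, so treat the ranking as a rough guide rather than a causal statement.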

# Retrieve polynomial feature names
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(
    input_features=(
        num_cols +
        gs.best_estimator_.named_steps['prep']
          .named_transformers_['cat']
          .get_feature_names_out(cat_cols).tolist()
    )
)

# Retrieve Ridge coefficients
coefs = gs.best_estimator_.named_steps['ridge'].coef_

import pandas as pd
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)

# Plot top 10
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Improvement")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
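
Taking `.abs()` before sorting discards the sign, so the plot cannot say whether a term raises or lowers the predicted improvement. A small variation, sketched here with invented coefficient values, ranks by magnitude but keeps the signed numbers.

```python
import pandas as pd

# Hypothetical coefficients; the values are invented for illustration only
demo_coefs = pd.Series({
    "baseline": -0.8,
    "study_hours": 0.5,
    "baseline^2": 0.1,
    "baseline study_hours": -0.3,
})

# Rank by absolute size, then look the signed values back up in that order
order = demo_coefs.abs().sort_values(ascending=False).index
signed_top = demo_coefs.reindex(order)
print(signed_top)
```

Applied to the tutorial's own series, this would mean building `pd.Series(coefs, index=feat_names)` first and reindexing it the same way.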

Summary

By embedding polynomial feature engineering and Ridge regularisation in a concise pipeline, we aim for:

  • Nonlinear forecasts of exam score improvement, with accuracy checked on held‑out data via test RMSE and R².
  • Controlled complexity, with the degree and α chosen by cross‑validation to limit overfitting while still capturing key curvature.
  • Interpretable drivers, highlighting the most impactful terms (such as baseline×study_hours or baseline²) to inform targeted academic support strategies.


ProjectGurukul Team

