Student Performance Prediction with Lasso Regression in ML

Schools collect abundant demographic and academic data, yet teachers seldom have a concise, data‑driven way to pinpoint which factors truly affect exam outcomes. This project builds a Lasso‑regularised linear model that:

  • Forecasts a pupil’s average exam score (0‑100) before test day, using easy‑to‑capture attributes such as gender, lunch type, parental education, and prior test‑prep.
  • Shrinks weak predictors to zero, revealing the handful of levers teachers can address first—without wading through every variable in the report card.

Because Lasso’s ℓ1 penalty balances fit and sparsity, the resulting model stays interpretable for counsellors and policymakers.
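To make the sparsity claim concrete, here is a minimal sketch on synthetic data, not the student dataset. The sample size, true coefficients, and alpha values are illustrative assumptions. scikit‑learn's Lasso minimises (1/(2n))·||y − Xw||² + α·||w||₁, so raising α removes more predictors, and above a threshold (called alpha_max here, a name chosen for this sketch) every coefficient is exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): only the first two of six features matter.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 6
X_demo = rng.normal(size=(n_samples, n_features))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(size=n_samples)

# Smallest alpha at which every coefficient is exactly zero
# (computed on centred data, because Lasso fits an intercept by default).
X_centred = X_demo - X_demo.mean(axis=0)
y_centred = y_demo - y_demo.mean()
alpha_max = np.abs(X_centred.T @ y_centred).max() / n_samples

alphas = [0.01, 0.5, 1.01 * alpha_max]
nonzero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_demo, y_demo)
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))

for alpha, count in zip(alphas, nonzero_counts):
    print(f"alpha={alpha:.3f} -> {count} non-zero coefficients")
```

The last alpha sits just above the threshold, so no predictor survives; the grid search later in the article looks for a value well below that point.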

Libraries Required

Purpose | Library
Data handling | pandas, numpy
Visuals | matplotlib, seaborn
ML pipeline | scikit‑learn: ColumnTransformer, OneHotEncoder, StandardScaler, Pipeline, Lasso, GridSearchCV
Metrics | scikit‑learn: mean_squared_error, r2_score

Dataset

Students’ Performance in Exams (Kaggle dataset spscientist/students-performance-in-exams)

Step-by-Step Code Implementation

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

Download & load dataset

1,000 high‑school records with scores in maths, reading and writing, plus five categorical predictors (gender, race/ethnicity, parental level of education, lunch, test preparation course).

# one‑time shell command (Kaggle API key required):
# kaggle datasets download -d spscientist/students-performance-in-exams -p data --unzip

data = pd.read_csv("data/StudentsPerformance.csv")   # 1 000 rows, 8 columns

Initial inspection & quick visuals

print(data.head())
print(data.isna().sum())                 # no missing values
sns.pairplot(data[['math score','reading score','writing score']]); plt.show()

Target & feature engineering

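The target is the mean of the three subject marks (stored as total_score), which gives a single 0‑100 metric. The three subject columns are then dropped from X, since the target is computed directly from them.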

# Mean of three subject scores as overall performance
data['total_score'] = data[['math score','reading score','writing score']].mean(axis=1)

X = data.drop(columns=['math score','reading score','writing score','total_score'])
y = data['total_score']
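As a quick check of the arithmetic, the sketch below applies the same steps to three made‑up pupils. The values are illustrative and are not rows from the dataset.

```python
import pandas as pd

# Three made-up pupils (illustrative values, not rows from the dataset).
toy = pd.DataFrame({
    'math score':    [70, 55, 90],
    'reading score': [80, 65, 95],
    'writing score': [75, 60, 100],
})
score_cols = ['math score', 'reading score', 'writing score']

# Row-wise mean of the three marks becomes the target.
toy['total_score'] = toy[score_cols].mean(axis=1)

# Drop the component scores so they cannot leak into the predictors.
toy_X = toy.drop(columns=score_cols + ['total_score'])

print(toy['total_score'].tolist())
print(toy_X.shape)
```

In the toy frame nothing is left in toy_X because it holds only score columns; in the real data the categorical columns remain as predictors.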

Pre‑processing pipeline

ColumnTransformer one‑hot‑encodes the categorical features. StandardScaler is wired in for any numeric inputs (none remain in this dataset once the three score columns are dropped), so that units stay comparable and Lasso’s penalty treats each predictor fairly.

cat_cols = X.select_dtypes(include='object').columns
num_cols = X.select_dtypes(exclude='object').columns  # may be empty here

# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
preprocess = ColumnTransformer(
    [('cats', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
     ('nums', StandardScaler(), num_cols)],
    remainder='passthrough')

Train/test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=data['test preparation course'])
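The stratify argument keeps the share of each test‑prep group roughly equal in the train and test sets. A minimal sketch with a toy frame shows the effect; the 30/70 split between groups is an assumption made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (assumption): 30 of 100 pupils completed the prep course.
toy = pd.DataFrame({
    'test preparation course': ['completed'] * 30 + ['none'] * 70,
    'total_score': range(100),
})

train, test = train_test_split(
    toy, test_size=0.2, random_state=42,
    stratify=toy['test preparation course'])

# Share of pupils who completed the course in each partition.
train_rate = (train['test preparation course'] == 'completed').mean()
test_rate = (test['test preparation course'] == 'completed').mean()
print(f"completed share: train={train_rate:.2f}, test={test_rate:.2f}")
```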

Build & tune Lasso model

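Grid search over 20 log‑spaced α values (0.01 to 10) looks for the balance between fit and sparsity: a small α keeps more predictors, a large α zeroes more of them. Five‑fold cross‑validation averages the error over five train/validation splits, so a single lucky or unlucky split does not decide the winner.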

pipe = Pipeline([
    ('prep', preprocess),
    ('model', Lasso(max_iter=10_000, random_state=42))
])

param_grid = {'model__alpha': np.logspace(-2, 1, 20)}   # 0.01 → 10
grid = GridSearchCV(pipe, param_grid, cv=5,
                    scoring='neg_root_mean_squared_error')
grid.fit(X_train, y_train)

print("Best alpha (shrinkage factor):", grid.best_params_['model__alpha'])
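One detail that is easy to miss: with scoring='neg_root_mean_squared_error' the stored score is the negated RMSE, because GridSearchCV always maximises. The sketch below shows how to read it back, using synthetic data and hypothetical names (demo_pipe, demo_grid) rather than the student pipeline.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption), standing in for the student features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 5))
y_demo = 4.0 * X_demo[:, 0] + rng.normal(size=150)

demo_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', Lasso(max_iter=10_000)),
])
demo_grid = GridSearchCV(
    demo_pipe,
    {'model__alpha': np.logspace(-2, 1, 20)},
    cv=5,
    scoring='neg_root_mean_squared_error',
)
demo_grid.fit(X_demo, y_demo)

# scikit-learn maximises scores, so RMSE is stored negated; flip the sign back.
cv_rmse = -demo_grid.best_score_
best_alpha = demo_grid.best_params_['model__alpha']
print(f"best alpha={best_alpha:.3f}, cross-validated RMSE={cv_rmse:.2f}")
```

The same two lines at the end work on the article's grid object, which gives a cross‑validated RMSE to compare with the hold‑out figure.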

Evaluate on the hold‑out set

RMSE keeps error in “score” units; R² indicates how much variance in exam results is captured.

y_pred = grid.predict(X_test)
# squared=False is deprecated in recent scikit-learn; taking the root works in every version
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f} | R²: {r2:.3f}")
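For readers who want to see what the two metrics compute, here is a worked example on three made‑up pupils. The numbers are illustrative only and are not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted scores for three pupils (illustrative only).
y_true = np.array([70.0, 80.0, 90.0])
y_hat = np.array([72.0, 78.0, 93.0])

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R² by hand: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both agree with scikit-learn's implementations.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)
print(f"RMSE: {rmse_manual:.2f} | R²: {r2_manual:.3f}")
```

Here the squared errors are 4, 4 and 9, so RMSE is the square root of 17/3, and R² is 1 − 17/200.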

Interpret feature importance

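Non‑zero bars highlight candidate drivers (e.g., completing the test‑prep course or parental education level). Because drop='first' removes one level per feature, each coefficient is read relative to that omitted baseline category. Zeroed bars mean the predictor added little once the others were in the model at the chosen α; that is a reason to deprioritise it, not proof that it has no influence.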

ohe = grid.best_estimator_.named_steps['prep'].named_transformers_['cats']
ohe_names = ohe.get_feature_names_out(cat_cols)
all_features = np.concatenate([ohe_names, num_cols])

coeffs = grid.best_estimator_.named_steps['model'].coef_
imp = pd.Series(coeffs, index=all_features).sort_values(key=abs, ascending=False)

plt.figure(figsize=(9,6))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Lasso Coefficients')
plt.xlabel('Coefficient value')
plt.show()
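Alongside the chart, it can help to list which predictors were kept and which were zeroed. The sketch below does this for a hypothetical coefficient Series; the feature names follow the one‑hot naming pattern, but the numbers are made up and are not results from this model.

```python
import pandas as pd

# Hypothetical coefficients (made-up numbers, not results from this model).
demo_coefs = pd.Series({
    'test preparation course_none': -4.0,
    'lunch_standard': 6.5,
    'gender_male': 0.0,
    'race/ethnicity_group B': 0.0,
})

# Treat anything within a tiny tolerance of zero as dropped by the penalty.
is_zero = demo_coefs.abs() < 1e-8
kept = demo_coefs[~is_zero].sort_values(key=abs, ascending=False)
dropped = demo_coefs[is_zero].index.tolist()

print("Kept:", kept.index.tolist())
print("Zeroed:", dropped)
```

Applied to the real model, demo_coefs would be replaced by the imp Series built above.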

Summary

This compact pipeline shows how Lasso regression turns raw exam‑record CSVs into:

  • A forward‑looking score prediction, with its error reported in score points (RMSE).
  • A ranked list of performance drivers, allowing educators to focus on the factors that matter most.

Because preprocessing, tuning, and modelling sit within a single Pipeline, updating the model with a new semester’s data is a single re‑fit—no manual wrangling required.

ProjectGurukul Team
