Student Performance Prediction with Lasso Regression in ML
Schools collect abundant demographic and academic data, yet teachers seldom have a concise, data‑driven way to pinpoint which factors truly affect exam outcomes. This project builds a Lasso‑regularised linear model that:
- Forecasts a pupil’s average exam score (0‑100) before test day, using easy‑to‑capture attributes such as gender, lunch type, parental education, and prior test‑prep.
- Shrinks weak predictors to zero, revealing the handful of levers teachers can address first—without wading through every variable in the report card.
Because Lasso’s ℓ1 penalty balances fit and sparsity, the resulting model stays interpretable for counsellors and policymakers.
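For reference, the trade-off between fit and sparsity can be written as the objective that scikit‑learn's Lasso minimises. The notation here is an assumption made for illustration: n is the number of training rows, X the encoded feature matrix, y the target scores, w the coefficient vector, b the unpenalised intercept, and α the penalty strength.

```latex
\min_{w,\,b}\; \frac{1}{2n}\,\lVert y - Xw - b\mathbf{1} \rVert_2^2 \;+\; \alpha\,\lVert w \rVert_1
```

The first term rewards fit to the training scores; the second charges for every non‑zero coefficient in proportion to its size, which is what pushes weak predictors to exactly zero.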
Libraries Required
| Purpose | Library |
| --- | --- |
| Data handling | pandas, numpy |
| Visuals | matplotlib, seaborn |
| ML pipeline | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, Pipeline, Lasso, GridSearchCV |
| Metrics | mean_squared_error, r2_score |
Dataset
Students’ Performance in Exams
Step-by-Step Code Implementation
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
Download & load dataset
1,000 high‑school records with scores in maths, reading, and writing, plus five categorical predictors (gender, race/ethnicity, parental level of education, lunch, and test preparation course).
# one‑time shell command (Kaggle API key required):
# kaggle datasets download -d spscientist/students-performance-in-exams -p data --unzip
data = pd.read_csv("data/StudentsPerformance.csv") # 1 000 rows, 8 columns
Initial inspection & quick visuals
print(data.head())
print(data.isna().sum())  # no missing values
sns.pairplot(data[['math score', 'reading score', 'writing score']])
plt.show()
Target & feature engineering
The target is the mean of the three subject marks (total_score), which yields a single 0‑100 metric. The three subject scores are then dropped from the features so the model never sees the marks it is asked to predict.
# Mean of three subject scores as overall performance
data['total_score'] = data[['math score', 'reading score', 'writing score']].mean(axis=1)
X = data.drop(columns=['math score', 'reading score', 'writing score', 'total_score'])
y = data['total_score']
Pre‑processing pipeline
ColumnTransformer one‑hot‑encodes categorical features while StandardScaler (ready for any numeric inputs) keeps units comparable so Lasso’s penalty treats each predictor fairly.
cat_cols = X.select_dtypes(include='object').columns
num_cols = X.select_dtypes(exclude='object').columns # may be empty here
preprocess = ColumnTransformer(
    # sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
    [('cats', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
     ('nums', StandardScaler(), num_cols)],
    remainder='passthrough')
Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=data['test preparation course'])
Build & tune Lasso model
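A grid search over α values looks for the best trade‑off between fit and sparsity: a larger α zeroes more coefficients but can underfit. Five‑fold cross‑validation averages the error over five splits, so the chosen α depends less on one lucky or unlucky split.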
pipe = Pipeline([
    ('prep', preprocess),
    ('model', Lasso(max_iter=10_000, random_state=42))
])
param_grid = {'model__alpha': np.logspace(-2, 1, 20)}  # 0.01 → 10
grid = GridSearchCV(pipe, param_grid, cv=5,
                    scoring='neg_root_mean_squared_error')
grid.fit(X_train, y_train)
print("Best alpha (shrinkage factor):", grid.best_params_['model__alpha'])
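Why a larger α produces more zeros can be seen from the soft‑thresholding operator, sketched below in plain Python. This is a simplified illustration, not the solver scikit‑learn runs: it assumes standardised, uncorrelated features, in which case each Lasso coefficient is the unpenalised coefficient shrunk toward zero by α. The function name and the three starting coefficients are hypothetical.

```python
def soft_threshold(value: float, alpha: float) -> float:
    """Shrink value toward zero by alpha; return 0.0 when |value| <= alpha."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


# Hypothetical unpenalised coefficients for three standardised, uncorrelated features.
unpenalised = [4.0, -1.5, 0.3]

for alpha in (0.1, 1.0, 2.0):
    shrunk = [soft_threshold(c, alpha) for c in unpenalised]
    zeroed = sum(1 for c in shrunk if c == 0.0)
    print(f"alpha={alpha}: {zeroed} of {len(shrunk)} coefficients are exactly zero")
```

In this toy case the number of zeroed coefficients rises from 0 to 1 to 2 as α grows, which is the behaviour the grid search trades off against prediction error.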
Evaluate on the hold‑out set
RMSE keeps error in “score” units; R² indicates how much variance in exam results is captured.
y_pred = grid.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False is deprecated in recent scikit-learn
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f} | R²: {r2:.3f}")
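As a cross‑check on what these two numbers mean, the sketch below computes RMSE and R² by hand on three made‑up scores. The helper names and toy values are assumptions for illustration; in the pipeline itself the scikit‑learn functions do this work.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the scores."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Share of variance explained: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy scores, illustrative only.
toy_true = [60.0, 70.0, 80.0]
toy_pred = [62.0, 70.0, 78.0]
print(f"RMSE: {rmse(toy_true, toy_pred):.2f} | R²: {r_squared(toy_true, toy_pred):.3f}")
```

An R² of 0 corresponds to always predicting the mean score, so values near zero on the hold‑out set would mean the demographic features add little beyond that baseline.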
Interpret feature importance
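Non‑zero bars highlight candidate drivers (e.g., a completed test‑prep course or parental education level); bars shrunk to zero contributed little to this model's predictions at the chosen α. Because drop='first' removes one level per feature, each coefficient is measured relative to that dropped baseline level. The coefficients describe associations in this dataset rather than causes, so treat a zeroed feature as a candidate for dropping from future data collection, not as proof that it is irrelevant.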
ohe = grid.best_estimator_.named_steps['prep'].named_transformers_['cats']
ohe_names = ohe.get_feature_names_out(cat_cols)
all_features = np.concatenate([ohe_names, num_cols])
coeffs = grid.best_estimator_.named_steps['model'].coef_
imp = pd.Series(coeffs, index=all_features).sort_values(key=abs, ascending=False)
plt.figure(figsize=(9,6))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Lasso Coefficients')
plt.xlabel('Coefficient value')
plt.show()
Summary
This compact pipeline shows how Lasso regression turns raw exam‑record CSVs into:
- A forward‑looking score prediction, with its hold‑out error reported as RMSE in score points.
- A ranked list of performance drivers, allowing educators to focus on the factors that matter most.
Because preprocessing, tuning, and modelling sit within a single Pipeline, updating the model with a new semester’s data is a single re‑fit—no manual wrangling required.