Exam Preparation Time Prediction using Linear Regression in ML
Educators often advise students to “study more,” yet few can translate that advice into the number of hours an individual needs to achieve a desired score.
Using a public “study‑hours ⇄ exam‑score” dataset, we fit a simple linear‑regression model that captures the relationship between preparation time and marks. The resulting equation lets us work forwards (predict the score a student might obtain for a given number of study hours) or backwards (estimate how many hours a student should plan to reach a target grade). The model serves as a transparent baseline before exploring richer, personalised recommendations.
Libraries Required
- pandas # tabular wrangling
- numpy # numerical helpers
- matplotlib.pyplot # quick scatter & fit line
- scikit‑learn # model, split, metrics
- joblib # save the trained model
Dataset Link
Step by Step Code Implementation
Why linear regression? Within normal ranges, exam scores often rise roughly proportionally with extra study time; a straight‑line fit supplies an interpretable first‑order model.
1. Import essentials
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
df = pd.read_csv("study_hours_vs_exam_scores.csv")
print(df.head())
# Expected columns: 'Hours' (float) and 'Scores' (percentage)
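If the CSV is not at hand, a synthetic stand-in with the same two columns lets the remaining steps run end to end. This is only a sketch: the helper name make_synthetic_study_data, the slope of 9.5 points per hour, the intercept of 5 and the noise level are all invented for illustration and do not come from the real dataset.

```python
import numpy as np
import pandas as pd


def make_synthetic_study_data(n_rows=25, seed=0):
    """Build a hypothetical Hours/Scores table (NOT the real dataset)."""
    rng = np.random.default_rng(seed)
    hours = rng.uniform(1.0, 9.5, size=n_rows).round(1)
    # Assumed linear trend plus noise; all three numbers are made up.
    scores = 5.0 + 9.5 * hours + rng.normal(0.0, 5.0, size=n_rows)
    # Keep scores inside a valid percentage range
    scores = np.clip(scores, 0, 100).round(0)
    return pd.DataFrame({"Hours": hours, "Scores": scores})


df_demo = make_synthetic_study_data()
print(df_demo.head())
```

Any conclusions drawn from such a table say nothing about real students; it is useful only for checking that the pipeline runs.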
3. Basic sanity check
A scatter plot of the raw data shows at a glance whether a straight line is plausible; a curved pattern or a cluster of large deviations would signal the need for polynomial terms or a different algorithm.
# Plot raw relationship
plt.scatter(df['Hours'], df['Scores'])
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score (%)")
plt.title("Study Hours vs Exam Score")
plt.show()
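The visual check can be backed by a few numbers. The sketch below counts missing values and out-of-range entries and reports the Pearson correlation between the two columns. The function name sanity_report and the toy table are hypothetical, and the 0 to 100 score range is an assumption based on the scores being percentages.

```python
import pandas as pd


def sanity_report(df, x_col="Hours", y_col="Scores"):
    """Return simple data-quality numbers for a two-column study dataset."""
    return {
        "rows": len(df),
        "missing": int(df[[x_col, y_col]].isna().sum().sum()),
        "negative_hours": int((df[x_col] < 0).sum()),
        "scores_out_of_range": int(((df[y_col] < 0) | (df[y_col] > 100)).sum()),
        # Close to +1 suggests a strong increasing linear relationship
        "pearson_r": float(df[x_col].corr(df[y_col])),
    }


# Tiny made-up table, for illustration only
toy = pd.DataFrame({"Hours": [1.0, 2.0, 3.0, 4.0], "Scores": [20, 40, 60, 80]})
print(sanity_report(toy))
```

A high correlation does not rule out curvature, so the numbers complement the plot rather than replace it.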
4. Prepare features & label
X = df[['Hours']] # 2-D input expected by scikit-learn
y = df['Scores']
5. Train‑test split
Train‑test split keeps 20 % of the records unseen during fitting, giving an honest estimate of predictive performance.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
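To see what the split does in practice, the sketch below applies the same call to a hypothetical 25-row table (the numbers are made up) and checks two things: how many rows land on each side, and that a fixed random_state reproduces the same test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 25-row table, used only to show the split sizes
demo = pd.DataFrame({"Hours": [h * 0.5 for h in range(1, 26)]})
demo["Scores"] = 5 + 7 * demo["Hours"]

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["Hours"]], demo["Scores"], test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))

# Same random_state -> same rows land in the test set on every run
X_tr2, X_te2, _, _ = train_test_split(
    demo[["Hours"]], demo["Scores"], test_size=0.2, random_state=42)
print(X_te.index.equals(X_te2.index))
```

With only a handful of test rows, the resulting metrics can swing noticeably from one random_state to another, so they are best read as rough indications.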
6. Model training
linreg = LinearRegression()
linreg.fit(X_train, y_train)
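With a single feature, the fit has a simple closed form: the slope is the covariance of hours and scores divided by the variance of hours, and the intercept makes the line pass through the two means. The sketch below computes both by hand on made-up numbers and compares them with what LinearRegression learns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up numbers for illustration only
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([22.0, 29.0, 41.0, 48.0, 60.0])

# Closed-form least squares for a single feature
m = np.sum((hours - hours.mean()) * (scores - scores.mean())) / np.sum(
    (hours - hours.mean()) ** 2)
b = scores.mean() - m * hours.mean()

# scikit-learn should arrive at the same slope and intercept
model = LinearRegression().fit(hours.reshape(-1, 1), scores)
print(m, b)
print(model.coef_[0], model.intercept_)
```

Both routes give the same line up to floating-point rounding, which is part of what makes this baseline easy to audit.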
7. Evaluation
R² and MAE tell different stories: R² is the share of the variance in scores that the model explains, while MAE is the typical absolute error in score points. Together they help tutors decide whether the rule of thumb is actionable.
y_pred = linreg.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} percentage points")
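Both metrics are short formulas, and computing them by hand shows what each one measures. The sketch below uses made-up test values (not results from the dataset) and checks the manual numbers against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up test-set values for illustration only
y_true = np.array([30.0, 50.0, 70.0, 90.0])
y_hat = np.array([35.0, 45.0, 75.0, 85.0])

# MAE: average size of the miss, in score points
mae = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (squared error of the model divided by the squared error
# of always predicting the mean of y_true)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, r2)
print(mean_absolute_error(y_true, y_hat), r2_score(y_true, y_hat))
```

Here every prediction misses by 5 points, so the MAE is 5, and the model removes 95 % of the squared error that a mean-only guess would make.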
8. Inspect the fitted line
Single‑feature simplicity makes the maths clear: Score = m × Hours + b. Once m (slope) and b (intercept) are learned, you can invert the equation to find the required hours for any realistic score goal.
coef = linreg.coef_[0] # slope
inter = linreg.intercept_ # y‑intercept
print(f"Score = {coef:.2f} × Hours + {inter:.2f}")
# overlay regression line on scatterplot
plt.scatter(df['Hours'], df['Scores'], label="Actual")
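# reuse the training column name so scikit-learn does not warn about missing feature names
x_line = pd.DataFrame({'Hours': np.linspace(0, df['Hours'].max(), 100)})
plt.plot(x_line['Hours'], linreg.predict(x_line), color='red', label="Fitted line")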
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score (%)")
plt.legend()
plt.show()
9. Utility helper – predict hours for a target score
Helper function hours_needed() wraps the inversion step so web apps or dashboards can surface personalised study‑time recommendations instantly.
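def hours_needed(target_score):
    """
    Estimate study hours required for a desired percentage.
    Returns None if the fitted slope is not positive.
    """
    if coef <= 0: # no positive score-per-hour trend to invert
        return None
    # clamp at 0 for targets below the intercept
    return max((target_score - inter) / coef, 0)

hours = hours_needed(85)
# avoid formatting None with :.1f
print("No estimate available" if hours is None else f"≈ Hours needed for 85 %: {hours:.1f}")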
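The inversion will happily return hour counts far beyond anything seen in the training data, where a straight line is unlikely to hold. One possible guard, sketched below, refuses to answer outside a caller-supplied ceiling. The name hours_needed_bounded, its parameters and the coefficients in the example are all hypothetical; in practice max_hours might be set to the largest 'Hours' value in the training set.

```python
def hours_needed_bounded(target_score, slope, intercept, max_hours):
    """Invert score = slope * hours + intercept, refusing to extrapolate.

    Returns None when the slope is not positive or when the answer would
    exceed max_hours (e.g. the largest value of 'Hours' seen in training).
    """
    if slope <= 0:
        return None
    hours = max((target_score - intercept) / slope, 0.0)
    return hours if hours <= max_hours else None


# Illustrative coefficients only, not fitted values
print(hours_needed_bounded(85, slope=10.0, intercept=5.0, max_hours=9.0))
print(hours_needed_bounded(100, slope=10.0, intercept=5.0, max_hours=9.0))
```

With these illustrative numbers the first call gives 8.0 hours, while the second would need 9.5 hours, exceeds the ceiling and returns None. Passing the coefficients as arguments also keeps the helper free of global state, which makes it easier to test.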
10. Persist the model
Model persistence with joblib allows the same coefficients to serve real‑time advice in a classroom portal without retraining.
joblib.dump(linreg, "exam_prep_time_linreg.pkl")
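Loading the saved file back is the other half of persistence. The sketch below trains a throwaway model on a tiny made-up table, saves it to a temporary directory, reloads it with joblib.load and predicts a score. The temporary path and the toy data are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny made-up training table, for illustration only
train = pd.DataFrame({"Hours": [1.0, 2.0, 3.0, 4.0],
                      "Scores": [20.0, 40.0, 60.0, 80.0]})
model = LinearRegression().fit(train[["Hours"]], train["Scores"])

# Save to a temporary path, then load it back as another process would
path = os.path.join(tempfile.mkdtemp(), "exam_prep_time_linreg.pkl")
joblib.dump(model, path)
restored = joblib.load(path)

# Keep the same column name so scikit-learn sees matching feature names
query = pd.DataFrame({"Hours": [2.5]})
print(restored.predict(query)[0])
```

Files written by joblib are pickles, so they should only be loaded from trusted sources, ideally with the same scikit-learn version that produced them.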
Summary
This compact workflow turns a transparent linear fit into a practical calculator for study planning. Teachers can plug in a target score and hand back a round‑number estimate of preparation hours, backed by real data instead of guesswork. While individual learning rates vary, starting with this interpretable baseline establishes trust, highlights outliers for further mentoring, and lays the groundwork for more nuanced models that fold in subject difficulty, prior knowledge, and learning style.