Student Test Score Prediction using Linear Regression in ML
Predicting student performance on standardised tests can help educators, institutes, and colleges identify candidates who need extra support. In this machine learning project, we develop a simple linear regression model that takes a student’s number of study hours as input and predicts their test score. By fitting a line to past student data, the model estimates the relationship between study time and performance, which lets us forecast scores for new students and inform their study plans.
Libraries Required
- pandas: for data loading and manipulation
- numpy: for numerical operations
- matplotlib: for visualizing data and results
- scikit-learn: for building and evaluating the linear regression model
Dataset Link
Step by Step Code Implementation
1. Importing Libraries
We import pandas and numpy for data handling, matplotlib for plotting, and scikit-learn modules for model creation and evaluation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
2. Loading the Dataset
The CSV file contains two columns: ‘Hours’ (the number of hours studied) and ‘Score’ (the percentage score). We load it into a DataFrame to inspect and process.
# Assume data.csv has columns: 'Hours' and 'Score'
data = pd.read_csv('data.csv')
3. Exploratory Analysis
A scatter plot shows whether scores tend to increase with the number of study hours. If the points fall roughly along a straight line, a simple linear model is a reasonable choice.
# Display first few rows
print(data.head())
# Scatter plot to visualize relationship
plt.scatter(data['Hours'], data['Score'])
plt.title('Study Hours vs Test Score')
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.show()
4. Feature and Label Selection
We select ‘Hours’ as the independent variable (feature) and ‘Score’ as the dependent variable (label). Even though it’s a single column, we wrap ‘Hours’ in double brackets so that it remains a two-dimensional DataFrame, which is the input shape scikit-learn expects for features.
X = data[['Hours']]  # Feature matrix (hours studied)
y = data['Score']    # Target vector (test scores)
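The difference between single and double brackets can be checked directly. The following is a minimal sketch on a hypothetical three-row DataFrame (the values are made up and stand in for data.csv); the names X_2d and x_1d are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for data.csv
toy = pd.DataFrame({"Hours": [1.5, 3.0, 4.5], "Score": [20, 35, 50]})

X_2d = toy[["Hours"]]  # double brackets: a DataFrame with rows and one column
x_1d = toy["Hours"]    # single brackets: a one-dimensional Series

print(type(X_2d).__name__, X_2d.shape)
print(type(x_1d).__name__, x_1d.shape)
```

The DataFrame has a two-dimensional shape (rows, columns), while the Series has only one dimension, which is why the double-bracket form is used for the feature matrix.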
5. Splitting into Training and Test Sets
To assess generalization, we reserve 20% of the data for testing. Setting random_state to a fixed value makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
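To see what the split does, here is a small sketch on hypothetical data for ten students (not the tutorial's dataset). It checks the sizes of the two subsets and shows that repeating the split with the same random_state selects the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 students, values made up for illustration
toy = pd.DataFrame({
    "Hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Score": [15, 25, 35, 45, 55, 65, 75, 85, 95, 99],
})
X_toy, y_toy = toy[["Hours"]], toy["Score"]

# 20% of 10 rows go to the test set, the rest to the training set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))

# Repeating the split with the same seed picks the same test rows
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
same_rows = list(X_te.index) == list(X_te2.index)
print(same_rows)
```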
6. Training the Linear Regression Model
We instantiate LinearRegression and call the fit() method on the training data. The model learns the best-fitting line parameters (slope and intercept).
model = LinearRegression()
model.fit(X_train, y_train)
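After fitting, the learned slope and intercept can be read from the model's coef_ and intercept_ attributes. The sketch below uses hypothetical, perfectly linear toy data (Score = 10 * Hours + 5) so the recovered parameters are easy to verify; real data will not fit this cleanly.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data that follows Score = 10 * Hours + 5 exactly
hours = pd.DataFrame({"Hours": [1.0, 2.0, 3.0, 4.0, 5.0]})
scores = 10 * hours["Hours"] + 5

toy_model = LinearRegression()
toy_model.fit(hours, scores)

slope = toy_model.coef_[0]        # change in predicted score per extra hour
intercept = toy_model.intercept_  # predicted score at zero hours
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The slope is the quantity of most interest here: it is the model's estimate of how many points an additional hour of study is associated with.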
7. Making Predictions
predict() applies the learned relationship to hours in the test set, producing estimated scores.
y_pred = model.predict(X_test)
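The same predict() call can be used for students who are not in the dataset. This is a minimal sketch on hypothetical training data; the names toy_model and new_students are illustrative. The new inputs are passed as a DataFrame with the same column name used in training.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy training data, not the tutorial's dataset
train = pd.DataFrame({
    "Hours": [2.0, 4.0, 6.0, 8.0],
    "Score": [25.0, 45.0, 65.0, 85.0],
})
toy_model = LinearRegression().fit(train[["Hours"]], train["Score"])

# New students: keep the column name the model was trained on
new_students = pd.DataFrame({"Hours": [5.0, 9.5]})
predicted = toy_model.predict(new_students)

for h, s in zip(new_students["Hours"], predicted):
    print(f"{h} hours -> predicted score {s:.1f}")
```

A linear model does not know that percentage scores are capped at 100, so predictions for hours well outside the range seen in training should be treated with caution.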
8. Evaluating Model Performance
- Mean Squared Error (MSE) measures the average squared difference between actual and predicted scores; lower is better.
- R² Score indicates the proportion of variance in test scores explained by our model; values closer to 1 signal a strong fit.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
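Both metrics are simple enough to compute by hand, which helps when interpreting them. The sketch below uses hypothetical actual and predicted scores and compares a manual NumPy calculation with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted scores, made up for illustration
y_true = np.array([20.0, 40.0, 60.0, 80.0])
y_hat = np.array([25.0, 35.0, 65.0, 75.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

Because MSE is expressed in squared score units, taking its square root (the RMSE) gives an error in the same units as the test scores.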
9. Visualizing the Regression Line
We overlay the regression line on training data, illustrating how well the line captures the data trend.
# Plot training data and regression line
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.plot(X_train, model.predict(X_train), color='red', label='Regression line')
plt.title('Linear Regression Fit')
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.legend()
plt.show()
Summary
In this project, we built a simple, interpretable linear regression model to predict student test scores from the number of hours studied. Even a model this simple can provide useful insight: educators can estimate expected performance and identify students at risk of low scores. With an R² of nearly 0.9 and a low MSE (values will vary with the data), the linear approach proves effective for this two-variable scenario. Future extensions could incorporate additional features, such as attendance or prior grades, or explore polynomial regression for nonlinear relationships, which may further improve predictive power.