Student Test Score Prediction using Linear Regression in ML


Predicting student performance on standardised tests can help educators, schools, and colleges identify students who need extra support. In this machine learning project, we develop a simple linear regression model that takes the number of hours a student has studied as input and predicts their test score. By fitting a line to past student data, the model estimates the relationship between study time and performance, which lets us forecast scores for new students and inform their study plans.

Libraries Required

pandas: for data loading and manipulation
numpy: for numerical operations
matplotlib: for visualizing data and results
scikit-learn: for building and evaluating the linear regression model

Dataset Link

Student Score Dataset

Step by Step Code Implementation

1. Importing Libraries

We import pandas and numpy for data handling, matplotlib for plotting, and scikit-learn modules for model creation and evaluation.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

2. Loading the Dataset

The CSV file contains two columns: ‘Hours’ (the number of hours studied) and ‘Score’ (the percentage score). We load it into a DataFrame to inspect and process.

# Assume data.csv has columns: 'Hours' and 'Score'
data = pd.read_csv('data.csv')
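
Before going further, it can help to check that the file has the columns we expect and no missing values. The following is a minimal sketch: the in-memory CSV text and the names sample, expected_columns, and missing_columns are assumptions made for illustration, and the values are made up rather than taken from the real dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv; these values are made up for illustration.
csv_text = """Hours,Score
1.5,20
3.2,35
5.1,52
7.4,70
8.9,88
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Check that both expected columns are present
expected_columns = ["Hours", "Score"]
missing_columns = [c for c in expected_columns if c not in sample.columns]

# Count missing values in each column
null_counts = sample.isnull().sum()

print("Shape:", sample.shape)
print("Missing columns:", missing_columns)
print("Null counts per column:")
print(null_counts)
```

With the real file, the same checks can be run directly on the data DataFrame loaded above.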

3. Exploratory Analysis

We first print the first few rows to confirm the data loaded as expected, then draw a scatter plot of scores against study hours. If the points fall roughly along a rising straight line, a simple linear model is a reasonable choice.

# Display first few rows
print(data.head())

# Scatter plot to visualize relationship
plt.scatter(data['Hours'], data['Score'])
plt.title('Study Hours vs Test Score')
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.show()

4. Feature and Label Selection

We select ‘Hours’ as the independent variable (feature) and ‘Score’ as the dependent variable (label). Even though it’s a single column, we wrap ‘Hours’ in double brackets so that it remains a two-dimensional DataFrame, which is the shape scikit-learn expects for a feature matrix.

X = data[['Hours']]      # Feature matrix (hours studied)
y = data['Score']        # Target vector (test scores)
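
The difference between single and double brackets is easiest to see by comparing shapes. This is a small sketch on a made-up three-row DataFrame; the names df, as_series, and as_frame are hypothetical and not part of the project code.

```python
import pandas as pd

# Tiny made-up DataFrame, used only to compare the two selection styles
df = pd.DataFrame({"Hours": [1.0, 2.0, 3.0], "Score": [20, 40, 60]})

as_series = df["Hours"]    # single brackets: one-dimensional Series
as_frame = df[["Hours"]]   # double brackets: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```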

5. Splitting into Training and Test Sets

To assess how well the model generalizes to unseen students, we reserve 20% of the data for testing. Fixing random_state makes the split reproducible, so the same rows land in the test set on every run.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
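
To illustrate what the split produces, here is a sketch on ten made-up rows. The data and the names X_demo, y_demo, and same_split are assumptions for illustration only. It checks the sizes of the two subsets and that repeating the split with the same random_state selects the same test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows, used only to demonstrate the split
hours = np.arange(1, 11, dtype=float)
df = pd.DataFrame({"Hours": hours, "Score": 10 * hours})
X_demo = df[["Hours"]]
y_demo = df["Score"]

# First split: 20% of 10 rows goes to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Second split with the same random_state
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# True when both splits picked the same test rows
same_split = X_te.index.equals(X_te2.index)

print("Train rows:", len(X_tr), "Test rows:", len(X_te))
print("Same test rows on both runs:", same_split)
```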

6. Training the Linear Regression Model

We instantiate LinearRegression and call the fit() method on the training data. The model learns the best-fitting line parameters (slope and intercept).

model = LinearRegression()
model.fit(X_train, y_train)
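
After fitting, the slope and intercept are available through the coef_ and intercept_ attributes of the estimator. The sketch below uses made-up points that lie exactly on the line score = 10 * hours + 5, so the fitted values are easy to check; demo_model, slope, and intercept are hypothetical names chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that lie exactly on score = 10 * hours + 5
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scores = 10.0 * hours.ravel() + 5.0

demo_model = LinearRegression().fit(hours, scores)

slope = demo_model.coef_[0]        # change in score per extra hour studied
intercept = demo_model.intercept_  # predicted score at zero hours

print(f"Slope: {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
```

On the project's model, the same attributes (model.coef_ and model.intercept_) describe the fitted line; the slope reads as the expected change in score for each additional hour of study.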

7. Making Predictions

predict() applies the learned relationship to hours in the test set, producing estimated scores.

y_pred = model.predict(X_test)
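
The same method can forecast scores for students who are not in the dataset. As a sketch under stated assumptions (the training values are made up, and demo_model, new_students, predicted, and manual are hypothetical names), the example below also shows that predict() amounts to slope * hours + intercept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data lying on score = 10 * hours + 5
train = pd.DataFrame({
    "Hours": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Score": [15.0, 25.0, 35.0, 45.0, 55.0],
})
demo_model = LinearRegression().fit(train[["Hours"]], train["Score"])

# New students, passed with the same column name used during training
new_students = pd.DataFrame({"Hours": [6.5, 9.25]})
predicted = demo_model.predict(new_students)

# The same result computed by hand from the fitted line
manual = demo_model.coef_[0] * new_students["Hours"].to_numpy() + demo_model.intercept_

print(predicted)
print(np.allclose(predicted, manual))
```

Keep in mind that a straight line extrapolates without limit, so predictions for hours far outside the training range can exceed 100 or fall below 0.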

8. Evaluating Model Performance

Mean Squared Error (MSE) measures the average squared difference between actual and predicted scores; lower is better. It is expressed in squared score units.
R² Score indicates the proportion of variance in test scores explained by the model; values closer to 1 signal a stronger fit.

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
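
To make the two metrics concrete, the sketch below computes them by hand with NumPy and compares the results with scikit-learn. The actual and predicted arrays are made-up numbers, not results from the dataset, and the names mse_manual and r2_manual are chosen only for this example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted scores, for illustration only
actual = np.array([20.0, 35.0, 50.0, 80.0])
predicted = np.array([25.0, 30.0, 55.0, 75.0])

residuals = actual - predicted

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"Manual MSE: {mse_manual:.2f}")
print(f"Manual R²: {r2_manual:.4f}")
print("Matches scikit-learn:",
      np.isclose(mse_manual, mean_squared_error(actual, predicted)),
      np.isclose(r2_manual, r2_score(actual, predicted)))
```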

9. Visualizing the Regression Line

We overlay the regression line on training data, illustrating how well the line captures the data trend.

# Plot training data and regression line
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.plot(X_train, model.predict(X_train), color='red', label='Regression line')
plt.title('Linear Regression Fit')
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.legend()
plt.show()

Summary

In this project, we developed a simple, interpretable linear regression model to predict student test scores from the number of hours studied. We saw that even a simple model can provide valuable insights: educators can estimate expected performance and identify students at risk of low scores. With an R² of nearly 0.9 and a low MSE (values will vary based on the data and the train/test split), the linear approach proves effective for this two-variable scenario. Future extensions could incorporate additional features, such as attendance or prior grades, or explore polynomial regression for nonlinear relationships, which may further improve predictive power.
