House Size Price Prediction using Linear Regression in ML
In real estate, the size of a house has a significant impact on its market value, and buyers, sellers, and agents all need to understand this effect. In this machine learning project, we will build a linear regression model that predicts a home’s selling price based solely on its living area (square footage). By fitting a line to historical sales records, the model quantifies how much each additional square foot contributes to the price. This lets users estimate property values quickly and make informed decisions.
Libraries Required
- Pandas: for data ingestion and manipulation
- NumPy: for numerical computations
- Matplotlib: for plotting data and results
- Scikit-learn: for model training, prediction, and evaluation
Dataset Link
Step by Step Code Implementation
1. Importing Libraries
We import pandas and numpy to manage and process data arrays, matplotlib to visualize relationships, and scikit-learn modules for model creation and metrics.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
2. Loading the Dataset
The CSV file (“house_data.csv”) contains two columns:
- Size: living area in square feet
- Price: selling price in US dollars
We load the data into a DataFrame for inspection and analysis.
# Assume 'house_data.csv' contains columns 'Size' (in sq ft) and 'Price' (in USD)
data = pd.read_csv('house_data.csv')
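Before moving on, it can help to confirm that the file really has the expected columns and to see how many rows are incomplete. The following is a minimal sketch, not part of the original walkthrough: the inline CSV text is a made-up stand-in for house_data.csv, and dropping incomplete rows is only one possible way to handle missing values.

```python
import io

import pandas as pd

# Hypothetical stand-in for house_data.csv; these rows are made up.
csv_text = """Size,Price
850,120000
1200,165000
,210000
1500,198000
"""
data = pd.read_csv(io.StringIO(csv_text))

# Confirm the expected columns are present before going further.
missing = {"Size", "Price"} - set(data.columns)
if missing:
    raise ValueError(f"Missing columns: {sorted(missing)}")

# Count missing values per column, then drop incomplete rows.
print(data.isna().sum())
clean = data.dropna(subset=["Size", "Price"]).reset_index(drop=True)
print(len(data), "rows loaded,", len(clean), "rows kept")
```

With the real file, the read_csv call would point at 'house_data.csv' instead of the in-memory text.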
3. Exploratory Data Analysis
A scatter plot lets us check whether price rises roughly linearly with size, which is the relationship a linear regression model assumes.
# Quick glimpse of data
print(data.head())
# Scatter plot: Size vs Price
plt.scatter(data['Size'], data['Price'])
plt.title('House Size vs. Price')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price (USD)')
plt.show()
4. Defining Features and Target
We designate the Size column as our independent variable (feature) and Price as the dependent variable (target). Selecting Size with double brackets keeps X as a two-dimensional DataFrame, which is the shape scikit-learn expects for features.
X = data[['Size']]  # Feature: house size
y = data['Price']   # Target: sale price
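The difference between single and double brackets is easiest to see by comparing shapes. This small sketch uses a made-up three-row DataFrame rather than the real dataset:

```python
import pandas as pd

# Made-up rows, for illustration only.
data = pd.DataFrame({"Size": [850, 1200, 1500],
                     "Price": [120000, 165000, 198000]})

X = data[["Size"]]  # double brackets: 2D DataFrame
s = data["Size"]    # single brackets: 1D Series

print(type(X).__name__, X.shape)  # DataFrame (3, 1)
print(type(s).__name__, s.shape)  # Series (3,)

# If only a Series is at hand, to_frame() restores the 2D layout.
X_alt = s.to_frame()
```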
5. Splitting into Training and Test Sets
To gauge how well our model generalises, we reserve 25% of the dataset for testing. Setting random_state=0 ensures our results are reproducible.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=0
)
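To see what the split produces, the sketch below runs train_test_split on twenty placeholder samples (not the house data) and checks that repeating the call with the same random_state gives the same partition:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Twenty placeholder samples; the values themselves do not matter here.
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(X_train), len(X_test))  # 15 training rows, 5 test rows

# Repeating the call with the same random_state yields the same partition.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=0
)
same_split = np.array_equal(X_test, X_test2)
print("Same split:", same_split)
```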
6. Training the Linear Regression Model
After instantiating LinearRegression, we call fit() on the training data. The model learns the coefficient (slope) and the intercept, which together define the best-fitting line.
model = LinearRegression()
model.fit(X_train, y_train)
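For a single feature, the best-fitting line has a simple closed form: the slope is the covariance of size and price divided by the variance of size, and the intercept follows from the means. The sketch below, on made-up numbers, computes both by hand and compares them with the coef_ and intercept_ attributes of a fitted LinearRegression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up sizes (sq ft) and prices (USD), for illustration only.
sizes = np.array([800, 1000, 1200, 1500, 1800, 2000], dtype=float)
prices = np.array([150000, 172000, 205000, 240000, 268000, 305000], dtype=float)

# Closed-form least squares for one feature:
# slope = covariance(size, price) / variance(size)
size_dev = sizes - sizes.mean()
slope = (size_dev * (prices - prices.mean())).sum() / (size_dev ** 2).sum()
intercept = prices.mean() - slope * sizes.mean()

# scikit-learn should land on the same line.
model = LinearRegression().fit(sizes.reshape(-1, 1), prices)
print(f"By hand:      slope={slope:.2f}, intercept={intercept:.2f}")
print(f"scikit-learn: slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}")
```

The slope reads as the estimated price change, in dollars, per additional square foot.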
7. Predicting on the Test Set
Using predict(), we generate price estimates for homes in the test set based on their sizes.
y_pred = model.predict(X_test)
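The same predict() call works for a single home that is not in the dataset. In this sketch the training rows are made up and lie exactly on price = 50,000 + 100 × size, so the estimate is easy to check by hand; the 1,600 sq ft home is likewise hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on price = 50,000 + 100 * size.
train = pd.DataFrame({"Size": [1000, 1500, 2000, 2500]})
train["Price"] = 50_000 + 100 * train["Size"]

model = LinearRegression().fit(train[["Size"]], train["Price"])

# predict() expects the same 2D layout and column name used in fit().
new_home = pd.DataFrame({"Size": [1600]})
estimate = model.predict(new_home)[0]
print(f"Estimated price for 1,600 sq ft: ${estimate:,.0f}")
```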
8. Evaluating Model Performance
- MAE (Mean Absolute Error): average absolute difference between predicted and actual prices; lower is better, and the value reads directly in dollars.
- MSE (Mean Squared Error): penalizes larger errors more heavily by squaring differences; its units are squared dollars, so it is harder to interpret directly.
- R² Score: proportion of variance in prices explained by size; values nearer 1.0 indicate a strong linear relationship.
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error: ${mae:,.2f}")
print(f"Mean Squared Error: {mse:,.2f} (squared USD)")
print(f"R² Score: {r2:.3f}")
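To make the three metrics concrete, the sketch below computes them by hand in plain Python for three made-up test homes. It also takes the square root of the MSE (RMSE), an addition to the original walkthrough that brings the error back into dollars.

```python
import math

# Made-up actual and predicted prices for three test homes.
y_true = [200_000, 250_000, 300_000]
y_hat = [210_000, 240_000, 330_000]

errors = [t - p for t, p in zip(y_true, y_hat)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n    # average miss, in dollars
mse = sum(e ** 2 for e in errors) / n    # in squared dollars
rmse = math.sqrt(mse)                    # back in dollars

# R² = 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  ${mae:,.2f}")
print(f"MSE:  {mse:,.2f} (squared USD)")
print(f"RMSE: ${rmse:,.2f}")
print(f"R²:   {r2:.3f}")
```

The single 30,000-dollar miss dominates the MSE, which is why the RMSE comes out larger than the MAE here.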
9. Visualizing Regression Line
We plot the regression line atop training data points to visually assess the model’s fit and identify any systematic deviations.
plt.scatter(X_train, y_train, color='lightgray', label='Training data')
plt.plot(X_train, model.predict(X_train), color='blue', linewidth=2, label='Fit line')
plt.title('Linear Regression Fit on Training Data')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price (USD)')
plt.legend()
plt.show()
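Systematic deviations are often easier to spot in the residuals (actual minus fitted price) than in the fit line itself. The sketch below is one possible check on made-up numbers; to stay self-contained it fits the line with np.polyfit rather than reusing the scikit-learn model from the steps above.

```python
import numpy as np

# Made-up sizes (sq ft) and prices (USD), for illustration only.
sizes = np.array([800, 1000, 1200, 1500, 1800, 2000], dtype=float)
prices = np.array([150000, 172000, 205000, 240000, 268000, 305000], dtype=float)

# Fit a straight line and compute residuals = actual - fitted.
slope, intercept = np.polyfit(sizes, prices, deg=1)
residuals = prices - (slope * sizes + intercept)

# With an intercept term, least-squares residuals average out to about zero.
print(f"Mean residual: {residuals.mean():.4f}")

# A run of same-signed residuals at one end of the size range
# would hint that a straight line is missing some curvature.
for size, res in zip(sizes, residuals):
    print(f"{size:6.0f} sq ft  residual {res:10.2f}")
```

Plotting these residuals against size with plt.scatter would give the visual version of the same check.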
Summary
This project demonstrates how to predict house prices solely from size using a linear regression model. In practice, this approach provides a transparent baseline; practitioners can refine it with larger datasets or explore polynomial and regularised regression methods to capture non‑linear dynamics.
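As a rough illustration of the polynomial option, the sketch below compares a straight-line fit with a degree-2 fit on synthetic data that was generated with deliberate curvature. The data, the degree, and the pipeline layout are all assumptions made for this example, and the scores are computed on the training data only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with deliberate curvature (not real sales records).
sizes = np.linspace(800, 3000, 12).reshape(-1, 1)
prices = 50_000 + 50 * sizes.ravel() + 0.02 * sizes.ravel() ** 2

# Baseline: a straight line through the points.
linear = LinearRegression().fit(sizes, prices)

# Polynomial regression: expand size into [size, size^2], then fit a line.
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(sizes, prices)

linear_r2 = linear.score(sizes, prices)
quadratic_r2 = quadratic.score(sizes, prices)
print(f"Linear R²:    {linear_r2:.4f}")
print(f"Quadratic R²: {quadratic_r2:.4f}")
```

For a regularised variant, the final LinearRegression step could be swapped for sklearn.linear_model.Ridge. On real data the comparison should be made on a held-out test set, as in step 5.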