Property Valuation Prediction using Stepwise Regression in ML

Real estate professionals need to predict property market values accurately—a critical task for pricing and investment decisions.

In this property valuation prediction ML project, we’ll predict house_price (the price per unit area) of residential properties from physical and locational attributes, namely house age, distance to the nearest MRT station, number of nearby convenience stores, latitude, and longitude, using stepwise regression. The dataset also records a transaction date, which this walkthrough does not use as a predictor.

By iteratively selecting the most significant predictors, we’ll build a concise linear model that balances interpretability with predictive accuracy—helping stakeholders understand which factors drive property values.

Libraries Required

import pandas as pd               # Data handling  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation  
import matplotlib.pyplot as plt   # Visualization

Dataset

Real Estate Valuation by UCI

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load and inspect a UCI‐derived Kaggle dataset of 414 transactions in New Taipei City, Taiwan, containing transaction date, house age, MRT distance, convenience store count, coordinates, and price per unit area.

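# Block 1: Load UCI Real Estate Valuation dataset
# The Kaggle link below needs a logged-in session, so pd.read_csv cannot fetch
# it directly; download the CSV manually and point csv_path at the local copy.
url = "https://www.kaggle.com/datasets/dskagglemt/real-estate-valuation-by-uci/download"
csv_path = "real_estate.csv"  # assumed local filename; adjust to your download
df = pd.read_csv(csv_path)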

print(df.head())
print(df.info())
print(df.describe())
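
Because the renaming step that follows depends on exact column headers, it may help to verify them right after loading. The sketch below is only an illustration: it uses a one-row stand-in DataFrame with made-up values, a hypothetical helper named missing_columns, and it assumes the headers match the UCI originals.

```python
import pandas as pd

# Column headers assumed to match the UCI originals used in Block 2.
EXPECTED_COLUMNS = (
    "X1 transaction date",
    "X2 house age",
    "X3 distance to the nearest MRT station",
    "X4 number of convenience stores",
    "X5 latitude",
    "X6 longitude",
    "Y house price of unit area",
)

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from frame (hypothetical helper)."""
    return [name for name in expected if name not in frame.columns]

# Stand-in for the loaded data: one made-up row, last header deliberately wrong.
demo = pd.DataFrame(
    [[2013.0, 10.0, 500.0, 5, 24.97, 121.54, 40.0]],
    columns=list(EXPECTED_COLUMNS[:-1]) + ["Y house price"],
)
print(missing_columns(demo))
```

On the real data, calling missing_columns(df) should return an empty list; anything else points at a header that the rename in Block 2 would silently skip.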

Data Preprocessing

Columns are renamed for clarity. We drop any incomplete records, then define X as numerical predictors (house age, distance, store count, latitude, longitude) and y as the target price (house_price).

Dataset features after renaming: transaction_date, house_age, distance_to_MRT, num_stores, latitude, longitude, house_price

# Block 2: Clean & prepare features
# Rename for clarity
df = df.rename(columns={
    'X1 transaction date':'transaction_date',
    'X2 house age':'house_age',
    'X3 distance to the nearest MRT station':'distance_to_MRT',
    'X4 number of convenience stores':'num_stores',
    'X5 latitude':'latitude',
    'X6 longitude':'longitude',
    'Y house price of unit area':'house_price'
})

# Drop any rows with missing values (names must match the renamed columns)
df = df.dropna(subset=[
    'transaction_date','house_age','distance_to_MRT',
    'num_stores','latitude','longitude','house_price'
])

# Define predictors and target
X = df[[
    'house_age','distance_to_MRT','num_stores',
    'latitude','longitude'
]]
y = df['house_price']

# Optional: scale or transform features if needed
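
As one possible transform, a log of distance_to_MRT could be tried, on the assumption that the distances are right-skewed (worth checking against df.describe() first). The snippet below is a minimal sketch on made-up values; the column name log_distance_to_MRT is introduced here for illustration and is not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Made-up distances, only to illustrate the transform.
demo = pd.DataFrame({"distance_to_MRT": [25.0, 300.0, 1200.0, 6400.0]})

# log1p(x) = log(1 + x) stays finite at x = 0 and compresses large distances.
demo["log_distance_to_MRT"] = np.log1p(demo["distance_to_MRT"])
print(demo)
```

Rescaling alone (for example, standardizing) does not change OLS p-values, so a nonlinear transform like this is the kind of change that can actually alter which features stepwise selection keeps.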

Train/Test Split

We hold out 20% of the data to evaluate generalization.

# Block 3: Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
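
To see what the 20% hold-out means in row counts, the sketch below splits a stand-in array of 414 indices (the dataset size quoted earlier) with the same settings. It assumes no rows were dropped during preprocessing; if any were, the counts will differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 414 rows; only the row count matters here.
rows = np.arange(414)
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)
print(len(train_rows), len(test_rows))
```

With a test set this small, the test metrics can shift noticeably between random seeds, so a single split should be read as a rough estimate.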

Stepwise Regression Function

The stepwise_selection function implements:

Forward inclusion: adds the excluded variable with the lowest p‑value < 0.01.
Backward elimination: removes the included variable with the highest p‑value > 0.05.

Iteration stops when no further changes occur, yielding a parsimonious set of significant predictors.

# Block 4: Forward–backward stepwise feature selection
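def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting features; None avoids a mutable default argument
    included = list(initial_list) if initial_list is not None else []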
    while True:
        changed = False

        # Forward step: consider adding each excluded predictor
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:20} p-value {best_pval:.4f}")

        # Backward step: consider removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:20} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

We fit an Ordinary Least Squares regression on the selected features via statsmodels. The .summary() output details coefficient estimates, p‑values, R², adjusted R², and diagnostic statistics (AIC, F‑statistic).
We predict on held‑out data and compute Test R² (variance explained) and RMSE (prediction error scale) to quantify model performance.

# Block 5: Select features
selected = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set
X_test_sel = sm.add_constant(X_test[selected])
y_pred = model.predict(X_test_sel)

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
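
For readers who want to see what the two metrics compute, here is a minimal NumPy sketch using the usual definitions: R² = 1 − SS_res / SS_tot, and RMSE as the square root of the mean squared residual. The helper name r2_and_rmse and the four sample values are made up for illustration.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Compute R^2 and RMSE from their definitions (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / y_true.size)
    return r2, rmse

# Made-up values: SS_res = 0.5, SS_tot = 20, so R^2 = 0.975.
r2, rmse = r2_and_rmse([3, 5, 7, 9], [2.5, 5, 7.5, 9])
print(r2, rmse)
```

RMSE is expressed in the same units as house_price, which makes it easier to communicate than R² when discussing typical prediction error.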

Residual Diagnostics

A residuals vs. predicted plot checks for non‑random dispersion or heteroscedasticity—validating OLS assumptions and model reliability.

# Block 6: Plot residuals vs. predicted to check assumptions
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Predicted House Price (Unit Area)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Price")
plt.show()

Summary

By applying stepwise regression to real‐estate transaction data, we identify which predictors of unit‐area price (for example, proximity to MRT stations and the number of nearby stores) remain statistically significant, and drop the rest.

The resulting linear model is easy to interpret, since each coefficient comes with a sign, a magnitude, and a significance test, and its predictive accuracy can be read from the test R² and RMSE printed in Block 5. This gives real‑estate analysts and appraisers a transparent tool for forecasting property values and guiding pricing strategies.
