Property Valuation Prediction using Stepwise Regression in ML
Real estate professionals need to predict property market values accurately—a critical task for pricing and investment decisions.
In this property valuation prediction ML project, we’ll use stepwise regression to predict house_price, the price per unit area of residential properties, from physical and locational attributes such as house age, distance to the nearest MRT station, number of nearby convenience stores, and latitude/longitude.
By iteratively selecting the most significant predictors, we’ll build a concise linear model that balances interpretability with predictive accuracy—helping stakeholders understand which factors drive property values.
Libraries Required
import pandas as pd  # Data handling
import numpy as np  # Numerical operations
import statsmodels.api as sm  # OLS regression
from sklearn.model_selection import train_test_split  # Train/test split
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation
import matplotlib.pyplot as plt  # Visualization
Dataset
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load and inspect a UCI‐derived Kaggle dataset of 414 transactions in New Taipei City, Taiwan, containing transaction date, house age, MRT distance, convenience store count, coordinates, and price per unit area.
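# Block 1: Load UCI Real Estate Valuation dataset
# Kaggle downloads need a logged-in session, so pd.read_csv cannot read the
# download link directly. Download the CSV manually from
# https://www.kaggle.com/datasets/dskagglemt/real-estate-valuation-by-uci
# and point `path` at the local copy (the filename below is an assumption).
path = "Real estate.csv"
df = pd.read_csv(path)
print(df.head())
df.info()  # info() prints its report itself and returns None
print(df.describe())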
Data Preprocessing
Columns are renamed for clarity. We drop any incomplete records, then define X as numerical predictors (house age, distance, store count, latitude, longitude) and y as the target price (house_price).
Dataset features after renaming: transaction_date, house_age, distance_to_MRT, num_stores, latitude, longitude, house_price
# Block 2: Clean & prepare features
# Rename for clarity
df = df.rename(columns={
    'X1 transaction date': 'transaction_date',
    'X2 house age': 'house_age',
    'X3 distance to the nearest MRT station': 'distance_to_MRT',
    'X4 number of convenience stores': 'num_stores',
    'X5 latitude': 'latitude',
    'X6 longitude': 'longitude',
    'Y house price of unit area': 'house_price'
})
# Drop any rows with missing values (column names must match the renamed ones)
df = df.dropna(subset=[
    'transaction_date', 'house_age', 'distance_to_MRT',
    'num_stores', 'latitude', 'longitude', 'house_price'
])
# Define predictors and target
X = df[[
    'house_age', 'distance_to_MRT', 'num_stores',
    'latitude', 'longitude'
]]
y = df['house_price']
# Optional: scale or transform features if needed
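The optional transform mentioned in the last comment could look like the sketch below. It assumes that distance_to_MRT is right-skewed and benefits from a log transform, which the original does not state. The helper name `transform_features` and the toy values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real predictors (values are made up).
X_demo = pd.DataFrame({
    "house_age": [5.0, 13.3, 32.0, 19.5],
    "distance_to_MRT": [85.0, 390.0, 2175.0, 6400.0],
    "num_stores": [10, 5, 1, 0],
})

def transform_features(X):
    """Hypothetical helper: log-transform the distance, then z-score every column."""
    out = X.astype(float).copy()
    # log1p compresses the long right tail of the distance values
    out["distance_to_MRT"] = np.log1p(out["distance_to_MRT"])
    # Standardize each column to mean 0 and (population) standard deviation 1
    return (out - out.mean()) / out.std(ddof=0)

X_scaled = transform_features(X_demo)
print(X_scaled.round(3))
```

Standardizing alone does not change OLS p-values for the slopes, so it would not alter which features stepwise selection picks; the log transform can. In a real pipeline the means and standard deviations would be computed on the training split only and then reused on the test split.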
Train/Test Split
We hold out 20% of the data to evaluate generalization.
# Block 3: Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
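To make the 20% hold-out concrete, the following sketch runs the same call on placeholder arrays with 414 rows, the size of the dataset described earlier. The arrays `X_demo` and `y_demo` are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (414)
X_demo = np.arange(414 * 5).reshape(414, 5)
y_demo = np.arange(414)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (331, 5) (83, 5)

# A fixed random_state makes the split reproducible across runs
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))  # True
```

With only 83 test rows, the test metrics will carry noticeable sampling noise, so small differences between candidate models should be read with caution.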
Stepwise Regression Function
The stepwise_selection function implements:
- Forward inclusion: adds the excluded variable with the lowest p‑value < 0.01.
- Backward elimination: removes the included variable with the highest p‑value > 0.05.
Iteration stops when no further changes occur, yielding a parsimonious set of significant predictors.
# Block 4: Forward–backward stepwise feature selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Start from the given predictors, or from an empty model
    included = list(initial_list) if initial_list else []
    while True:
        changed = False
        # Forward step: consider adding each excluded predictor
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()  # NaN when nothing is left to add
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add {best_var:20} p-value {best_pval:.4f}")
        # Backward step: consider removing each included predictor
        if included:  # skip when no predictor has been selected yet
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.iloc[1:]  # exclude intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_var = pvals.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:20} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
- We fit an Ordinary Least Squares regression on the selected features via statsmodels. The .summary() output details coefficient estimates, p‑values, R², adjusted R², and diagnostic statistics (AIC, F‑statistic).
- We predict on held‑out data and compute Test R² (variance explained) and RMSE (prediction error scale) to quantify model performance.
# Block 5: Select features
selected = stepwise_selection(X_train, y_train)
# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set
X_test_sel = sm.add_constant(X_test[selected])
y_pred = model.predict(X_test_sel)
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
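For readers who want to see what the two metrics measure, here is a small hand computation on made-up numbers. The four prices below are hypothetical and unrelated to the dataset.

```python
import numpy as np

# Hypothetical actual and predicted unit-area prices
y_true = np.array([30.0, 40.0, 50.0, 60.0])
y_hat = np.array([32.0, 38.0, 53.0, 57.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares: 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 500

r2 = 1.0 - ss_res / ss_tot             # share of variance explained
rmse = np.sqrt(ss_res / len(y_true))   # typical error, in the units of y

print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")  # R2 = 0.948, RMSE = 2.550
```

R² is unitless, while RMSE is expressed in the same units as house_price, which makes it easier to compare against typical prices in the data.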
Residual Diagnostics
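A plot of residuals against predicted values helps reveal non‑random patterns or heteroscedasticity (residual spread that changes with the predicted value). Either would suggest that the OLS assumptions do not hold and that the reported p‑values should be treated with caution.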
# Block 6: Plot residuals vs. predicted to check assumptions
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Predicted House Price (Unit Area)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Price")
plt.show()
Summary
By applying stepwise regression to real‐estate transaction data, we aim to isolate the key drivers of unit‐area price, which may include proximity to MRT stations and the number of nearby stores, while excluding less informative variables.
The resulting linear model favors interpretability (clear coefficient effects and significance tests), and its predictive accuracy should be judged from the test R² and RMSE printed in Block 5. Together these give real‑estate analysts and appraisers a transparent tool to forecast property values and guide pricing strategies.