Real Estate Development Cost Prediction Using Stepwise Regression in ML

Real estate developers must accurately estimate the total development cost, which includes land acquisition, construction, permits, and soft costs. In this project, we will predict Development_Cost for residential properties based on features such as land area, number of floors, built-up area, locality characteristics (median income, proximity to amenities), and historical construction costs in the region. By applying stepwise regression, we'll isolate the most impactful drivers of cost and build an interpretable linear model that balances simplicity with predictive accuracy, helping developers budget projects more reliably and plan their ROI.

Libraries Required

import pandas as pd                                        # Data manipulation  
import numpy as np                                         # Numerical operations  
import statsmodels.api as sm                               # OLS regression  
from sklearn.model_selection import train_test_split       # Data splitting  
from sklearn.metrics import r2_score, mean_squared_error   # Evaluation  
import matplotlib.pyplot as plt                            # Visualization

Dataset

Real Estate Properties Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load a real estate properties dataset, extended with a Development_Cost column, and examine its structure and summary statistics to understand variable ranges.

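# Block 1: Load dataset
# Kaggle downloads require a signed-in session, so pd.read_csv cannot read
# this URL directly. Download the CSV manually from:
# https://www.kaggle.com/datasets/shudhanshusingh/real-estate-properties-dataset/download
# then point csv_path at your local copy (the filename below is a placeholder).
csv_path = "real_estate_properties.csv"
df = pd.read_csv(csv_path)

# Assumption: a 'Development_Cost' column has been added to the dataset
print(df.head())
df.info()  # info() prints its own report and returns None
print(df.describe())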
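
Because the tutorial assumes several specific columns, it can help to check for them right after loading. The following is a minimal sketch, not part of the original workflow: the helper name missing_columns is hypothetical, and the column list is simply copied from the preprocessing step, so it is an assumption about the CSV and not a documented schema.

```python
import pandas as pd

# Column names copied from the preprocessing step of this tutorial; they are
# assumptions about the CSV, not a documented schema.
REQUIRED_COLUMNS = [
    "Land_Area", "BuiltUp_Area", "Num_Floors",
    "Median_Income", "Proximity_Amenities", "Development_Cost",
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Tiny made-up frame that lacks the target column
demo = pd.DataFrame({
    "Land_Area": [1200.0], "BuiltUp_Area": [900.0], "Num_Floors": [2],
    "Median_Income": [55000.0], "Proximity_Amenities": [1.5],
})
print(missing_columns(demo))  # ['Development_Cost']
```

Calling missing_columns(df) on the real data and stopping when the result is non-empty gives a clearer error than a KeyError later in the pipeline.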

Data Preprocessing

Categorical fields (Locality, Property_Type) are one‑hot encoded. We drop any records missing key numeric features (land area, built‑up area, floors, income indicators) or the target. We then split the remaining features (X) and the response (y) into 80% for training and 20% for testing.

# Block 2: Clean & encode features
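# One-hot encode categorical locality or property type if present.
# dtype=float keeps the dummies numeric; recent pandas versions default to
# bool, which statsmodels OLS rejects when mixed with float columns.
categorical_cols = ['Locality', 'Property_Type']
df_enc = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=float)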

# Drop rows with missing key columns
df_enc = df_enc.dropna(subset=[
    'Land_Area', 'BuiltUp_Area', 'Num_Floors',
    'Median_Income', 'Proximity_Amenities', 'Development_Cost'
])

# Define predictors and target
X = df_enc.drop([
    'Property_ID', 'Sale_Price', 'Purchase_Price', 'Development_Cost'
], axis=1, errors='ignore')
y = df_enc['Development_Cost']

# Train–test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
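
To make the encoding step concrete, here is a small illustration on made-up data (the locality names and areas are invented for the example). With drop_first=True, the first category in sorted order is dropped and acts as the baseline, which avoids perfect collinearity with the intercept that OLS adds later.

```python
import pandas as pd

# Made-up rows; 'East' sorts first, so it becomes the dropped baseline level
toy = pd.DataFrame({
    "Locality": ["North", "South", "East", "North"],
    "Land_Area": [1000.0, 1500.0, 800.0, 1200.0],
})

encoded = pd.get_dummies(toy, columns=["Locality"], drop_first=True, dtype=float)
print(list(encoded.columns))  # ['Land_Area', 'Locality_North', 'Locality_South']
print(encoded)
```

A row with zeros in both dummy columns belongs to the baseline locality, so each dummy coefficient is read as a cost difference relative to that baseline.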

Stepwise Regression Function

The stepwise_selection function alternates between forward inclusion (adding the excluded predictor with the lowest p‑value below 0.01) and backward elimination (removing the included predictor with the highest p‑value above 0.05) until convergence, yielding a concise set of significant cost drivers.

# Block 3: Forward–backward stepwise selection
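def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Use None instead of a mutable default list, then copy into a fresh list
    included = list(initial_list) if initial_list is not None else []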
    while True:
        changed = False

        # Forward step
        # Preserve the column order of X so repeated runs behave identically
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")

        # Backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

Using the selected features, we fit an Ordinary Least Squares regression via statsmodels. The summary provides coefficient estimates, p‑values, R², and diagnostic metrics, clarifying each predictor’s marginal impact on development cost.

We predict on the held‑out test set and compute R² (variance explained) and RMSE (root‑mean‑square error) to quantify how well the model generalises to new projects.

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

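# Predict on test set
# has_constant='add' forces an intercept column even if a selected feature
# happens to be constant within the test split
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')
y_pred = model.predict(X_test_sel)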

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
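
For readers who want to see what the two metrics compute, here is a small worked example on made-up numbers. R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean, and RMSE is the square root of the mean squared residual, expressed in the same units as the target.

```python
import numpy as np

# Made-up actual and predicted costs; every prediction is off by exactly 10
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # 4 * 10^2 = 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 150^2 + 50^2 + 50^2 + 150^2 = 50000
r2 = 1.0 - ss_res / ss_tot                      # 1 - 400 / 50000
rmse = np.sqrt(ss_res / len(y_true))            # sqrt(400 / 4)

print(f"R2 = {r2:.3f}, RMSE = {rmse:.1f}")      # R2 = 0.992, RMSE = 10.0
```

Because RMSE is in cost units, it is often the easier number to communicate: here a typical prediction misses by about 10.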

Residual Diagnostics

A scatter plot of residuals versus predicted costs helps reveal non-random patterns or heteroscedasticity. Either one would suggest that the OLS assumptions are violated and that the reported p-values should be treated with caution.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Development Cost")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()

Summary

By applying stepwise regression to real estate development data, we isolate the most influential factors (candidates include land area, built-up area, number of floors, and local median income) while pruning redundant variables. The final linear model aims to balance interpretability (few, statistically significant predictors) with predictive performance, measured by test-set R² and RMSE, giving developers a transparent tool to forecast project costs, improve budget accuracy, and optimise financial planning.
