Real Estate Development Cost Prediction Using Stepwise Regression in ML
Real estate developers must accurately estimate total development cost, which includes land acquisition, construction, permits, and soft costs. In this project, we predict Development_Cost for residential properties from features such as land area, number of floors, built-up area, locality characteristics (median income, proximity to amenities), and historical construction costs in the region. By applying stepwise regression, we isolate the most impactful cost drivers and build an interpretable linear model that balances simplicity with predictive accuracy, helping developers budget projects more reliably and plan for ROI.
Libraries Required
import pandas as pd                                           # Data manipulation
import numpy as np                                            # Numerical operations
import statsmodels.api as sm                                  # OLS regression
from sklearn.model_selection import train_test_split          # Data splitting
from sklearn.metrics import r2_score, mean_squared_error      # Evaluation
import matplotlib.pyplot as plt                               # Visualization
Dataset
Real Estate Properties Dataset
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load a broad real estate properties dataset—enhanced with a Development_Cost column—and examine its structure and summary statistics to understand variable ranges.
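# Block 1: Load dataset
# The Kaggle download link requires a login, so pd.read_csv cannot read it directly.
# Download the CSV manually from
# https://www.kaggle.com/datasets/shudhanshusingh/real-estate-properties-dataset
# The local file name below is an assumption; change it to match your download.
df = pd.read_csv("real_estate_properties.csv")
# Assume the dataset has an added column 'Development_Cost'
print(df.head())
df.info()  # info() prints its own report and returns None
print(df.describe())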
Data Preprocessing
Categorical fields (Locality, Property_Type) are one‑hot encoded. We drop any records missing key numeric features (land area, built‑up area, floors, income indicators) or the target. We then split the remaining features (X) and the response (y) into 80% for training and 20% for testing.
# Block 2: Clean & encode features
# One-hot encode categorical locality or property type if present
categorical_cols = [c for c in ['Locality', 'Property_Type'] if c in df.columns]
# dtype=float keeps the dummy columns numeric so statsmodels OLS accepts them
df_enc = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=float)
# Drop rows with missing key columns
df_enc = df_enc.dropna(subset=[
    'Land_Area', 'BuiltUp_Area', 'Num_Floors',
    'Median_Income', 'Proximity_Amenities', 'Development_Cost'
])
# Define predictors and target
X = df_enc.drop([
    'Property_ID', 'Sale_Price', 'Purchase_Price', 'Development_Cost'
], axis=1, errors='ignore')
y = df_enc['Development_Cost']
# Train–test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
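To make the encoding step concrete, here is a minimal sketch on a toy frame. The locality names and the `toy` and `toy_enc` variables are hypothetical, invented only for illustration, and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the locality names are made up for illustration.
toy = pd.DataFrame({
    "Locality": ["North", "South", "West", "South"],
    "Land_Area": [1200.0, 950.0, 1500.0, 1100.0],
})

# drop_first=True drops the first category in sorted order ("North"), which
# becomes the baseline, so the dummies are not collinear with the intercept.
# dtype=float keeps the new columns numeric for OLS.
toy_enc = pd.get_dummies(toy, columns=["Locality"], drop_first=True, dtype=float)

print(toy_enc.columns.tolist())
```

A row with zeros in both dummy columns belongs to the baseline category, so each dummy coefficient in the later regression reads as a cost difference relative to that baseline.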
Stepwise Regression Function
The stepwise_selection function alternates between forward inclusion (adding the excluded predictor with the lowest p‑value below 0.01) and backward elimination (removing the included predictor with the highest p‑value above 0.05) until convergence, yielding a concise set of significant cost drivers.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Use None instead of a mutable default list; start empty if none is given
    included = list(initial_list) if initial_list else []
    while True:
        changed = False
        # Forward step: try each excluded predictor and record its p-value
        excluded = [c for c in X.columns if c not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")
        # Backward step: refit and drop the weakest included predictor
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.iloc[1:]  # exclude intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_var = pvals.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
Using the selected features, we fit an Ordinary Least Squares regression via statsmodels. The summary provides coefficient estimates, p‑values, R², and diagnostic metrics, clarifying each predictor’s marginal impact on development cost.
We predict on the held‑out test set and compute R² (variance explained) and RMSE (root‑mean‑square error) to quantify how well the model generalises to new projects.
# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)
# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
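# Predict on test set
# has_constant='add' forces the intercept column even if a selected feature
# (for example, a rare dummy) happens to be constant in the test split
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')
y_pred = model.predict(X_test_sel)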
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
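For readers who want to see what the two metrics measure, the following sketch computes them by hand with NumPy. The four actual and predicted values are hypothetical numbers chosen for easy arithmetic, not results from the model.

```python
import numpy as np

# Hypothetical actual and predicted costs, invented purely for illustration.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2 = 1 - ss_res / ss_tot                         # share of variance explained
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))   # typical error, in cost units
print(r2, rmse)
```

Here every prediction is off by 10, so the RMSE is 10 in the units of the target, while R² is about 0.968 because those errors are small compared with the spread of the actual values.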
Residual Diagnostics
A scatter plot of residuals versus predicted costs helps reveal non-random patterns or heteroscedasticity. A structureless band around zero is consistent with the OLS assumptions, while a funnel shape or a curve suggests they are violated.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Development Cost")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
Summary
By applying stepwise regression to real estate development data, we isolate the factors that most influence cost, which may include land area, built-up area, number of floors, and local median income, while pruning redundant variables. The final linear model aims to balance interpretability (few, statistically significant predictors) with predictive performance (judged by test-set R² and RMSE), giving developers a transparent tool to forecast project costs, improve budget accuracy, and optimise financial planning.