Real Estate Rental Cost Prediction Using Stepwise Regression in ML
Property brokers and owners need accurate forecasts of monthly rental rates to set competitive prices, assess investment opportunities, and optimize portfolio returns.
In this real estate rental cost prediction project, we predict the monthly rent of residential listings from the property attributes available in the dataset: number of bedrooms (BHK), square footage, floor, area type, furnishing status, and location.
By applying stepwise regression, we’ll identify the most significant drivers of rental cost and build a parsimonious linear model that balances interpretability with predictive accuracy—helping stakeholders price units more effectively.
Libraries Required
import pandas as pd  # Data loading & manipulation
import numpy as np  # Numerical operations
import statsmodels.api as sm  # Ordinary Least Squares regression
from sklearn.model_selection import train_test_split  # Train/test split
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics
import matplotlib.pyplot as plt  # Visualization
Dataset
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load a dataset of 4,700+ rental listings covering key features—BHK, Size_sqft, Floor, Area_Type, Location, Furnishing, and Rent. Initial inspection (.info(), .describe()) checks data completeness and distributions.
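# Block 1: Load dataset
# House Rent Prediction Dataset (Kaggle):
# https://www.kaggle.com/datasets/iamsouravbanerjee/house-rent-prediction-dataset
# Kaggle downloads require a login, so pd.read_csv cannot read the dataset URL directly.
# Download the CSV manually first; the local file name below is an assumption.
csv_path = "House_Rent_Dataset.csv"
df = pd.read_csv(csv_path)
print(df.head())
df.info()  # info() prints its own report and returns None
print(df.describe())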
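As a minimal sketch of what this inspection is looking for, the snippet below counts missing values per column and compares the mean and median of the target on a small made-up frame. The column names follow the tutorial, but the values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy listings with invented values, using the tutorial's column names
toy = pd.DataFrame({
    "BHK": [1, 2, 3, 2, np.nan],
    "Size_sqft": [450, 800, 1200, 750, 600],
    "Rent": [8000, 15000, 60000, 14000, 9000],
})

# Completeness check: number of missing values in each column
missing = toy.isna().sum()
print(missing)

# A mean well above the median hints at a right-skewed target
rent_mean = toy["Rent"].mean()
rent_median = toy["Rent"].median()
print(f"mean={rent_mean:.0f}, median={rent_median:.0f}")
```

A strongly skewed rent distribution is worth noting at this stage, since a few very expensive listings can dominate an ordinary least squares fit.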
Data Preprocessing
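We simplify the Area_Type labels, map Furnishing to a binary indicator (1 for fully furnished, 0 otherwise), and drop records with missing values. One-hot encoding converts the categorical Area_Type and the high-cardinality Location columns into dummy variables, dropping one level of each to avoid perfect collinearity with the intercept. The feature matrix X holds every column except Rent, which is our response y. Finally, the data is split 80/20 into training and test sets.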
# Block 2: Clean & encode features
# Select relevant columns and drop missing entries
cols = ['BHK', 'Size_sqft', 'Floor', 'Area_Type', 'Location', 'Furnishing', 'Rent']
df = df[cols].dropna().copy()
# Simplify Area_Type labels; replace() leaves any unlisted label unchanged,
# whereas map() would silently turn it into NaN
df['Area_Type'] = df['Area_Type'].replace({'Super built-up Area': 'SuperBuiltUp',
                                           'Built-up Area': 'BuiltUp',
                                           'Carpet Area': 'Carpet'})
# Binary furnishing flag: 1 = fully furnished, 0 = semi-furnished or unfurnished
df['Furnishing'] = (df['Furnishing'] == 'Full').astype(int)
# One-hot encode categorical predictors; dtype=float keeps the matrix numeric,
# since recent pandas versions return boolean dummies that statsmodels rejects
df_enc = pd.get_dummies(df,
                        columns=['Area_Type', 'Location'],
                        drop_first=True,
                        dtype=float)
# Define predictors and target
X = df_enc.drop('Rent', axis=1)
y = df_enc['Rent']
# Train–test split (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
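To see what the one-hot encoding step produces, here is a small sketch on a three-row toy frame (invented values). It only illustrates how drop_first=True removes the first category level, which then acts as the baseline absorbed by the intercept.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (invented values)
toy = pd.DataFrame({
    "Size_sqft": [450, 800, 1200],
    "Area_Type": ["Carpet", "BuiltUp", "SuperBuiltUp"],
})

# Levels are sorted alphabetically, so 'BuiltUp' is dropped and becomes the baseline
encoded = pd.get_dummies(toy, columns=["Area_Type"], drop_first=True, dtype=float)
print(encoded.columns.tolist())
# ['Size_sqft', 'Area_Type_Carpet', 'Area_Type_SuperBuiltUp']
```

The 'BuiltUp' row has zeros in both dummy columns, so each dummy coefficient in the regression is read as a difference relative to that baseline level.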
Stepwise Regression Function
The stepwise_selection function performs:
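- Forward inclusion: adds the excluded predictor with the lowest p‑value, provided that p‑value is below 0.01.
- Backward elimination: removes the included predictor with the highest p‑value, provided that p‑value is above 0.05.
Iteration continues until neither step changes the model, yielding a concise subset of significant variables. Keeping the entry threshold below the removal threshold prevents a variable from being added and dropped repeatedly.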
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Keep threshold_in < threshold_out so a variable cannot be added and dropped in an endless loop
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # Forward step: test addition of each excluded predictor
        excluded = [col for col in X.columns if col not in included]
        pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            pvals[col] = sm.OLS(y, sm.add_constant(X[included + [col]])).fit().pvalues[col]
        best_pval = pvals.min()
        if best_pval < threshold_in:
            best_var = pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")
        # Backward step: test removal of each included predictor
        # (skipped while no predictor has been included yet)
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals_in = model.pvalues.drop('const', errors='ignore')  # exclude intercept
            worst_pval = pvals_in.max()
            if worst_pval > threshold_out:
                worst_var = pvals_in.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
- We fit an Ordinary Least Squares regression via statsmodels on the selected features. The .summary() provides coefficient estimates, p‑values, R², and diagnostic statistics, clarifying each predictor’s impact on rent.
- Predictions on the held‑out test set produce R² (variance explained) and RMSE (root‑mean‑square error), quantifying how well the model generalizes.
# Block 4: Select features
selected = stepwise_selection(X_train, y_train)
# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected], has_constant='add')
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set
# has_constant='add' forces the intercept column even if a selected dummy
# happens to be constant within the test split
X_test_sel = sm.add_constant(X_test[selected], has_constant='add')
y_pred = model.predict(X_test_sel)
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
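For readers who want to see what the two metrics measure, the sketch below computes R² and RMSE by hand with NumPy on four invented values. It uses the standard definitions, R² = 1 − SS_res / SS_tot and RMSE = sqrt(mean squared error), which are the same quantities that r2_score and the square root of mean_squared_error return.

```python
import numpy as np

# Invented actual and predicted values, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean

r2 = 1.0 - ss_res / ss_tot                      # share of variance explained
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))  # typical error, in the units of y

print(f"R2={r2:.3f}, RMSE={rmse:.3f}")
# R2=0.948, RMSE=2.550
```

Because RMSE is expressed in the same units as the rent, it is usually the easier of the two to communicate to stakeholders.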
Residual Diagnostics
A residual vs. predicted plot checks for non‑random patterns or heteroscedasticity, validating the linear model’s assumptions.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, linestyle='--')
plt.xlabel("Predicted Rent (INR)")  # the Kaggle dataset lists rents in Indian rupees
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Rent")
plt.show()
Summary
Using stepwise regression on a large rental-listing dataset, we narrow the candidate predictors (unit size, number of bedrooms, floor, area type, location dummies, and furnishing status) down to those that are statistically significant, pruning redundant features along the way.
The final linear model favors transparency: it keeps few, statistically significant predictors, and its predictive accuracy can be judged directly from the test-set R² and RMSE. This gives landlords and real estate analysts a straightforward tool to forecast rental rates and inform pricing strategies.