Real Estate Rental Cost Prediction using Stepwise regression in ML

Property brokers and owners need accurate forecasts of monthly rental rates to set competitive prices, assess investment opportunities, and optimize portfolio returns.

In this machine learning project, we predict the monthly rent of residential listings from property attributes such as number of bedrooms (BHK), square footage, floor, area type, furnishing status, and location.

By applying stepwise regression, we’ll identify the most significant drivers of rental cost and build a parsimonious linear model that balances interpretability with predictive accuracy—helping stakeholders price units more effectively.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Ordinary Least Squares regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

House Rent Prediction Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load a dataset of 4,700+ rental listings covering key features—BHK, Size_sqft, Floor, Area_Type, Location, Furnishing, and Rent. Initial inspection (.info(), .describe()) checks data completeness and distributions.

# Block 1: Load dataset
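# House Rent Prediction Dataset (Kaggle):
# https://www.kaggle.com/datasets/iamsouravbanerjee/house-rent-prediction-dataset/download
# Kaggle downloads need a signed-in account, so pd.read_csv cannot read this URL directly.
# Download the CSV manually and point csv_path at the local copy.
csv_path = "House_Rent_Dataset.csv"  # assumed local filename; adjust to your download
df = pd.read_csv(csv_path)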

print(df.head())
print(df.info())
print(df.describe())
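
As a rough companion to .info(), the sketch below counts missing values per column. It runs on a small made-up frame, so the toy values and the name report are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy listings with a couple of gaps (values are invented for illustration)
toy = pd.DataFrame({
    "BHK": [2, 3, np.nan, 1],
    "Size_sqft": [850, 1200, 600, np.nan],
    "Rent": [15000, 25000, 9000, 7000],
})

# Count and share of missing entries per column
report = pd.DataFrame({
    "missing": toy.isna().sum(),
    "share": toy.isna().mean().round(2),
})
print(report)
```

Applied to the real frame (report built from df instead of toy), this shows how many rows a later dropna() would discard.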

Data Preprocessing

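We keep only the relevant columns, drop records with missing values, simplify the Area_Type labels, and map Furnishing to a binary indicator (1 for fully furnished, 0 otherwise). One-hot encoding then converts the categorical Area_Type and the high-cardinality Location into dummy variables, dropping the first level of each to avoid perfect collinearity with the intercept. The cleaned feature matrix X excludes Rent, which is our response y, and the data is split 80/20 into training and test sets.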

# Block 2: Clean & encode features
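# Select relevant columns and drop missing entries
# NOTE: these column names follow this tutorial; rename the columns of your
# CSV to match if its headers differ. Floor is assumed to be numeric.
cols = ['BHK', 'Size_sqft', 'Floor', 'Area_Type', 'Location', 'Furnishing', 'Rent']
df = df[cols].dropna().copy()

# Simplify Area_Type and Furnishing
# Collapse repeated spaces first so the mapping keys match reliably
df['Area_Type'] = (df['Area_Type'].str.replace(r'\s+', ' ', regex=True).str.strip()
                   .map({'Super built-up Area': 'SuperBuiltUp',
                         'Built-up Area': 'BuiltUp',
                         'Carpet Area': 'Carpet'}))
# Furnishing becomes a binary flag: 1 = fully furnished, 0 = semi/unfurnished
df['Furnishing'] = df['Furnishing'].map({'Full': 1, 'Semi': 0, 'Unfurnished': 0})
# Labels missing from the mappings become NaN; drop them so OLS sees no NaNs
df = df.dropna(subset=['Area_Type', 'Furnishing'])

# One-hot encode categorical predictors
# dtype=float keeps the design matrix numeric for statsmodels
df_enc = pd.get_dummies(df,
                        columns=['Area_Type','Location'],
                        drop_first=True,
                        dtype=float)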

# Define predictors and target
X = df_enc.drop('Rent', axis=1)
y = df_enc['Rent']

# Train–test split (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
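
To make the effect of drop_first concrete, here is a minimal sketch on a made-up four-row frame. The toy values are illustrative only.

```python
import pandas as pd

# Four invented listings with three Area_Type levels
toy = pd.DataFrame({
    "Area_Type": ["Carpet", "BuiltUp", "SuperBuiltUp", "Carpet"],
    "Rent": [12000, 15000, 18000, 11000],
})

# drop_first=True removes the first level in sorted order ("BuiltUp"),
# which then acts as the baseline absorbed by the intercept
encoded = pd.get_dummies(toy, columns=["Area_Type"], drop_first=True, dtype=float)
print(encoded)
```

Three levels yield two dummy columns. A row with zeros in both belongs to the dropped baseline level, so each dummy coefficient in the regression is read as a difference from that baseline.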

Stepwise Regression Function

The stepwise_selection function performs:

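Forward inclusion: adds the excluded predictor with the lowest p‑value, provided that p‑value is below 0.01.
Backward elimination: removes the included predictor with the highest p‑value, provided that p‑value exceeds 0.05.

Iteration continues until neither step changes the model, yielding a concise subset of significant variables. The entry threshold is kept stricter than the removal threshold so that a variable that has just been added is not immediately dropped.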

# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Keep threshold_in < threshold_out; otherwise a variable could be added
    # and dropped again indefinitely.
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # Forward step: test addition of each excluded predictor
        excluded = [col for col in X.columns if col not in included]
        pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            design = sm.add_constant(X[included + [col]], has_constant='add')
            pvals[col] = sm.OLS(y, design).fit().pvalues[col]
        best_pval = pvals.min()  # NaN (so no addition) when nothing is left
        if best_pval < threshold_in:
            best_var = pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")
        # Backward step: test removal of each included predictor
        if included:
            design = sm.add_constant(X[included], has_constant='add')
            pvals_in = sm.OLS(y, design).fit().pvalues.drop('const')  # exclude intercept
            worst_pval = pvals_in.max()
            if worst_pval > threshold_out:
                worst_var = pvals_in.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included

Model Building & Evaluation

We fit an Ordinary Least Squares regression via statsmodels on the selected features. The .summary() provides coefficient estimates, p‑values, R², and diagnostic statistics, clarifying each predictor’s impact on rent.
Predictions on the held‑out test set produce R² (variance explained) and RMSE (root‑mean‑square error), quantifying how well the model generalizes.

# Block 4: Select features
selected = stepwise_selection(X_train, y_train)

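# Fit final OLS model
# has_constant='add' forces an intercept column even if a selected dummy
# happens to be constant within this split
X_train_sel = sm.add_constant(X_train[selected], has_constant='add')
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set
X_test_sel = sm.add_constant(X_test[selected], has_constant='add')
y_pred = model.predict(X_test_sel)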

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
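
For readers who want to see what the two metrics measure, here is a small worked example that computes them by hand with NumPy. The four rent values are invented for illustration.

```python
import numpy as np

# Invented actual and predicted rents
y_true = np.array([10000.0, 15000.0, 20000.0, 30000.0])
y_hat = np.array([12000.0, 14000.0, 21000.0, 27000.0])

resid = y_true - y_hat
ss_res = np.sum(resid ** 2)                     # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean

r2 = 1 - ss_res / ss_tot               # share of variance explained
rmse = np.sqrt(ss_res / len(y_true))   # typical error, in rent units

print(f"R2 = {r2:.4f}, RMSE = {rmse:.2f}")
```

RMSE is expressed in the same units as rent, which makes it easier to communicate than R² when the question is how far off a typical prediction is.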

Residual Diagnostics

A residuals‑versus‑predicted plot helps reveal non‑random patterns or heteroscedasticity, either of which would signal that the linear model’s assumptions are violated.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, linestyle='--')
plt.xlabel("Predicted Rent")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Rent")
plt.show()
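
A rough numeric companion to the plot is to compare the residual spread for low and high predictions. The sketch below does this on simulated values whose spread is deliberately made to grow with the prediction; it is a heuristic under those assumptions, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated predictions and residuals; the residual standard deviation is
# set to 10% of the prediction, so the data is heteroscedastic by construction
pred = rng.uniform(5000, 50000, 400)
resid = rng.normal(0, 0.1 * pred)

# Sort by prediction and split the residuals into a low and a high half
order = np.argsort(pred)
low, high = np.array_split(resid[order], 2)

spread_ratio = high.std() / low.std()
print(f"std(high half) / std(low half) = {spread_ratio:.2f}")
```

A ratio close to 1 is consistent with constant variance, while a ratio well above 1 matches the funnel shape that heteroscedastic residuals produce in the plot.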

Summary

Using stepwise regression on a large rental‐listing dataset, we isolate the most influential factors—such as unit size, number of bedrooms, location dummies, and furnishing status—while pruning redundant features.

The final linear model aims to balance transparency (few, statistically significant predictors) with predictive accuracy, which the test‐set R² and RMSE quantify. The result gives landlords and real estate analysts a straightforward tool to forecast rental rates and refine pricing strategies.