Real Estate Investment Prediction using Quantile Regression in ML


Real estate investors seek not just an average home price forecast but a view into the distribution of possible outcomes—understanding both conservative valuations (25th percentile) and upside potential (75th percentile).

In this ML project, we'll predict the 25th, 50th, and 75th percentiles of residential property sale prices in King County, WA, based on features such as living area, lot size, bedroom and bathroom counts, number of floors, year built, and geographic coordinates.

By fitting separate quantile regression models for each target percentile, we’ll uncover how each factor’s influence shifts across the valuation spectrum—equipping investors with distribution‐aware price forecasts for risk‐adjusted decision‐making.

Libraries Required

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf     # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts

Dataset

House Sales in King County, USA

Step-by-Step Code Implementation

Load & Inspect Data

We import ~21,000 home‐sale records (May 2014–May 2015) with variables like sale price, living area, lot size, room counts, year built, and coordinates. Initial .info() and .describe() confirm structure and price distribution.

# Load the King County house sales data
# Source: Kaggle – House Sales in King County, USA
df = pd.read_csv("kc_house_data.csv")

# Peek at the data and basic stats
print(df.head())
df.info()  # info() prints its own report and returns None, so no print() wrapper
print(df['price'].describe())

Preprocessing & Feature Selection

We rename columns for clarity, select eight numeric predictors (LivingArea, LotSize, Beds, Baths, Floors, YearBuilt, Latitude, Longitude), and drop any incomplete rows.

# Rename for brevity
df = df.rename(columns={
    'price':'Price',
    'sqft_living':'LivingArea',
    'sqft_lot':'LotSize',
    'bedrooms':'Beds',
    'bathrooms':'Baths',
    'floors':'Floors',
    'yr_built':'YearBuilt',
    'lat':'Latitude',
    'long':'Longitude'
})

# Select predictors and target, drop any missing
features = [
    'LivingArea','LotSize','Beds','Baths',
    'Floors','YearBuilt','Latitude','Longitude'
]
df = df.dropna(subset=features + ['Price'])

Train/Test Split

We randomly reserve 20% of the data for out-of-sample testing, so the quantile models are evaluated on sales they never saw during fitting. Fixing random_state makes the split reproducible.

# Hold out 20% for evaluation
train, test = train_test_split(
    df[features + ['Price']],
    test_size=0.2,
    random_state=42
)

Fit Quantile Regression Models

For each quantile (25th, 50th, 75th):

We construct a formula (“Price ~ LivingArea + LotSize + …”).
We fit a QuantReg model on the training set at that quantile.
We print the coefficient table only, showing how each feature's marginal effect on price shifts between the lower quartile, the median, and the upper quartile of the price distribution.

quantiles = [0.25, 0.50, 0.75]
models    = {}
formula   = "Price ~ " + " + ".join(features)

for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n=== {int(q*100)}th Percentile Coefficients ===")
    print(res.summary().tables[1])   # coefficient table only
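The three printed tables are easier to compare when the coefficients sit side by side. The helper below is a minimal sketch of that idea: `coefficient_rows` is a hypothetical name, and the demo objects with made-up coefficients only stand in for fitted results. It assumes each result exposes `.params` as a mapping from term name to coefficient, which is how statsmodels formula results present them (as a pandas Series).

```python
from types import SimpleNamespace

def coefficient_rows(models):
    """Return (term, {quantile: coefficient}) pairs from fitted results.

    `models` maps a quantile level to an object with a `.params` mapping
    of term name -> coefficient.
    """
    quantiles = sorted(models)
    terms = list(models[quantiles[0]].params.keys())
    return [
        (term, {q: float(models[q].params[term]) for q in quantiles})
        for term in terms
    ]

# Stand-in results with invented coefficients, for illustration only
demo_models = {
    0.25: SimpleNamespace(params={"Intercept": 1.0, "LivingArea": 150.0}),
    0.75: SimpleNamespace(params={"Intercept": 2.0, "LivingArea": 260.0}),
}

rows = coefficient_rows(demo_models)
for term, coefs in rows:
    # One line per term, one column per quantile
    print(f"{term:<12}" + "".join(f"{coefs[q]:>12.1f}" for q in sorted(coefs)))
```

Passing the tutorial's `models` dictionary instead of `demo_models` should produce one row per feature. With pandas already imported, `pd.DataFrame({q: res.params for q, res in models.items()})` is a shorter route to the same table.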

Evaluation with Pinball Loss

We predict quantile‐specific prices on the test set.
We compute pinball loss for each quantile, a proper scoring rule for quantile forecasts that penalises under‑ and over‑predictions according to the target percentile.
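Lower pinball loss at a given quantile indicates a more accurate forecast of that quantile. Because the penalty weights depend on the quantile level, compare losses between models at the same quantile rather than across quantiles.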
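To make the asymmetric penalty concrete, here is a small hand-rolled version of the pinball loss. It is a sketch for illustration: the function name `pinball_loss` and the four prices are invented, not taken from the dataset.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss for a single quantile level alpha."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff > 0) costs alpha per unit of error;
        # over-prediction (diff < 0) costs (1 - alpha) per unit of error.
        total += alpha * diff if diff >= 0 else (alpha - 1) * diff
    return total / len(y_true)

# Hypothetical sale prices and forecasts
actual   = [300_000, 450_000, 500_000, 620_000]
forecast = [280_000, 470_000, 450_000, 600_000]

# The same forecasts scored as a 25th- and as a 75th-percentile prediction
loss_q25 = pinball_loss(actual, forecast, alpha=0.25)
loss_q75 = pinball_loss(actual, forecast, alpha=0.75)
print(loss_q25, loss_q75)
```

These forecasts mostly sit below the actual prices, so they score better as a 25th-percentile prediction than as a 75th-percentile one. The evaluation loop that follows applies the same calculation to the fitted models through scikit-learn's `mean_pinball_loss`.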

for q, res in models.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['Price'], preds, alpha=q)
    print(f"{int(q*100)}th percentile pinball loss: {loss:.2f}")
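Pinball loss scores each quantile on its own. Two complementary checks look at the 25th and 75th percentile forecasts together: how often the actual price lands inside the band, and whether the separately fitted models ever cross (a lower forecast above the upper one). The sketch below uses hypothetical helper names and invented numbers in place of `test['Price']` and the model predictions.

```python
def band_coverage(y_true, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)

def count_crossings(lower, upper):
    """Number of rows where the lower forecast exceeds the upper forecast."""
    return sum(1 for lo, hi in zip(lower, upper) if lo > hi)

# Invented values standing in for actual prices and the
# 25th/75th-percentile predictions on the test set
actual = [300_000, 450_000, 500_000, 620_000]
q25    = [280_000, 460_000, 430_000, 640_000]
q75    = [350_000, 520_000, 560_000, 630_000]

coverage  = band_coverage(actual, q25, q75)
crossings = count_crossings(q25, q75)
print(coverage, crossings)
```

On real data, the inputs would be `test['Price']`, `models[0.25].predict(test[features])`, and `models[0.75].predict(test[features])`. A well-calibrated 25th to 75th percentile band should contain roughly half of the test prices, and any crossings flag rows where the two independently fitted models disagree about ordering.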

Summary

Quantile regression delivers tail‐specific price forecasts that extend beyond average estimates:

The 25th-percentile model highlights conservative valuations, which helps risk-averse lenders and investors seeking bargain opportunities.
The median (50th‑percentile) model predicts typical market prices for standard appraisal and budgeting.
The 75th‑percentile model illuminates premium market segments—guiding upside potential analysis for luxury developments or high‐demand neighborhoods.

These distribution-aware forecasts give real-estate investors and analysts a nuanced understanding of market variability, supporting risk-adjusted investment strategies, robust underwriting, and more informed portfolio allocation.
