Real Estate Investment Prediction using Quantile Regression in ML
Real estate investors seek not just an average home price forecast but a view into the distribution of possible outcomes—understanding both conservative valuations (25th percentile) and upside potential (75th percentile).
In this real estate investment prediction project, we’ll predict the 25th, 50th, and 75th percentiles of residential property sale prices in King County, WA, from features such as living area, lot size, bedroom and bathroom counts, number of floors, year built, and geographic coordinates.
By fitting separate quantile regression models for each target percentile, we’ll uncover how each factor’s influence shifts across the valuation spectrum—equipping investors with distribution‐aware price forecasts for risk‐adjusted decision‐making.
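For reference, here is the standard textbook formulation of linear quantile regression (it is not spelled out in the original walkthrough). Here \tau is the target quantile level (0.25, 0.50, or 0.75), y_i is a sale price, and x_i is the feature vector for home i. The coefficients at level \tau minimise the check (pinball) loss:

```latex
\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^\top \beta\right),
\qquad
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr)
```

At \tau = 0.5 the loss reduces to \rho_{0.5}(u) = |u|/2, so the median model is a least-absolute-deviations fit. For \tau < 0.5, over-predictions cost more than under-predictions, which pulls the fitted line toward the lower part of the price distribution.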
Libraries Required
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf  # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts
Dataset
House Sales in King County, USA
Step-by-Step Code Implementation
Load & Inspect Data
We import ~21,000 home‐sale records (May 2014–May 2015) with variables like sale price, living area, lot size, room counts, year built, and coordinates. Initial .info() and .describe() confirm structure and price distribution.
# Load the King County house sales data
# Source: Kaggle – House Sales in King County, USA
df = pd.read_csv("kc_house_data.csv")
# Peek at the data and basic stats
print(df.head())
df.info()  # info() prints its own summary and returns None, so no print() is needed
print(df['price'].describe())
Preprocessing & Feature Selection
We rename columns for clarity, select eight numeric predictors (LivingArea, LotSize, Beds, Baths, Floors, YearBuilt, Latitude, Longitude), and drop any incomplete rows.
# Rename for brevity
df = df.rename(columns={
'price':'Price',
'sqft_living':'LivingArea',
'sqft_lot':'LotSize',
'bedrooms':'Beds',
'bathrooms':'Baths',
'floors':'Floors',
'yr_built':'YearBuilt',
'lat':'Latitude',
'long':'Longitude'
})
# Select predictors and target, drop any missing
features = [
'LivingArea','LotSize','Beds','Baths',
'Floors','YearBuilt','Latitude','Longitude'
]
df = df.dropna(subset=features + ['Price'])
Train/Test Split
We randomly reserve 20% of the data for out‐of‐sample testing, so the quantile models are evaluated on sales they did not see during fitting.
# Hold out 20% for evaluation
train, test = train_test_split(
df[features + ['Price']],
test_size=0.2,
random_state=42
)
Fit Quantile Regression Models
For each quantile (25th, 50th, 75th):
- We construct a formula (“Price ~ LivingArea + LotSize + …”).
- We fit a QuantReg model on the training set at that quantile.
- We print the coefficient table only, showing how each feature’s marginal effect on price shifts across the lower, median, and upper tails of the price distribution.
quantiles = [0.25, 0.50, 0.75]
models = {}
formula = "Price ~ " + " + ".join(features)
for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n=== {int(q*100)}th Percentile Coefficients ===")
    print(res.summary().tables[1])  # coefficient table only
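Because each quantile is fitted independently, nothing forces the 25th-percentile prediction for a given home to stay below its median or 75th-percentile prediction. The sketch below shows one simple way to detect and repair such "quantile crossing" by sorting each home's three predictions. The helper names `count_crossings` and `fix_crossing` and the dollar figures are illustrative assumptions, not part of the original walkthrough.

```python
def count_crossings(pred_25, pred_50, pred_75):
    """Count rows where the predicted quantiles are out of order."""
    return sum(
        1 for a, b, c in zip(pred_25, pred_50, pred_75) if not (a <= b <= c)
    )


def fix_crossing(pred_25, pred_50, pred_75):
    """Sort each row's three quantile predictions so they are non-decreasing."""
    fixed = [sorted(row) for row in zip(pred_25, pred_50, pred_75)]
    # Unzip the sorted rows back into three per-quantile lists
    lo, mid, hi = (list(col) for col in zip(*fixed))
    return lo, mid, hi


# Made-up predictions (in dollars) for three homes; the second home crosses
p25 = [300_000.0, 520_000.0, 410_000.0]
p50 = [350_000.0, 500_000.0, 450_000.0]
p75 = [420_000.0, 610_000.0, 480_000.0]

print("crossings before:", count_crossings(p25, p50, p75))
lo, mid, hi = fix_crossing(p25, p50, p75)
print("crossings after:", count_crossings(lo, mid, hi))
```

With the fitted models, the same check could be applied to the three prediction arrays from `res.predict(...)` after converting them to lists. If crossings are frequent, that is usually a sign the linear specification fits poorly in some region of the feature space.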
Evaluation with Pinball Loss
- We predict quantile‐specific prices on the test set.
- We compute pinball loss for each quantile, a proper scoring rule for quantile forecasts that penalises under‑ and over‑predictions according to the target percentile.
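- A lower pinball loss at a given quantile indicates a more accurate forecast of that quantile. Because the penalty weights depend on the quantile, compare losses between competing models at the same quantile rather than across quantiles.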
for q, res in models.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Price'], preds, alpha=q)
    print(f"{int(q*100)}th percentile pinball loss: {loss:.2f}")
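To make the metric concrete, here is a minimal pure-Python sketch of the pinball loss, written to follow the same convention as `mean_pinball_loss` (difference taken as actual minus predicted). It also includes a simple calibration check: the share of actual prices that fall at or below the predicted quantile, which should sit near the target level for a well-calibrated model. The function names and toy numbers are illustrative assumptions.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball loss at quantile level alpha (0 < alpha < 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff >= 0) costs alpha per unit;
        # over-prediction (diff < 0) costs (1 - alpha) per unit.
        total += alpha * diff if diff >= 0 else (alpha - 1) * diff
    return total / len(y_true)


def empirical_coverage(y_true, y_pred):
    """Share of actual values at or below the predicted quantile."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if y <= p)
    return hits / len(y_true)


# Toy example: four actual prices and a constant 25th-percentile forecast
actual = [100.0, 200.0, 300.0, 400.0]
pred_q25 = [150.0, 150.0, 150.0, 150.0]

print(pinball_loss(actual, pred_q25, alpha=0.25))   # 37.5
print(empirical_coverage(actual, pred_q25))         # 0.25
```

In the toy example one of four actual values lies below the forecast, so the empirical coverage of 0.25 matches the 25th-percentile target. On the real test set, a coverage far from the target level would suggest the corresponding model is miscalibrated even if its pinball loss looks reasonable.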
Summary
Quantile regression delivers tail‐specific price forecasts that extend beyond average estimates:
- The 25th-percentile model gives conservative valuations, which are useful for risk-averse lenders or investors seeking bargain opportunities.
- The median (50th‑percentile) model predicts typical market prices for standard appraisal and budgeting.
- The 75th‑percentile model illuminates premium market segments—guiding upside potential analysis for luxury developments or high‐demand neighborhoods.
These distribution-aware forecasts give real-estate investors and analysts a more nuanced understanding of market variability, supporting risk-adjusted investment strategies, robust underwriting, and more informed portfolio allocation.