Property Investment Return Prediction in ML
Real estate investors seek to understand not just the average home-sale price but also the distribution of potential returns—planning for conservative (25th percentile), typical (50th percentile), and upside (75th percentile) outcomes.
In this project, we’ll predict the 25th, 50th, and 75th percentiles of sale price (USD) for residential properties in King County, WA. Using features such as living area, lot size, bedrooms, bathrooms, year built, and location coordinates, we’ll fit separate quantile regression models to reveal how each factor’s impact shifts across the price distribution—enabling investors to perform risk‐aware valuation and portfolio stress‐testing.
Libraries Required
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf  # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss  # Quantile-specific loss metric
Dataset
House Sales in King County, USA
Step-by-Step Code Implementation
Load & Inspect Data
We load roughly 21,000 home-sale records (May 2014 to May 2015) with sale price and property attributes. A first look with .head(), .info(), and .describe() checks column types and missing values and shows how skewed the price distribution is.
# Load the King County house sales data
df = pd.read_csv("kc_house_data.csv")
# Peek at schema and price distribution
print(df.head())
print(df.info())
print(df['price'].describe())
Preprocessing & Feature Selection
- We rename columns to readable names for the model formula and drop rows that are missing any of the fields we use.
- We define eight numeric predictors—living area, lot size, bedroom/bathroom counts, floors, construction year, and coordinates—forming our feature matrix.
# Rename columns for clarity
df = df.rename(columns={
    'price': 'Price',
    'sqft_living': 'LivingArea',
    'sqft_lot': 'LotSize',
    'bedrooms': 'Beds',
    'bathrooms': 'Baths',
    'floors': 'Floors',
    'yr_built': 'YearBuilt',
    'lat': 'Latitude',
    'long': 'Longitude'
})
# Drop records with missing critical fields
df = df.dropna(subset=[
    'LivingArea', 'LotSize', 'Beds', 'Baths',
    'Floors', 'YearBuilt', 'Latitude', 'Longitude', 'Price'
])
# Select features and response
features = [
    'LivingArea', 'LotSize', 'Beds', 'Baths',
    'Floors', 'YearBuilt', 'Latitude', 'Longitude'
]
data = df[features + ['Price']]
Train/Test Split
A random 80/20 split reserves 20% of records for out-of-sample evaluation, so we can check how well the quantile models generalise to unseen sales.
# Hold out 20% for out‑of‑sample evaluation
train, test = train_test_split(
    data, test_size=0.2, random_state=42
)
Fit Quantile Regression Models
For each quantile (25th, 50th, 75th):
- We build a formula string (“Price ~ LivingArea + LotSize + …”).
- We fit a QuantReg model at that percentile on the training set.
- We print the coefficient table, showing how each predictor’s marginal effect on price changes across the lower quartile, median, and upper quartile of the price distribution.
quantiles = [0.25, 0.50, 0.75]
models = {}
formula = "Price ~ " + " + ".join(features)
for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])  # coefficient estimates
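Because the three models are fitted independently, nothing forces their predictions to stay ordered for every property: a 25th-percentile prediction can occasionally exceed the median prediction (often called quantile crossing). The following is a minimal sketch of a sanity check, not part of the original workflow. `count_crossings` is a hypothetical helper and the prices are made-up numbers; with the fitted models above, the inputs would come from each model's `predict` call.

```python
def count_crossings(preds_by_quantile):
    """Count rows whose predictions are not non-decreasing across quantiles.

    preds_by_quantile maps a quantile level (e.g. 0.25) to a list of
    predicted prices, one per property, all lists the same length.
    """
    levels = sorted(preds_by_quantile)
    # Re-arrange into one tuple of predictions per property, ordered by level
    rows = zip(*(preds_by_quantile[q] for q in levels))
    return sum(
        any(lo > hi for lo, hi in zip(row, row[1:]))
        for row in rows
    )


# Made-up predictions for three properties (illustrative only)
example_preds = {
    0.25: [310_000, 455_000, 620_000],
    0.50: [400_000, 450_000, 700_000],  # second property: median below 25th pct
    0.75: [520_000, 580_000, 910_000],
}

n_crossed = count_crossings(example_preds)
print(f"Properties with crossed quantiles: {n_crossed}")
```

If crossings do show up, one common practical remedy is to sort each property's predictions across quantile levels before using them; whether that is appropriate depends on how the ranges will be used.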
Evaluation with Pinball Loss
- We predict quantile‐specific prices on the test set.
- We compute the pinball loss for each model. This loss is tailored to quantile forecasts: it penalises under- and over-predictions asymmetrically according to the target percentile. Lower pinball loss means a more accurate forecast of that quantile, but losses are only comparable between models that target the same quantile.
for q, res in models.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Price'], preds, alpha=q)
    print(f"{int(q*100)}th percentile pinball loss: {loss:.2f}")
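To make the asymmetry concrete, here is a small pure-Python sketch of the pinball loss together with an empirical coverage check. `pinball_loss` and `coverage` are hypothetical helpers written for illustration (the loss follows the same definition that `mean_pinball_loss` uses), and all prices are made-up numbers.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball loss for quantile level alpha (0 < alpha < 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        if y >= p:
            total += alpha * (y - p)        # under-prediction
        else:
            total += (1 - alpha) * (p - y)  # over-prediction
    return total / len(y_true)


def coverage(y_true, y_pred):
    """Share of actual values at or below the predicted quantile."""
    return sum(y <= p for y, p in zip(y_true, y_pred)) / len(y_true)


actual = [500_000]

# Under-predicting by 50,000 is cheap at the 25th percentile, costly at the 75th
under_q25 = pinball_loss(actual, [450_000], alpha=0.25)
under_q75 = pinball_loss(actual, [450_000], alpha=0.75)

# Over-predicting by 50,000 reverses the penalties
over_q25 = pinball_loss(actual, [550_000], alpha=0.25)
over_q75 = pinball_loss(actual, [550_000], alpha=0.75)

print(under_q25, under_q75, over_q25, over_q75)

# Coverage: for a well-calibrated 25th-percentile model, roughly 25% of
# actual prices should fall at or below the prediction
prices = [300_000, 420_000, 510_000, 650_000]
q25_preds = [320_000, 400_000, 480_000, 600_000]
print(coverage(prices, q25_preds))
```

On the real test set, the same coverage check could be written as `(test['Price'] <= preds).mean()` and compared with the target quantile `q`. Coverage far from `q` suggests the model is miscalibrated even when its pinball loss looks reasonable.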
Summary
Quantile regression equips real‐estate investors with distribution‐aware valuation tools:
- The 25th‑percentile model yields conservative price estimates—appropriate for underwriting and stress-testing downside risk.
- The median (50th‑percentile) model forecasts typical market prices for standard appraisal.
- The 75th‑percentile model highlights upside potential—guiding acquisition strategies in high‐growth submarkets.
By modelling multiple price quantiles, investors can perform risk‐sensitive portfolio analysis, allocate capital across scenarios, and make informed buy/sell decisions under market variability.