Property Value Prediction using Quantile Regression in ML
Traditional home‐valuation models estimate the mean sale price. Still, lenders, appraisers, and investors need to anticipate the range of plausible prices—understanding both conservative (lower‐quantile) and optimistic (upper‐quantile) outcomes.
In this property value prediction ML project, we'll predict the 25th, 50th, and 75th percentiles of residential property sale prices in King County, WA, from features such as square footage, bedrooms, bathrooms, age (via year built), and location. By fitting a separate linear quantile regression model for each percentile, we'll see how each predictor's influence shifts between the lower quartile, the median, and the upper quartile of the price distribution, giving stakeholders distribution-aware valuations for risk management and opportunity identification.
Libraries Required
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf  # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts
Dataset
House Sales in King County, USA
Step-by-Step Code Implementation
Load & Inspect Data
We load the King County home sales data (roughly 21,600 records with sale prices and property attributes), then inspect its schema and the price distribution (.describe()) to understand the range and dispersion.
# Load the King County house sales dataset
# Source: Kaggle
df = pd.read_csv("kc_house_data.csv")
# Inspect top rows and summary
print(df.head())
df.info()  # prints the schema itself; wrapping it in print() would also print "None"
print(df['price'].describe())
Preprocessing & Feature Engineering
- We rename “price” to “Price” for clarity.
- We select eleven predictors (living area, room counts, floors, waterfront and view indicators, structural ratings, year built, and geographic coordinates) and drop any rows with missing values in those columns or in the target.
# Rename target for ease
df.rename(columns={'price':'Price'}, inplace=True)
# Select key predictors
# sqft_living, bedrooms, bathrooms, floors, waterfront, view, condition, grade, yr_built, lat, long
features = [
    'sqft_living', 'bedrooms', 'bathrooms', 'floors',
    'waterfront', 'view', 'condition', 'grade',
    'yr_built', 'lat', 'long'
]
# Drop any missing values (none expected)
df = df.dropna(subset=features + ['Price'])
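The introduction lists age among the features, while the model uses yr_built directly. If an explicit age is preferred, it could be derived as in the minimal sketch below. It assumes sale dates are strings that begin with the four-digit year (for example "20141013T000000"); the helper name home_age_at_sale and the sample rows are hypothetical and not part of the original pipeline.

```python
def home_age_at_sale(sale_date, yr_built):
    """Age in years at sale, given a date string starting with YYYY and a build year."""
    sale_year = int(sale_date[:4])  # assumes the first four characters are the year
    return sale_year - yr_built


# Made-up (sale date, year built) pairs, for illustration only
rows = [("20141013T000000", 1955), ("20150225T000000", 2014)]
ages = [home_age_at_sale(date, built) for date, built in rows]
print(ages)
```

Because the sales span a short period, age is close to a constant minus yr_built, so in a linear model swapping one for the other mostly flips the sign of the coefficient.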
Train/Test Split
We randomly hold out 20% of records for evaluation, creating train and test sets to assess out‑of‑sample quantile forecasts.
# Reserve 20% of data for out‑of‑sample evaluation
train, test = train_test_split(
    df[features + ['Price']],
    test_size=0.2,
    random_state=42
)
Fit Quantile Regression Models
For each target quantile (25th, 50th, 75th):
- We specify a formula linking Price to all predictors.
- We fit a QuantReg model at that percentile on the training set.
- We print only the coefficient table, which shows how each feature’s marginal effect varies across the lower, median, and upper price distribution (e.g., an extra bathroom may add more value in premium homes than entry‑level ones).
quantiles = [0.25, 0.50, 0.75]
results = {}
formula = "Price ~ " + " + ".join(features)
for q in quantiles:
    model = smf.quantreg(formula, train)
    res = model.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])  # coefficient table only
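To build intuition for what fitting "at a quantile" means: quantile regression chooses coefficients that minimize the pinball (check) loss, the same loss used for evaluation later in this article. The dependency-free sketch below strips the model down to a single constant forecast and shows that the loss-minimizing constant is the corresponding sample quantile. The prices are made up, and the helper names pinball_loss and best_constant are illustrative, not statsmodels functions.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (check) loss of forecasts y_pred at quantile level q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        if y >= p:
            total += q * (y - p)        # under-prediction, weighted by q
        else:
            total += (1 - q) * (p - y)  # over-prediction, weighted by 1 - q
    return total / len(y_true)


def best_constant(prices, q):
    """Observed price that minimizes pinball loss when used as a constant forecast."""
    return min(prices, key=lambda c: pinball_loss(prices, [c] * len(prices), q))


# Toy sale prices in $1000s (made up); 1500 is a deliberate outlier
prices = [100, 200, 300, 400, 1500]

best = {q: best_constant(prices, q) for q in (0.25, 0.50, 0.75)}
mean_price = sum(prices) / len(prices)

print(best)        # loss-minimizing constant per quantile level
print(mean_price)  # the mean is pulled upward by the outlier
```

On this toy data the minimizers are 200, 300, and 400 for the 25th, 50th, and 75th percentiles, while the mean is 500. The full models do the same thing, except that the forecast is a linear function of the features rather than a constant.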
Evaluation with Pinball Loss
- We predict quantile‑specific prices on the held‑out test set.
- We compute the pinball loss for each quantile forecast. This loss is tailored to quantile estimates: it penalizes under-prediction with weight q and over-prediction with weight 1 - q. For a given quantile, a lower pinball loss indicates a more accurate forecast; losses at different quantile levels are not directly comparable with each other.
for q, res in results.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Price'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
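Pinball loss summarizes accuracy in a single number, but it does not directly show whether a forecast sits at the intended level. A simple complementary check is empirical coverage: the share of held-out prices at or below the predicted q-th quantile, which should be close to q. The sketch below is a minimal illustration on made-up numbers; the helper name empirical_coverage is hypothetical.

```python
def empirical_coverage(y_true, y_pred):
    """Fraction of observed values at or below the predicted quantile."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if y <= p)
    return hits / len(y_true)


# Made-up sale prices and 25th-percentile forecasts in $1000s, for illustration only
actual = [310, 450, 520, 275, 640, 390, 705, 480]
pred_q25 = [300, 400, 530, 280, 600, 350, 650, 430]

coverage = empirical_coverage(actual, pred_q25)
print(f"Share of prices at or below the 25% forecast: {coverage:.3f}")
```

With the objects defined above, the same check could be applied inside the evaluation loop as empirical_coverage(test['Price'], preds). A value far from q would suggest the quantile forecast is systematically too high or too low.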
Summary
By modelling the 25th, 50th, and 75th percentiles of property prices rather than only the mean, we gain distribution‑aware valuations:
- The 25th-percentile model highlights features driving lower-end home values, informing conservative appraisals and entry-level market assessments.
- The median (50th-percentile) model captures typical feature impacts for the bulk of the market.
- The 75th‑percentile model focuses on premium segments, showing which upgrades most boost high‑end values.
These tailored quantile estimates empower real‑estate professionals to price properties under varying market conditions—while managing risk for conservative lending and identifying high‑value opportunities in the upper tail.