Property Investment Return Prediction in ML


Real estate investors seek to understand not just the average home-sale price but also the distribution of potential returns—planning for conservative (25th percentile), typical (50th percentile), and upside (75th percentile) outcomes.

In this project, we’ll predict the 25th, 50th, and 75th percentiles of sale price (USD) for residential properties in King County, WA. Using features such as living area, lot size, bedrooms, bathrooms, year built, and location coordinates, we’ll fit separate quantile regression models to reveal how each factor’s impact shifts across the price distribution—enabling investors to perform risk‐aware valuation and portfolio stress‐testing.

Libraries Required

import pandas as pd  
import numpy as np  
import statsmodels.formula.api as smf       # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss  # Quantile‐specific loss metric 

Dataset

House Sales in King County, USA

Step-by-Step Code Implementation

Load & Inspect Data

We load ~21,000 home‐sale records (May 2014–May 2015) with variables such as sale price and property attributes. Calling .head(), .info(), and .describe() lets us check column types and missing values, and shows that the price distribution is right‐skewed.

# Load the King County house sales data 
df = pd.read_csv("kc_house_data.csv")

# Peek at schema and price distribution
print(df.head())
df.info()   # prints its report directly and returns None, so no print() is needed
print(df['price'].describe())
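
To make the skew concrete, one option is to compare the mean with the median and compute the sample skewness. The sketch below uses a synthetic, log-normally distributed series as a stand-in for the price column (an assumption made so the snippet runs without the CSV; the numbers are not King County figures). With the real file loaded, `prices` could simply be replaced by `df['price']`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['price']: log-normal values are right-skewed,
# as house prices typically are. These are NOT the King County figures.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000), name="price")

skewness = prices.skew()   # a positive value means a long right tail
q25, q50, q75 = prices.quantile([0.25, 0.50, 0.75])

print(f"mean     : {prices.mean():,.0f}")
print(f"median   : {q50:,.0f}")
print(f"skewness : {skewness:.2f}")
print(f"25th to 75th percentile: {q25:,.0f} to {q75:,.0f}")
```

A mean well above the median, together with positive skewness, is one reason to model quantiles directly instead of relying on the average alone.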

Preprocessing & Feature Selection

  • We rename columns for concise formulas and drop any incomplete rows.
  • We define eight numeric predictors—living area, lot size, bedroom/bathroom counts, floors, construction year, and coordinates—forming our feature matrix.

# Rename columns for clarity
df = df.rename(columns={
    'price':'Price',
    'sqft_living':'LivingArea',
    'sqft_lot':'LotSize',
    'bedrooms':'Beds',
    'bathrooms':'Baths',
    'floors':'Floors',
    'yr_built':'YearBuilt',
    'lat':'Latitude',
    'long':'Longitude'
})

# Drop records with missing critical fields
df = df.dropna(subset=[
    'LivingArea','LotSize','Beds','Baths',
    'Floors','YearBuilt','Latitude','Longitude','Price'
])

# Select features and response
features = [
    'LivingArea','LotSize','Beds','Baths',
    'Floors','YearBuilt','Latitude','Longitude'
]
data = df[features + ['Price']]

Train/Test Split

A random 80/20 split reserves 20% of records for out‑of‑sample evaluation, so we can check how well the quantile models generalise to sales they were not trained on.

# Hold out 20% for out‑of‑sample evaluation
train, test = train_test_split(
    data, test_size=0.2, random_state=42
)

Fit Quantile Regression Models

For each quantile (25th, 50th, 75th):

  • We build a formula string (“Price ~ LivingArea + LotSize + …”).
  • We fit a QuantReg model at that percentile on the training set.
  • We print the coefficient table, showing how each predictor’s marginal effect on price changes between the lower quartile, the median, and the upper quartile of the price distribution.

quantiles = [0.25, 0.50, 0.75]
models    = {}
formula   = "Price ~ " + " + ".join(features)

for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])   # coefficient estimates
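
Reading three separate summary tables side by side can be awkward, so it may help to collect the coefficients into one table with a column per quantile. The sketch below does this on synthetic data (an assumption: a single `LivingArea` predictor whose price spread grows with area, generated only for illustration). On the real models, the same idea would be `pd.DataFrame({q: res.params for q, res in models.items()})`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic heteroscedastic data (illustrative only): the spread of Price
# grows with LivingArea, so the fitted slope should rise with the quantile.
rng = np.random.default_rng(1)
n = 2000
area = rng.uniform(500, 4000, n)
demo = pd.DataFrame({
    "LivingArea": area,
    "Price": 50_000 + 200 * area + 40 * area * rng.normal(0, 1, n),
})

# One column of coefficients per quantile, one row per model term
coef_table = pd.DataFrame({
    q: smf.quantreg("Price ~ LivingArea", demo).fit(q=q).params
    for q in [0.25, 0.50, 0.75]
})
print(coef_table.round(2))
```

A coefficient that grows from the 0.25 column to the 0.75 column suggests the feature matters more for higher-priced homes; a roughly constant row suggests a uniform shift across the distribution.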

Evaluation with Pinball Loss

  • We predict quantile‐specific prices on the test set.
  • We compute the pinball loss for each model. This loss function is tailored to quantile forecasts: it penalises under- and over-predictions asymmetrically according to the target percentile. A lower pinball loss indicates a more accurate forecast of that quantile; values are only comparable between models that target the same quantile.

for q, res in models.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['Price'], preds, alpha=q)
    print(f"{int(q*100)}th percentile pinball loss: {loss:.2f}")

Summary

Quantile regression equips real‐estate investors with distribution‐aware valuation tools:

  • The 25th‑percentile model yields conservative price estimates—appropriate for underwriting and stress-testing downside risk.
  • The median (50th‑percentile) model forecasts typical market prices for standard appraisal.
  • The 75th‑percentile model highlights upside potential—guiding acquisition strategies in high‐growth submarkets.

By modelling multiple price quantiles, investors can perform risk‐sensitive portfolio analysis, allocate capital across scenarios, and make informed buy/sell decisions under market variability.
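
As a minimal sketch of that workflow, the snippet below produces conservative, typical, and upside estimates for a single property. It is fitted on synthetic data with one predictor, and the 2,000 sq ft `candidate` property is hypothetical; with the tutorial's models, the equivalent call would be `models[q].predict(new_row)` for a one-row DataFrame containing all eight features.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic training data (illustrative only, not the King County records)
rng = np.random.default_rng(2)
n = 2000
area = rng.uniform(500, 4000, n)
demo = pd.DataFrame({
    "LivingArea": area,
    "Price": 50_000 + 200 * area + 40 * area * rng.normal(0, 1, n),
})

# One fitted quantile model per scenario
fitted = {
    q: smf.quantreg("Price ~ LivingArea", demo).fit(q=q)
    for q in (0.25, 0.50, 0.75)
}

# Hypothetical property to value
candidate = pd.DataFrame({"LivingArea": [2000]})

estimates = {q: float(res.predict(candidate).iloc[0]) for q, res in fitted.items()}
conservative, typical, upside = (estimates[q] for q in (0.25, 0.50, 0.75))

print(f"conservative (25th): {conservative:,.0f}")
print(f"typical      (50th): {typical:,.0f}")
print(f"upside       (75th): {upside:,.0f}")
```

Because each quantile is fitted separately, nothing forces the three predictions to stay in order for every input; for unusual properties it is worth checking that the 25th-percentile estimate does not exceed the median before using the range.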


ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.