Property Investment Return Prediction with Lasso & Ridge Mixed Regression in ML

Real-estate investors weigh dozens of attributes (neighbourhood, floor area, property age, walk score) yet still gamble on the eventual return on investment (ROI) of each unit. Ordinary least-squares regression becomes unstable when predictors are strongly correlated, as price, floor area and bedroom count usually are, while a pure Lasso model tends to keep one variable from a correlated group and shrink the rest to zero, even when they matter. Elastic Net blends Ridge's stability with Lasso's automatic feature selection, yielding a sparse yet stable model that forecasts per-property ROI before the offer is signed.

Libraries Required

Purpose	Library
Core data wrangling	pandas, numpy
Visualisation	matplotlib, seaborn
ML pipeline	scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Metrics	mean_squared_error, r2_score

Dataset Link

Real Estate Data with pre‑calculated ROI field 

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

2. Download and load the dataset

Dataset — property listings with sale price, rent, taxes, location, and a ready‑made ROI label (annualised percentage).

# One‑time shell command (Kaggle API key required):
# kaggle datasets download -d affiliatearmy/real-estate-data -p data --unzip

df = pd.read_csv("data/real_estate_data.csv")   # adjust filename if different

3. Quick EDA

print(df.head())
print(df[['ROI','Price','Bedrooms','Bathrooms']].describe())
sns.histplot(df['ROI'], kde=True); plt.title('Distribution of ROI'); plt.show()

4. Define target and features

y = df['ROI']                       # percentage return, already in dataset
X = df.drop(columns=['ROI','Property_ID'])  # drop unique identifier

# Identify column types
cat_cols = X.select_dtypes(include='object').columns
num_cols = X.select_dtypes(exclude='object').columns

5. Pre‑processing pipeline

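The ColumnTransformer one-hot-encodes categorical fields (city, property type) and z-scales numeric ones (price, lot size) so that the penalty treats every coefficient on a comparable scale. It is wrapped in a Pipeline in step 7, so the encoder and scaler are refit on the training folds only, which prevents data leakage during CV.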

# Note: scikit-learn >= 1.2 uses sparse_output; older releases call it sparse
preprocess = ColumnTransformer([
        ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
        ('num', StandardScaler(), num_cols)
    ])

6. Train/test split

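# Stratify by city so both splits keep the same city mix
# (assumes a 'City' column in which every city appears at least twice)
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=df['City'])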

7. Build & tune Elastic Net

GridSearchCV explores 180 hyper-parameter pairs (20 α values × 9 mix ratios), scoring each with 5-fold cross-validation and keeping the pair with the lowest cross-validated RMSE.
Elastic Net logic: it combines Ridge's squared ℓ2 penalty (curbs coefficient blow-up under multicollinearity) with Lasso's ℓ1 penalty (drives tiny effects exactly to zero), giving a model that is both robust and interpretable.
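For reference, the objective that scikit-learn's ElasticNet minimises, as stated in its documentation, can be written as below. Here n is the number of training rows, w the coefficient vector, α the `alpha` parameter and ρ the `l1_ratio` parameter (the symbol ρ is a notational choice made here, and the intercept is omitted for brevity).

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the ℓ1 term (Lasso), and ρ = 0 leaves only the squared ℓ2 term (Ridge).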

pipe = Pipeline([
        ('prep', preprocess),
        ('enet', ElasticNet(max_iter=10000, random_state=42))
    ])

# α controls overall penalty strength; l1_ratio sets the Lasso share of the mix (1 = pure Lasso, 0 = pure Ridge)
param_grid = {'enet__alpha': np.logspace(-3, 1, 20),
              'enet__l1_ratio': np.linspace(0.1, 0.9, 9)}

search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error',
                      n_jobs=-1, verbose=1)
search.fit(X_train, y_train)

print("Best α:", search.best_params_['enet__alpha'])
print("Best l1_ratio:", search.best_params_['enet__l1_ratio'])
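Because the scorer is a negated RMSE, `best_score_` comes back negative. The sketch below shows one way to turn the search results into a readable table of cross-validated errors. It runs on synthetic stand-in data from `make_regression` with a deliberately small grid, so the names `demo_search` and `cv_table` and all numbers are illustrative assumptions, not results for the real-estate data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the real-estate dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

demo_pipe = Pipeline([("scale", StandardScaler()),
                      ("enet", ElasticNet(max_iter=10000))])
demo_grid = {"enet__alpha": [0.01, 0.1, 1.0],
             "enet__l1_ratio": [0.2, 0.5, 0.8]}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5,
                           scoring="neg_root_mean_squared_error")
demo_search.fit(X_demo, y_demo)

# Scores are negated RMSE, so flip the sign to read them as errors
best_cv_rmse = -demo_search.best_score_

cv_table = (pd.DataFrame(demo_search.cv_results_)
              .assign(cv_rmse=lambda d: -d["mean_test_score"])
              [["param_enet__alpha", "param_enet__l1_ratio", "cv_rmse"]]
              .sort_values("cv_rmse"))
print(f"Best CV RMSE: {best_cv_rmse:.3f}")
print(cv_table.head())
```

The same pattern should apply to the `search` object above: if the best α sits at either end of the grid, widening the `np.logspace` range is usually worth a try.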

8. Evaluate on the hold‑out set

RMSE reports the typical error size in ROI percentage points (easy for investors to digest) and penalises large misses more heavily than small ones, while R² shows the share of variance in ROI that the model explains.

y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False was removed in scikit-learn 1.6
r2   = r2_score(y_test, y_pred)

print(f"Test RMSE: {rmse:.2f} ROI points | R²: {r2:.3f}")
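To make the two metrics concrete, the following worked example computes them by hand with NumPy and cross-checks R² against scikit-learn. The four ROI values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical ROI values in percentage points, for illustration only
roi_true = np.array([5.0, 7.0, 9.0, 11.0])
roi_pred = np.array([6.0, 7.0, 8.0, 12.0])

residuals = roi_true - roi_pred
rmse_manual = np.sqrt(np.mean(residuals ** 2))          # sqrt(3/4)

ss_res = np.sum(residuals ** 2)                          # unexplained variation = 3
ss_tot = np.sum((roi_true - roi_true.mean()) ** 2)       # total variation = 20
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.3f} | R²: {r2_manual:.3f}")  # RMSE: 0.866 | R²: 0.850
print(np.isclose(r2_manual, r2_score(roi_true, roi_pred)))
```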

9. Interpret coefficients

Non-zero coefficients reveal ROI levers: perhaps short-term rental licences, waterfront views, or low property-tax counties. Zeroed dummies can be ignored when sourcing future deals.

ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])

# Reverse scaling for numeric features
scales = search.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
enet_coefs = search.best_estimator_.named_steps['enet'].coef_
# Divide numeric coeffs by their scales to return to original units
final_coefs = enet_coefs.copy()
final_coefs[-len(num_cols):] = enet_coefs[-len(num_cols):] / scales

imp = (pd.Series(final_coefs, index=feature_names)
         .sort_values(key=abs, ascending=False))

plt.figure(figsize=(8,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net Coefficients – ROI Drivers')
plt.xlabel('Coefficient (Δ ROI %)'); plt.show()
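The division by `scale_` can be sanity-checked empirically. The sketch below fits a small scaler-plus-Elastic-Net pipeline on synthetic data (the `area` and `age` features and their coefficients are assumptions made up for this check), then confirms that raising a feature by one original unit moves the prediction by exactly `coef_ / scale_`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (assumed names: area, age)
area = rng.normal(1500.0, 400.0, size=300)
age = rng.normal(20.0, 8.0, size=300)
X_syn = np.column_stack([area, age])
y_syn = 0.004 * area - 0.05 * age + rng.normal(0.0, 0.5, size=300)

syn_pipe = Pipeline([("scale", StandardScaler()),
                     ("enet", ElasticNet(alpha=0.01, l1_ratio=0.5))])
syn_pipe.fit(X_syn, y_syn)

# Coefficients per original unit: scaled coefficient / feature std
per_unit = (syn_pipe.named_steps["enet"].coef_
            / syn_pipe.named_steps["scale"].scale_)

# Empirical check: raise each feature by one unit and watch the prediction move
base = X_syn[:1]
deltas = []
for j in range(X_syn.shape[1]):
    bumped = base.copy()
    bumped[0, j] += 1.0
    deltas.append(syn_pipe.predict(bumped)[0] - syn_pipe.predict(base)[0])

print(np.round(per_unit, 5), np.round(deltas, 5))
```

The two printed arrays should agree, which is why the rescaled numeric coefficients in the chart can be read as "change in ROI per one unit of the raw feature". Dummy coefficients need no rescaling because the one-hot columns are not standardised.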

Summary

This end‑to‑end notebook shows how an Elastic Net “mixed regression” model can turn raw listing data into:

Accurate ROI forecasts before money changes hands.
A succinct ranking of profit drivers, enabling buyers to zero in on high‑return neighbourhoods and amenities.
A tidy, one‑click retraining flow—new market data in, .fit() out—keeping the model evergreen with minimal upkeep.

Armed with these insights, investors can negotiate smarter, finance wisely, and build portfolios that compound returns instead of surprises.
