Property Investment Return Prediction with Lasso & Ridge Mixed Regression in ML
Real‑estate investors size up dozens of attributes (neighbourhood, floor area, property age, walk‑score) yet still gamble on the ultimate return on investment (ROI) for each unit. Classic linear regression produces unstable coefficients when predictors are strongly correlated, as price, area and bedroom count usually are, while a pure Lasso model can over-shrink relevant variables or arbitrarily keep only one of a correlated group. Elastic Net blends Ridge’s stability with Lasso’s automatic feature selection, yielding a sparse yet reliable model that forecasts per‑property ROI before the offer is signed.
Libraries Required
| Purpose | Library |
| --- | --- |
| Core data wrangling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| ML pipeline | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Metrics | mean_squared_error, r2_score |
Dataset Link
Real Estate Data with pre‑calculated ROI field
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
2. Download and load the dataset
Dataset — property listings with sale price, rent, taxes, location, and a ready‑made ROI label (annualised percentage).
# One‑time shell command (Kaggle API key required):
# kaggle datasets download -d affiliatearmy/real-estate-data -p data --unzip
df = pd.read_csv("data/real_estate_data.csv") # adjust filename if different
3. Quick EDA
print(df.head())
print(df[['ROI','Price','Bedrooms','Bathrooms']].describe())
sns.histplot(df['ROI'], kde=True); plt.title('Distribution of ROI'); plt.show()
4. Define target and features
y = df['ROI']  # percentage return, already in dataset
X = df.drop(columns=['ROI', 'Property_ID'])  # drop unique identifier

# Identify column types
cat_cols = X.select_dtypes(include='object').columns
num_cols = X.select_dtypes(exclude='object').columns
5. Pre‑processing pipeline
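The ColumnTransformer one‑hot‑encodes categorical fields (city, property type) and z‑scales numeric ones (price, lot size) so that the penalty treats every coefficient on a comparable scale. Wrapping it in a Pipeline together with the model (step 7) means the encoder and scaler are re-fit on each training fold, which prevents data leakage during CV.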
preprocess = ColumnTransformer([
    # sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
    ('num', StandardScaler(), num_cols)
])
6. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=df['City'])
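The effect of `stratify` can be checked on a small hypothetical frame (two invented cities with ten listings each). This is only a sketch of the behaviour, not a claim about the real dataset's city distribution.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 10 listings in each of two cities
toy = pd.DataFrame({
    "City": ["Austin"] * 10 + ["Boston"] * 10,
    "ROI": [float(i) for i in range(20)],
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["City"])

# Each city keeps its 50% share in both parts
print(toy_train["City"].value_counts().to_dict())
print(toy_test["City"].value_counts().to_dict())
```

One caveat: `train_test_split` raises a `ValueError` when any stratification class has a single member, so cities with only one listing would need to be grouped (for example into an "Other" bucket) or the `stratify` argument dropped.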
7. Build & tune Elastic Net
- GridSearchCV explores 180 hyper‑parameter pairs (20 α × 9 mix ratios) to minimise cross‑validated RMSE.
- Elastic Net logic — combines Ridge’s ℓ² term (curbs coefficient blow‑up under multicollinearity) with Lasso’s ℓ¹ term (drives tiny effects to zero), giving a model that is both robust and interpretable.
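For reference, the objective that scikit-learn documents for `ElasticNet` can be written as below, where n is the number of training rows, w the coefficient vector, α the `alpha` parameter and ρ the `l1_ratio` (the symbols α and ρ are notation chosen here):

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the ℓ¹ term (Lasso), while ρ = 0 leaves only the ℓ² term (a Ridge-style penalty); values in between trade sparsity against stability.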
pipe = Pipeline([
('prep', preprocess),
('enet', ElasticNet(max_iter=10000, random_state=42))
])
# α controls overall penalty; l1_ratio sets Ridge‑vs‑Lasso mix
param_grid = {'enet__alpha': np.logspace(-3, 1, 20),
'enet__l1_ratio': np.linspace(0.1, 0.9, 9)}
search = GridSearchCV(pipe, param_grid, cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1)
search.fit(X_train, y_train)
print("Best α:", search.best_params_['enet__alpha'])
print("Best l1_ratio:", search.best_params_['enet__l1_ratio'])
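The size of the search is easy to verify before committing compute time. The sketch below rebuilds the same grid and counts candidates with `ParameterGrid`; the fit count assumes the 5-fold setting used in step 7.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid = {"enet__alpha": np.logspace(-3, 1, 20),      # 0.001 ... 10
        "enet__l1_ratio": np.linspace(0.1, 0.9, 9)}  # 0.1 ... 0.9

n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 5  # 5-fold CV; the final refit adds one more
print(n_candidates, n_fits)  # 180 900
```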
8. Evaluate on the hold‑out set
RMSE reports the typical size of the prediction error in ROI percentage points (easy for investors to digest, though it weights large misses more heavily than a plain average), while R² shows the share of variance explained.
y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False was removed in scikit-learn 1.6
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f} ROI points | R²: {r2:.3f}")
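As a sanity check on what the two metrics mean, they can be computed by hand on a few hypothetical ROI values (the numbers below are invented) and compared with scikit-learn's results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical ROI values in percentage points
y_true = np.array([5.0, 7.0, 9.0, 11.0])
y_hat = np.array([6.0, 6.0, 10.0, 10.0])  # every prediction is off by 1 point

# RMSE: square the errors, average them, take the root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(rmse_manual, r2_manual)  # 1.0 0.8
print(np.sqrt(mean_squared_error(y_true, y_hat)), r2_score(y_true, y_hat))
```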
9. Interpret coefficients
Non-zero coefficients reveal ROI levers: perhaps short-term rental licences, waterfront views, or low property-tax counties. Zeroed dummies can be ignored when sourcing future deals.
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
# Reverse scaling for numeric features
scales = search.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
enet_coefs = search.best_estimator_.named_steps['enet'].coef_
# Divide numeric coeffs by their scales to return to original units
final_coefs = enet_coefs.copy()
final_coefs[-len(num_cols):] = enet_coefs[-len(num_cols):] / scales
imp = (pd.Series(final_coefs, index=feature_names)
.sort_values(key=abs, ascending=False))
plt.figure(figsize=(8,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net Coefficients – ROI Drivers')
plt.xlabel('Coefficient (Δ ROI %)'); plt.show()
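The division by the scaler's `scale_` can be illustrated in isolation. The sketch below uses plain `LinearRegression` rather than Elastic Net so that the original-unit slope is recovered exactly; with a penalty the same division applies, but the recovered value is shrunk. The floor-area data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: ROI rises 0.001 points per extra square foot
area = np.array([[1000.0], [2000.0], [3000.0], [4000.0]])
roi = 2.0 + 0.001 * area.ravel()

scaler = StandardScaler().fit(area)
model = LinearRegression().fit(scaler.transform(area), roi)

coef_scaled = model.coef_[0]                    # ROI change per standard deviation of area
coef_original = coef_scaled / scaler.scale_[0]  # ROI change per square foot
print(coef_scaled, coef_original)
```

Only the numeric coefficients need this step: one-hot dummies are not scaled, so their coefficients already read as the ROI difference relative to the dropped baseline category.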
Summary
This end‑to‑end notebook shows how an Elastic Net “mixed regression” model can turn raw listing data into:
- Accurate ROI forecasts before money changes hands.
- A succinct ranking of profit drivers, enabling buyers to zero in on high‑return neighbourhoods and amenities.
- A tidy, one‑click retraining flow—new market data in, .fit() out—keeping the model evergreen with minimal upkeep.
Armed with these insights, investors can negotiate smarter, finance wisely, and build portfolios that compound returns instead of surprises.