Real Estate Development Cost Prediction with Ridge & Lasso Mixed Regression in ML

Real-estate developers need to forecast the total construction cost of a new residential project before breaking ground, using only early-stage indicators such as land area, number of units, average floor count, local material price index, and labor-rate index. Costs scale nonlinearly with project size (bulk-purchase discounts) and complexity (multistory premiums), and different predictors may dominate in different regimes (e.g., low-rise vs. high-rise). A pure Lasso model can zero out a useful predictor when it is correlated with another, while a pure Ridge model keeps every predictor, including minor ones, with a nonzero weight. A Mixed Regression approach, which combines the Lasso (ℓ₁) and Ridge (ℓ₂) penalties via ElasticNet, provides both variable selection and coefficient shrinkage, giving more stable, interpretable cost forecasts.
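
The difference between the three penalties is easiest to see by counting exact-zero coefficients. The sketch below is illustrative only: it uses synthetic data from make_regression rather than the Ames dataset, and the alpha and l1_ratio values are arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Ames dataset):
# 10 features, only 3 of which actually drive the target.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=10.0, random_state=0
)
X_syn = StandardScaler().fit_transform(X_syn)

# Same penalty strength for all three models; only the penalty type differs.
models = {
    "lasso": Lasso(alpha=5.0),
    "ridge": Ridge(alpha=5.0),
    "enet": ElasticNet(alpha=5.0, l1_ratio=0.5),
}

n_zero = {}
for name, model in models.items():
    model.fit(X_syn, y_syn)
    # Count coefficients that were driven exactly to zero
    n_zero[name] = int(np.sum(model.coef_ == 0.0))
    print(f"{name:>5}: {n_zero[name]} of {X_syn.shape[1]} coefficients are exactly zero")
```

Lasso removes features outright, Ridge only shrinks them, and ElasticNet sits between the two depending on l1_ratio.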

Dataset Link

Ames Housing dataset

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd                             # data I/O  
import numpy as np                              # numerical ops  

import matplotlib.pyplot as plt                 # plotting  
import seaborn as sns                           # visualization  

from sklearn.model_selection import train_test_split, GridSearchCV  
from sklearn.preprocessing import StandardScaler  
from sklearn.linear_model import ElasticNet  
from sklearn.pipeline import Pipeline  
from sklearn.metrics import mean_squared_error, r2_score

2. Load Data & Select Features

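The Ames data records sale prices rather than construction costs, so SalePrice stands in as a cost proxy.
LotArea and GarageArea capture project scale (the model assumes cost rises linearly with each);
OverallQual and YearBuilt proxy material quality and complexity;
TotRmsAbvGrd proxies unit count.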

import pandas as pd

# Load train set; proxy cost ≈ SalePrice
df = pd.read_csv("data/train.csv")

# Select early‐stage features
features = [
    "LotArea",       # land area
    "OverallQual",   # overall material & finish quality
    "YearBuilt",     # proxy for complexity
    "TotRmsAbvGrd",  # number of units/rooms
    "GarageArea"     # scale of parking structure
]
X = df[features].copy()
y = df["SalePrice"].values

3. Preprocessing & Train/Test Split

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Impute if needed (here none missing for our features)
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

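# Standardize (for inspection only: the pipeline in step 4 contains its own
# StandardScaler, refit inside each CV fold, so these arrays are not used later)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)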
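
The "none missing" comment above can be checked rather than assumed. This is a minimal sketch: report_missing is a hypothetical helper, and the toy frame holds made-up values standing in for df[features].

```python
import numpy as np
import pandas as pd

def report_missing(frame: pd.DataFrame) -> pd.Series:
    """Return the count of missing values per column, largest first."""
    return frame.isna().sum().sort_values(ascending=False)

# Toy frame standing in for df[features]; the values are made up for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550],
    "OverallQual": [7, 6, 7, 7],
    "GarageArea": [548, 460, 608, 642],
})

missing = report_missing(toy)
print(missing)

# If anything is missing, a median fill is one simple option.
# On real data, compute the medians on the training split only to avoid leakage.
toy_filled = toy.fillna(toy.median(numeric_only=True))
print(toy_filled)
```

On the real data the call would be report_missing(X), run before the train/test split.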

4. Build ElasticNet Pipeline & Hyperparameter Search

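We search α ∈ [0.01, 100] (10 log-spaced values) and l1_ratio ∈ [0, 1] (6 evenly spaced values) with 5-fold CV, optimising RMSE.
StandardScaler centres each feature at zero mean and rescales it to unit variance, so the penalty treats every coefficient on the same scale.
α controls overall regularisation strength;
l1_ratio blends Lasso (feature selection) and Ridge (shrinkage): 1 is pure ℓ₁, 0 is pure ℓ₂.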
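
For reference, the objective that scikit-learn's ElasticNet minimises can be written as below. The symbols are notation introduced here: n is the number of training rows, w the coefficient vector, α the alpha parameter and ρ the l1_ratio parameter.

```latex
\min_{w}\;
\frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the ℓ₁ term (Lasso), and ρ = 0 leaves only the ℓ₂ term (a Ridge-style penalty).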

from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
import numpy as np

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNet(max_iter=5000, random_state=42))
])

param_grid = {
    "enet__alpha": np.logspace(-2, 2, 10),    # overall penalty
    "enet__l1_ratio": np.linspace(0, 1, 6)    # mix: 0 = pure L2 (Ridge-like), 1 = pure L1 (Lasso)
}

gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best params:", gs.best_params_)
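
Beyond the single best setting, the whole grid can be inspected through cv_results_. The sketch below is self-contained, so it fits a smaller, hypothetical grid on synthetic data; the demo_ names are illustrative and not part of the tutorial's code.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption) so the sketch runs without train.csv.
X_demo, y_demo = make_regression(
    n_samples=150, n_features=5, noise=20.0, random_state=0
)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNet(max_iter=5000)),
])
demo_grid = {
    "enet__alpha": [0.01, 0.1, 1.0],
    "enet__l1_ratio": [0.2, 0.5, 0.8],
}
demo_gs = GridSearchCV(
    demo_pipe, demo_grid, cv=5, scoring="neg_root_mean_squared_error"
)
demo_gs.fit(X_demo, y_demo)

# cv_results_ stores the negated RMSE; flip the sign to read it as an error.
results = pd.DataFrame(demo_gs.cv_results_)
results["cv_rmse"] = -results["mean_test_score"]
table = (
    results[["param_enet__alpha", "param_enet__l1_ratio", "cv_rmse", "std_test_score"]]
    .sort_values("cv_rmse")
    .reset_index(drop=True)
)
print(table.head())
```

With the tutorial's own search, the same table comes from pd.DataFrame(gs.cv_results_). If the best few rows have nearly identical RMSE, the model is not very sensitive to the exact penalty mix.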

5. Evaluate on Test Set

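RMSE is the typical prediction error in dollars, with large misses weighted more heavily than small ones.
R² is the share of variance in SalePrice that the model explains.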

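import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = gs.predict(X_test)
# Take the square root explicitly: newer scikit-learn releases deprecate
# and then drop the squared=False argument of mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)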

print(f"Test RMSE: ${rmse:,.0f}")
print(f"Test R²  : {r2:.3f}")
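
Both metrics are simple enough to verify by hand. The following worked example uses three made-up prices and predictions (not results from the Ames data) and computes RMSE and R² directly from their definitions.

```python
import numpy as np

# Three made-up sale prices and predictions, for illustration only
y_true = np.array([200_000.0, 300_000.0, 400_000.0])
y_hat = np.array([210_000.0, 290_000.0, 420_000.0])

errors = y_hat - y_true                          # 10000, -10000, 20000

# RMSE: square the errors, average them, take the root
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Mean absolute error, shown for comparison
mae = np.mean(np.abs(errors))

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE: ${rmse_manual:,.0f}")   # RMSE: $14,142
print(f"MAE : ${mae:,.0f}")           # MAE : $13,333
print(f"R²  : {r2_manual:.3f}")       # R²  : 0.970
```

RMSE comes out above the mean absolute error because the single 20,000 miss is squared before averaging.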

6. Inspect Coefficients

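Because the features are standardized inside the pipeline, each coefficient is the change in predicted price per one-standard-deviation change in that feature, so bar lengths are directly comparable.
Bars at or near zero indicate predictors the model treats as unimportant.
Larger magnitudes indicate stronger cost drivers, guiding focus on key development levers.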

import pandas as pd
import matplotlib.pyplot as plt

coef = gs.best_estimator_.named_steps["enet"].coef_
feat_names = features

imp = pd.Series(coef, index=feat_names).sort_values()
plt.figure(figsize=(6,4))
imp.plot(kind="barh")
plt.title("ElasticNet Coefficients")
plt.xlabel("Coefficient Value")
plt.tight_layout()
plt.show()
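
The plotted coefficients are on the standardized scale. To express them per original unit (for example, dollars per square foot of lot), divide each one by the scaler's standard deviation for that feature. The sketch below shows the conversion on synthetic data; the demo_ names and the column scales are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption), rescaled so that the columns
# have very different units, like lot area vs. a quality score.
X_demo, y_demo = make_regression(
    n_samples=100, n_features=3, noise=5.0, random_state=1
)
X_demo = X_demo * np.array([1000.0, 1.0, 50.0]) + np.array([9000.0, 6.0, 450.0])

fitted = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000)),
]).fit(X_demo, y_demo)

demo_scaler = fitted.named_steps["scale"]
demo_enet = fitted.named_steps["enet"]

# Change in the target per ONE ORIGINAL UNIT of each feature
coef_per_unit = demo_enet.coef_ / demo_scaler.scale_
# Intercept adjusted so the model can be applied to unscaled inputs
intercept_raw = demo_enet.intercept_ - np.sum(coef_per_unit * demo_scaler.mean_)

# Sanity check: the raw-unit formula should reproduce the pipeline's predictions
manual_pred = X_demo @ coef_per_unit + intercept_raw
print(np.allclose(manual_pred, fitted.predict(X_demo)))
```

With the tutorial's objects, the scaler is gs.best_estimator_.named_steps["scale"]. Standardized coefficients are better for ranking features, and per-unit coefficients are better for statements such as "each extra unit of lot area adds this much".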

Summary

By applying a Mixed (ElasticNet) Regression approach, developers gain:

Forecasts of a construction-cost proxy (SalePrice) whose accuracy is quantified on held-out data by RMSE and R².
Automatic feature selection and shrinkage, balancing model sparsity and stability.
Interpretable coefficients that highlight which early‐stage project attributes—such as land area or quality—most drive cost, enabling targeted value engineering.
