Urban Expansion Cost Prediction with ElasticNet Regression in ML

City planners and real-estate developers must estimate the capital cost (USD per m²) of urban expansion projects, such as new residential or mixed-use districts, long before infrastructure bids are solicited. Early inputs such as gross floor area (GFA), building height, road network length, green-space ratio, zoning category, and project year are highly collinear (taller buildings ↔ bigger GFA ↔ more roads), which makes ordinary least-squares coefficients unstable. At the same time, a pure Lasso model may over-shrink and discard useful predictors. Elastic Net combines the Ridge (ℓ2) and Lasso (ℓ1) penalties, so it handles multicollinearity while still producing a sparse model, which gives a transparent, robust way to forecast expansion costs from high-level design parameters.
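
For reference, the combined penalty can be written out explicitly. The form below is the objective given in the scikit-learn documentation for ElasticNet; the symbols are notation chosen here, with w the coefficient vector, n the number of training samples, α the `alpha` parameter, and ρ the `l1_ratio` parameter.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

With ρ = 1 only the ℓ1 (Lasso) term remains, and with ρ = 0 only the squared ℓ2 (Ridge-style) term remains; values in between mix the two.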

Libraries Required

Purpose              Library
Data handling        pandas, numpy
Visualization        matplotlib, seaborn
ML pipeline          scikit-learn: ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Evaluation metrics   scikit-learn: mean_squared_error, r2_score

Dataset

Construction Estimation Data

Step by Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

2. Load and Inspect Data

df = pd.read_csv("construction_estimation_data.csv")  # adjust path if needed
print(df.head())
sns.histplot(df['Total_Cost_USD']/1e6, kde=True)
plt.xlabel('Cost (million USD)')
plt.title('Project Cost Distribution')
plt.show()

3. Define Target and Features

  • Target normalization: dividing total cost by GFA yields a unit cost (USD / m²), which removes most of the project-size effect from the target.
  • Elastic Net necessity: GFA, floors, façade area, and zoning are correlated; the Ridge component keeps groups of correlated predictors stable, while the Lasso component shrinks noise predictors to zero.
# Normalize cost by GFA for unit metric
df['Cost_per_m2'] = df['Total_Cost_USD'] / df['GFA_m2']
y = df['Cost_per_m2']

X = df[['Project_Type','Structure_System','Zone_Class','City',
        'GFA_m2','Floors','Facade_Area_m2','Duration_Months','Year']]

cat_cols = ['Project_Type','Structure_System','Zone_Class','City']
num_cols = ['GFA_m2','Floors','Facade_Area_m2','Duration_Months','Year']
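
The collinearity claim above can be checked before any modelling. The following is a minimal sketch on synthetic data: the `demo` frame, its invented values, and the 0.8 threshold are assumptions for illustration only. With the real dataset, `df[num_cols].corr()` would take the place of `demo.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (all values invented for illustration)
rng = np.random.default_rng(0)
n = 200
floors = rng.integers(2, 40, size=n)
gfa = floors * rng.normal(1200, 150, size=n)       # GFA grows with storey count
facade = 0.4 * gfa + rng.normal(0, 500, size=n)    # façade area tracks GFA
demo = pd.DataFrame({
    'GFA_m2': gfa,
    'Floors': floors,
    'Facade_Area_m2': facade,
    'Duration_Months': rng.integers(6, 48, size=n),  # unrelated to size here
})

# Pairwise Pearson correlations between the numeric inputs
corr = demo.corr()
print(corr.round(2))

# List feature pairs whose absolute correlation exceeds a chosen threshold
threshold = 0.8  # arbitrary cut-off, adjust to taste
cols = list(corr.columns)
high_pairs = [(a, b) for i, a in enumerate(cols) for b in cols[i + 1:]
              if abs(corr.loc[a, b]) > threshold]
print("Highly correlated pairs:", high_pairs)
```

Pairs that show up in `high_pairs` are the ones where plain least squares tends to produce unstable coefficients.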

4. Build Elastic Net Pipeline

Pipeline and CV: ColumnTransformer handles encoding and scaling inside each fold, avoiding leakage; GridSearchCV evaluates 162 combinations (18 α × 9 l1 ratios) to minimise RMSE.

preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first'), cat_cols),
    ('num', StandardScaler(),            num_cols)
])

pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20000, random_state=42))
])

5. Train/Test Split & Hyper‑parameter Search

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=df['Project_Type'])

param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),   # 0.001 → 10  
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}

gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1).fit(X_train, y_train)

print("Best alpha   :", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
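
The size of the search can be confirmed without fitting anything. This short sketch rebuilds the same grid and counts the candidates with scikit-learn's `ParameterGrid`; the variable names `n_candidates`, `n_folds`, and `n_fits` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
param_grid = {
    'enet__alpha': np.logspace(-3, 1, 18),
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9),
}

n_candidates = len(ParameterGrid(param_grid))  # 18 alphas x 9 l1 ratios
n_folds = 5                                    # matches cv=5
n_fits = n_candidates * n_folds                # cross-validation fits, excluding the final refit

print(n_candidates, n_fits)
```

Knowing the fit count up front helps when deciding whether to coarsen the grid on larger datasets.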

6. Evaluate Model

y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # the squared=False argument is deprecated in newer scikit-learn
r2   = r2_score(y_test, y_pred)

print(f"Hold‑out RMSE: ${rmse:.2f} per m² | R²: {r2:.3f}")

7. Interpret Key Drivers

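Interpretation: the coefficients might show, for example, that Steel-Frame projects cost $120/m² more than concrete, that each extra storey reduces unit cost by $5/m² (economies of scale), and that Zone C-Industrial carries a $30/m² premium due to heavy-duty requirements. Because the encoder uses drop='first', each categorical coefficient is measured relative to the dropped baseline level of that feature.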

ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])

scales = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
coef   = gs.best_estimator_.named_steps['enet'].coef_.copy()  # copy so the fitted model is not modified
coef[-len(num_cols):] = coef[-len(num_cols):] / scales  # back-scale numerics to original units

pd.Series(coef, index=feature_names) \
  .sort_values(key=abs, ascending=False) \
  .head(12) \
  .plot(kind='barh', figsize=(10,6))
plt.gca().invert_yaxis()
plt.xlabel('Δ Cost (USD per m²)')
plt.title('Elastic Net – Top Cost Drivers for Urban Expansion')
plt.tight_layout()
plt.show()
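
The back-scaling step in the code above can be illustrated on a toy problem. This is a minimal sketch with invented data, and it deliberately swaps in plain `LinearRegression` instead of Elastic Net so that shrinkage does not blur the arithmetic: a coefficient fitted on standardized inputs is a change per standard deviation, and dividing by the scaler's `scale_` converts it to a change per original unit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Invented data: unit cost falls by exactly 5 USD/m² per extra storey
rng = np.random.default_rng(0)
x = rng.uniform(1, 40, size=(100, 1))   # hypothetical storey counts
y = 1000 - 5.0 * x[:, 0]

# Fit on standardized inputs, as the pipeline does for numeric columns
scaler = StandardScaler().fit(x)
model = LinearRegression().fit(scaler.transform(x), y)

coef_per_sd = model.coef_[0]                     # effect of one standard deviation
coef_per_unit = coef_per_sd / scaler.scale_[0]   # effect of one storey

print(f"per SD: {coef_per_sd:.2f} | per storey: {coef_per_unit:.2f}")
```

With Elastic Net the recovered per-unit value would be shrunk toward zero by the penalty, so back-scaled coefficients are best read as regularized estimates rather than exact marginal effects.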

Summary

This Elastic Net workflow produces a robust, interpretable model that:

  1. Forecasts unit expansion cost early, with out-of-sample error measured on a hold-out set.
  2. Balances multicollinearity and sparsity, retaining essential scope drivers while discarding irrelevant dummies.
  3. Surfaces actionable levers (structural system, number of floors, zoning class) for planners and financiers to optimise budgets and negotiate more precise bids.
