Urban Expansion Cost Prediction with ElasticNet Algorithm in ML
City planners and real‑estate developers must estimate the capital cost (USD per m²) of urban expansion projects, such as new residential or mixed‑use districts, long before infrastructure bids are solicited. Early inputs such as gross floor area (GFA), building height, road network length, green‑space ratio, zoning category, and project year are highly collinear (taller buildings ↔ bigger GFA ↔ more roads), which makes ordinary least‑squares coefficients unstable. A pure Lasso model, on the other hand, tends to keep one predictor from a correlated group and discard the rest, even when they carry useful signal. Elastic Net combines the Ridge (ℓ2) and Lasso (ℓ1) penalties: the Ridge term stabilises the coefficients of correlated predictors, while the Lasso term zeroes out irrelevant ones, yielding a transparent, robust model for forecasting expansion costs from high‑level design parameters.
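For reference, the objective minimised by scikit‑learn's ElasticNet can be written as below. The symbols are notation introduced here rather than taken from the article: n is the number of training projects, X the feature matrix, y the unit costs, w the coefficient vector, α the overall penalty strength (`alpha`), and ρ the mixing parameter (`l1_ratio`).

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

With ρ = 1 the penalty is pure Lasso; as ρ approaches 0 it becomes a Ridge‑style ℓ2 penalty.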
Libraries Required
| Purpose | Library |
| --- | --- |
| Data handling | pandas, numpy |
| Visualization | matplotlib, seaborn |
| ML pipeline | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Evaluation metrics | mean_squared_error, r2_score |
Dataset
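The dataset itself is not reproduced here. To run the steps below without it, one option is a synthetic stand‑in that uses the same column names as the code in this article. Everything else in this sketch is an assumption made for illustration: the function name `make_synthetic_projects`, the category labels, the value ranges, and the cost formula are all invented and say nothing about real projects.

```python
import numpy as np
import pandas as pd


def make_synthetic_projects(n=500, seed=0):
    """Build a synthetic table with the columns used in this article (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Numeric features: floors and facade area are deliberately correlated with GFA.
    gfa = rng.uniform(2_000, 60_000, n)
    floors = np.clip(np.round(gfa / 2_500 + rng.normal(0, 1.5, n)), 1, None)
    facade = np.clip(0.35 * gfa + rng.normal(0, 1_500, n), 500, None)

    df = pd.DataFrame({
        "Project_Type": rng.choice(["Residential", "Mixed_Use", "Commercial"], n),
        "Structure_System": rng.choice(["Concrete", "Steel_Frame"], n),
        "Zone_Class": rng.choice(["A", "B", "C"], n),
        "City": rng.choice(["City_1", "City_2", "City_3"], n),
        "GFA_m2": gfa,
        "Floors": floors.astype(int),
        "Facade_Area_m2": facade,
        "Duration_Months": rng.integers(8, 48, n),
        "Year": rng.integers(2015, 2025, n),
    })

    # Arbitrary unit cost in USD per m2; the coefficients are made up.
    unit_cost = (
        1_200
        + 100 * (df["Structure_System"] == "Steel_Frame")
        + 4 * df["Floors"]
        + rng.normal(0, 60, n)
    )
    df["Total_Cost_USD"] = unit_cost * df["GFA_m2"]
    return df


df_demo = make_synthetic_projects()
print(df_demo.head())
```

Calling `df_demo.to_csv("construction_estimation_data.csv", index=False)` would write a file with the name that the loading step expects.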
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
2. Load and Inspect Data
df = pd.read_csv("construction_estimation_data.csv") # adjust path if needed
print(df.head())
sns.histplot(df['Total_Cost_USD']/1e6, kde=True)
plt.xlabel('Cost (million USD)')
plt.title('Project Cost Distribution')
plt.show()
3. Define Target and Features
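- Target normalisation: dividing total cost by GFA yields a unit cost (USD / m²), which removes most of the project‑size effect from the target so that large and small projects are comparable.
- Elastic Net necessity: GFA, floors, façade area, and zoning are correlated; the Ridge term keeps groups of correlated predictors stable while the Lasso term prunes noise features.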
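The collinearity claim can be checked before modelling. The sketch below uses synthetic numbers (not the article's dataset) whose columns are built to be correlated, then computes the correlation matrix and variance inflation factors (VIF). It relies on the identity that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix; the usual rule of thumb that a VIF above roughly 5 to 10 signals trouble is a convention, not a hard threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic numeric features: floors and facade area are driven by GFA.
gfa = rng.uniform(2_000, 50_000, n)
floors = np.clip(np.round(gfa / 2_500 + rng.normal(0, 1, n)), 1, None)
facade = 0.4 * gfa + rng.normal(0, 1_500, n)
duration = rng.uniform(8, 48, n)  # generated independently of the others

num = pd.DataFrame({
    "GFA_m2": gfa,
    "Floors": floors,
    "Facade_Area_m2": facade,
    "Duration_Months": duration,
})

# Pairwise Pearson correlations between numeric features.
corr = num.corr()

# VIF_j is the j-th diagonal element of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr.to_numpy())), index=corr.columns)

print(corr.round(2))
print(vif.round(1))
```

On the real data, `num` would be replaced by `df[num_cols]`.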
# Normalize cost by GFA for unit metric
df['Cost_per_m2'] = df['Total_Cost_USD'] / df['GFA_m2']
y = df['Cost_per_m2']
X = df[['Project_Type','Structure_System','Zone_Class','City',
        'GFA_m2','Floors','Facade_Area_m2','Duration_Months','Year']]
cat_cols = ['Project_Type','Structure_System','Zone_Class','City']
num_cols = ['GFA_m2','Floors','Facade_Area_m2','Duration_Months','Year']
4. Build Elastic Net Pipeline
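Pipeline and CV: because the ColumnTransformer sits inside the Pipeline, GridSearchCV refits the encoder and scaler on the training portion of each fold, which avoids leakage from the validation data. The search evaluates 162 combinations (18 α values × 9 l1_ratio values) and selects the one with the lowest cross‑validated RMSE.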
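preprocess = ColumnTransformer([
    # handle_unknown='ignore' (with drop, needs scikit-learn >= 1.0) encodes categories
    # unseen during fit as all zeros instead of raising an error in a CV fold
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20000, random_state=42))
])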
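To illustrate why fitting preprocessing inside a Pipeline avoids leakage, here is a small sketch on random toy data. The names `X_toy`, `y_toy`, and `toy_pipe` are hypothetical. It checks that the scaler's learned mean comes from the training rows only, not from the full data set.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=100.0, scale=20.0, size=(120, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 5, 120)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5)),
])
toy_pipe.fit(X_tr, y_tr)  # the scaler sees only the training rows

fitted_mean = toy_pipe.named_steps["scale"].mean_
uses_train_only = np.allclose(fitted_mean, X_tr.mean(axis=0))
matches_full_data = np.allclose(fitted_mean, X_toy.mean(axis=0))

print("Scaler mean equals training mean :", uses_train_only)
print("Scaler mean equals full-data mean:", matches_full_data)
```

The same happens inside each cross‑validation fold when a pipeline is passed to GridSearchCV: the scaler and encoder are refit on that fold's training rows.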
5. Train/Test Split & Hyper‑parameter Search
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=df['Project_Type'])
param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),   # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}
gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1).fit(X_train, y_train)
print("Best alpha :", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
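As a quick sanity check on the search size, the grid from this step can be counted without fitting anything. The sketch uses scikit‑learn's `ParameterGrid`, and the variable names `n_candidates` and `n_fits` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "enet__alpha": np.logspace(-3, 1, 18),       # 18 values from 0.001 to 10
    "enet__l1_ratio": np.linspace(0.1, 0.9, 9),  # 9 values from 0.1 to 0.9
}

n_candidates = len(ParameterGrid(param_grid))  # every alpha paired with every l1_ratio
n_fits = n_candidates * 5                      # 5-fold cross-validation

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

With the default `refit=True`, GridSearchCV adds one final fit of the best candidate on the whole training set.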
6. Evaluate Model
y_pred = gs.predict(X_test)
# squared=False was deprecated in scikit-learn 1.4 and later removed; np.sqrt works on all versions
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:.2f} per m² | R²: {r2:.3f}")
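To make the two metrics concrete, the following sketch computes RMSE and R² by hand on four made‑up unit costs. The numbers are purely illustrative and are not model output.

```python
import numpy as np

# Hypothetical actual and predicted unit costs in USD per m2.
y_true = np.array([1000.0, 1200.0, 900.0, 1100.0])
y_hat = np.array([1050.0, 1150.0, 950.0, 1000.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual, in the target's units.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.2f} USD per m2 | R2: {r2_manual:.2f}")
```

RMSE reads in the same units as the target, so it can be compared directly with typical unit costs. R² is unitless and measures improvement over always predicting the mean.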
7. Interpret Key Drivers
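Interpretation: After back‑scaling, numeric coefficients read as the change in USD per m² for a one‑unit change in the feature, and one‑hot coefficients read as the difference from the dropped baseline category. The coefficients might reveal, for example, that Steel‑Frame projects cost $120/m² more than concrete, each extra storey reduces unit cost by $5/m² (economies of scale), and Zone C‑Industrial carries a $30/m² premium due to heavy‑duty requirements.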
best = gs.best_estimator_
prep = best.named_steps['prep']
ohe = prep.named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
scales = prep.named_transformers_['num'].scale_
coef = best.named_steps['enet'].coef_.copy() # copy, so the fitted model is not modified in place
coef[-len(num_cols):] = coef[-len(num_cols):] / scales # back‑scale numerics
pd.Series(coef, index=feature_names) \
.sort_values(key=abs, ascending=False) \
.head(12) \
.plot(kind='barh', figsize=(10,6))
plt.gca().invert_yaxis()
plt.xlabel('Δ Cost (USD per m²)')
plt.title('Elastic Net – Top Cost Drivers for Urban Expansion')
plt.tight_layout()
plt.show()
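The back‑scaling step can be verified on a toy example where the true slope is known. This sketch swaps in plain `LinearRegression` instead of Elastic Net so the result is exact. The data are made up: y rises by exactly 3 per unit of x.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data with a known slope of 3 per original unit of x.
x = np.arange(10, dtype=float).reshape(-1, 1) * 100.0
y = 3.0 * x.ravel() + 50.0

scaler = StandardScaler().fit(x)
model = LinearRegression().fit(scaler.transform(x), y)

# Coefficient per standard deviation of x, copied so the model stays untouched.
coef_std = model.coef_.copy()

# Dividing by the scaler's standard deviation gives the slope per original unit.
coef_orig = coef_std / scaler.scale_

print("Per-std coefficient :", coef_std[0])
print("Per-unit coefficient:", coef_orig[0])
```

With a penalised model the recovered slope would be shrunk towards zero, but the unit conversion works the same way.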
Summary
This Elastic Net workflow produces a robust, interpretable model that:
- Forecasts unit expansion cost early, with out‑of‑sample error quantified by hold‑out RMSE and R².
- Balances multicollinearity & sparsity, retaining essential scope drivers while discarding irrelevant dummies.
- Surfaces actionable levers—structural system, number of floors, zoning class—for planners and financiers to optimise budgets and negotiate more precise bids.