Urban Planning Cost Prediction with Ridge & Lasso Mixed Regression in ML

City planners must estimate the final construction costs of parks, libraries, roads, and other capital assets well before shovels hit the ground. Scope creep, site constraints, and agency practices all push budgets off course. Accurate early estimates let planners size bonds correctly, phase work sensibly, and avoid embarrassing overruns.

We will build an Elastic Net regression model—combining Ridge’s stability with Lasso’s sparsity—to predict a project’s updated budget (USD) from publicly available descriptors: managing agency, project type, borough, start year, design phase length, planned duration, and more. The mixed penalty controls collinearity among time‑related fields while discarding weak predictors, delivering a lean, interpretable model.

Libraries Required

Goal	Library
Data handling	pandas, numpy
Visuals	matplotlib, seaborn
ML pipeline	scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Metrics	mean_squared_error, r2_score

Dataset Link

NYC Capital Project Schedules and Budgets

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

2. Download and load the dataset

New York City’s capital‑project ledger lists thousands of park, road, school, and waterfront jobs with planned vs. current budgets and schedules.

# one‑time terminal command (Kaggle API key required):
# kaggle datasets download -d new-york-city/nyc-capital-project-schedules-and-budgets -p data --unzip

df = pd.read_csv("data/Capital_Projects_PDB_Latest.csv")  # filename inside the zip
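
Column names in public exports can change between releases, so a quick check right after loading may save confusion later. The following is a minimal sketch: the helper name missing_columns and the toy frame are illustrative assumptions, not part of the dataset or of pandas.

```python
import pandas as pd

def missing_columns(frame, required):
    """Return the required column names that are absent from frame (hypothetical helper)."""
    return [c for c in required if c not in frame.columns]

# toy stand-in for the loaded CSV (made-up values)
toy = pd.DataFrame({"Borough": ["Queens"], "Current_Budget": [1_500_000]})

absent = missing_columns(toy, ["Borough", "Current_Budget", "Project Type"])
print(absent)   # ['Project Type']
```

Running the same check on the real frame with the predictor list used in step 3 would show any name that needs adjusting before modelling.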

3. Basic clean‑up & target creation

Current_Budget is the most recent cost projection; predicting it from early‑life descriptors simulates real‑world estimating.
The predictors are managing agency, project type, and borough (categorical), plus schedule durations and start year (numeric). All of these are known shortly after scoping, well before overruns occur.

# keep only rows with a non-missing, positive current budget
keep = df[df['Current_Budget'].notna()]
keep = keep[keep['Current_Budget'] > 0]

# choose a few intuitive predictors
cols = ['Managing Agency', 'Project Type', 'Borough',      # categorical
        'Original_Schedule_Duration', 'Design_Duration',   # numeric
        'Construction_Duration', 'Calendar_Year_Started']  # numeric

model_df = keep[cols + ['Current_Budget']].dropna()

4. Define features (X) and target (y)

y = model_df['Current_Budget']          # USD we aim to predict
X = model_df.drop(columns='Current_Budget')

cat_cols = ['Managing Agency', 'Project Type', 'Borough']
num_cols = [c for c in X.columns if c not in cat_cols]
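
As an alternative to listing the categorical columns by hand, the split can be derived from the column dtypes. This is only a sketch on a made-up frame; it assumes the text columns load as non-numeric and the durations as numbers, which should be checked on the real data.

```python
import pandas as pd

# toy feature frame with made-up values
toy_X = pd.DataFrame({
    "Borough": ["Bronx", "Queens"],
    "Project Type": ["Parks", "Streets"],
    "Design_Duration": [120, 300],
    "Calendar_Year_Started": [2016, 2019],
})

# numeric columns get scaled; everything else is treated as categorical
auto_num = toy_X.select_dtypes(include="number").columns.tolist()
auto_cat = toy_X.select_dtypes(exclude="number").columns.tolist()

print(auto_cat)   # ['Borough', 'Project Type']
print(auto_num)   # ['Design_Duration', 'Calendar_Year_Started']
```

The explicit list used above is safer when a numeric-looking code (an agency ID, for instance) should really be treated as a category.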

5. Build an Elastic Net pipeline

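Because one-hot encoding and scaling live inside the pipeline, they are refit on the training portion of every CV fold. This prevents validation-fold statistics from leaking into training and keeps the code tidy.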

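preprocess = ColumnTransformer([
        # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2);
        # handle_unknown='ignore' stops prediction failing on unseen categories
        ('cat', OneHotEncoder(drop='first', sparse_output=False,
                              handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(), num_cols)
    ])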

pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20000, random_state=42))
])
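
The reason for keeping preprocessing inside the pipeline can be seen with a single synthetic column. In the sketch below (random numbers standing in for a duration field, not real project data), a scaler fitted on each training fold learns a different mean from one fitted on all rows, so scaling before splitting would let held-out rows influence training.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
durations = rng.exponential(scale=400, size=(50, 1))   # synthetic "days" column

# mean learned when the scaler sees every row (what leakage looks like)
full_mean = StandardScaler().fit(durations).mean_[0]

# means learned when the scaler sees only each fold's training rows
fold_means = []
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in splitter.split(durations):
    fold_means.append(StandardScaler().fit(durations[train_idx]).mean_[0])

print(f"all rows: {full_mean:.1f}")
print("per training fold:", [round(m, 1) for m in fold_means])
```

A Pipeline passed to GridSearchCV performs the per-fold fitting automatically, which is what the code above relies on.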

6. Train/test split & hyper‑parameter search

α governs total penalty strength (higher α → more shrinkage).
l1_ratio slides between Ridge (0 = pure ℓ²) and Lasso (1 = pure ℓ¹).
Scanning 180 (20 × 9) combinations balances bias, variance, and sparsity.
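
For reference, the objective that scikit-learn's ElasticNet minimises can be written as follows. The symbols are notation chosen for this note: w is the coefficient vector, n the number of training rows, and ρ stands for l1_ratio.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the ℓ1 term (Lasso), while ρ = 0 leaves only the squared ℓ2 term (Ridge).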

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

param_grid = {
    'enet__alpha': np.logspace(-3, 1, 20),      # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}

search = GridSearchCV(pipe, param_grid,
                      cv=5,
                      scoring='neg_root_mean_squared_error',
                      n_jobs=-1, verbose=1)
search.fit(X_train, y_train)

print("Best α:", search.best_params_['enet__alpha'])
print("Best l1_ratio:", search.best_params_['enet__l1_ratio'])
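
To get a feel for what l1_ratio does before running the full search, the sketch below fits Elastic Net on synthetic data (generated with make_regression, not the NYC file) and counts the surviving coefficients at three mixing values. The data sizes and α = 1.0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# 30 features, only 5 of which actually drive the target
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)

nonzero = {}
for ratio in (0.1, 0.5, 0.9):
    model = ElasticNet(alpha=1.0, l1_ratio=ratio, max_iter=20000)
    model.fit(X_demo, y_demo)
    nonzero[ratio] = int(np.count_nonzero(model.coef_))   # features kept

print(nonzero)   # maps l1_ratio -> number of non-zero coefficients
```

On data like this, the Lasso-heavy setting typically keeps fewer features than the Ridge-heavy one, which is the sparsity effect the grid search trades off against fit.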

7. Evaluate on the hold‑out set

RMSE gives planners a typical dollar error per project, with large misses weighted more heavily than small ones; R² shows the share of budget variance the model explains.

y_pred = search.predict(X_test)
# take the square root explicitly; the squared=False shortcut is deprecated in newer scikit-learn
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

print(f"Hold‑out RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
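
When reading the RMSE figure, it helps to remember that one badly missed project can dominate it. The made-up numbers below (budgets in $ millions, invented for illustration) compare RMSE with the mean absolute error on three small misses and one large one.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual    = np.array([10.0, 20.0, 30.0, 40.0])   # made-up budgets, $ millions
predicted = np.array([11.0, 19.0, 31.0, 49.0])   # three small misses, one large

mae_demo  = mean_absolute_error(actual, predicted)
rmse_demo = np.sqrt(mean_squared_error(actual, predicted))

print(f"MAE: {mae_demo:.2f}  RMSE: {rmse_demo:.2f}")   # MAE: 3.00  RMSE: 4.58
```

Reporting both numbers on the hold-out set is one way to tell whether the error is spread evenly or driven by a handful of very large projects.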

8. Interpret coefficients

Non-zero bars reveal high-cost agencies, borough premiums, and schedule-driven inflation, helping decision-makers focus audits on the priciest drivers.

# recover final feature names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])

# reverse scaling for numeric columns
scales = search.best_estimator_.named_steps['prep'] \
            .named_transformers_['num'].scale_
# copy so the in-place rescaling below does not alter the fitted model
coef = search.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scales

imp = (pd.Series(coef, index=feature_names)
         .sort_values(key=abs, ascending=False))

plt.figure(figsize=(9,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Capital‑Project Cost (Elastic Net Coefficients)')
plt.xlabel('Δ Budget (USD)'); plt.tight_layout(); plt.show()
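
Dividing a coefficient by the scaler's scale_ converts it from "dollars per standard deviation" to "dollars per original unit". The sketch below checks that algebra on one synthetic duration column; the $5,000-per-day relationship is invented for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
days = rng.uniform(100, 1000, size=(80, 1))                   # synthetic durations
cost = 5_000 * days[:, 0] + rng.normal(0, 50_000, size=80)    # made-up: ~$5,000 per day

scaler = StandardScaler().fit(days)
enet_demo = ElasticNet(alpha=0.01, max_iter=20000)
enet_demo.fit(scaler.transform(days), cost)

# convert the standardised coefficient back to raw units
slope_per_day = enet_demo.coef_[0] / scaler.scale_[0]
intercept_raw = enet_demo.intercept_ - slope_per_day * scaler.mean_[0]

# predictions rebuilt from raw-unit numbers match the model's own predictions
manual = intercept_raw + slope_per_day * days[:, 0]
print(np.allclose(manual, enet_demo.predict(scaler.transform(days))))   # True
print(f"estimated cost per extra day: ${slope_per_day:,.0f}")
```

One-hot coefficients need no such conversion: each is a dollar offset relative to the dropped baseline category.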

Summary

This Elastic Net pipeline converts raw municipal planning data into a dollar‑level cost forecast and a concise rank‑ordering of budget drivers. Planners can:

Benchmark early estimates against data‑driven predictions to catch under‑budgeting.
Spot which agencies, boroughs, or schedule durations inflate costs the most.
Refresh forecasts quarterly: after adding the latest project rows, a single call to .fit() retrains the entire workflow.
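
The benchmarking idea in the first point could look something like the sketch below. The helper flag_underbudgeted, the column names, the 0.8 threshold, and the three projects are all hypothetical; in practice model_prediction would come from search.predict on the projects' features.

```python
import pandas as pd

def flag_underbudgeted(frame, threshold=0.8):
    """Mark rows whose planner estimate is below threshold x model prediction (hypothetical helper)."""
    out = frame.copy()
    out["ratio"] = out["planner_estimate"] / out["model_prediction"]
    out["review"] = out["ratio"] < threshold
    return out

# made-up projects and dollar figures
projects = pd.DataFrame({
    "project": ["Library roof", "Pier repair", "Playground"],
    "planner_estimate": [2.0e6, 9.0e6, 0.5e6],
    "model_prediction": [2.1e6, 14.0e6, 0.45e6],
})

flagged = flag_underbudgeted(projects)
print(flagged.loc[flagged["review"], "project"].tolist())   # ['Pier repair']
```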

The result: fewer surprises, smarter funding decisions, and greater public trust in big‑ticket urban projects.
