Urban Planning Cost Prediction with Ridge & Lasso Mixed Regression in ML
City planners must estimate the final construction costs of parks, libraries, roads, and other capital assets well before shovels hit the ground. Scope creep, site constraints, and agency practices all push budgets off course. Accurate early estimates let planners size bonds correctly, phase work sensibly, and avoid embarrassing overruns.
We will build an Elastic Net regression model—combining Ridge’s stability with Lasso’s sparsity—to predict a project’s updated budget (USD) from publicly available descriptors: managing agency, project type, borough, start year, design phase length, planned duration, and more. The mixed penalty controls collinearity among time‑related fields while discarding weak predictors, delivering a lean, interpretable model.
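For reference, the mixed penalty can be written out explicitly. The expression below is the objective that scikit-learn's ElasticNet minimizes; the symbols are notation chosen here for illustration: w is the coefficient vector, n the number of training rows, α the overall penalty strength, and ρ the l1_ratio.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

The ℓ¹ term is what can set weak coefficients exactly to zero, while the ℓ² term spreads weight across correlated predictors instead of picking one arbitrarily. The intercept is fitted separately and is not penalized.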
Libraries Required
| Goal | Library |
| --- | --- |
| Data handling | pandas, numpy |
| Visuals | matplotlib, seaborn |
| ML pipeline | scikit‑learn: ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Metrics | scikit‑learn: mean_squared_error, r2_score |
Dataset Link
NYC Capital Project Schedules and Budgets
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
2. Download and load the dataset
New York City’s capital‑project ledger lists thousands of park, road, school, and waterfront jobs with planned vs. current budgets and schedules.
# one‑time terminal command (Kaggle API key required):
# kaggle datasets download -d new-york-city/nyc-capital-project-schedules-and-budgets -p data --unzip
df = pd.read_csv("data/Capital_Projects_PDB_Latest.csv") # filename inside the zip
3. Basic clean‑up & target creation
- Target: Current_Budget is the most recent cost projection; predicting it from early‑life descriptors simulates real‑world estimating.
- Features: managing agency, project type, and borough (categorical), plus durations and start year (numeric). These are known shortly after scoping, well before overruns occur.
# keep only rows with a positive, non-missing numeric budget
budget = pd.to_numeric(df['Current_Budget'], errors='coerce') # non-numeric entries become NaN
keep = df[budget > 0].assign(Current_Budget=budget)
# choose a few intuitive predictors
cols = ['Managing Agency', 'Project Type', 'Borough', # categorical
'Original_Schedule_Duration', 'Design_Duration', # numeric
'Construction_Duration', 'Calendar_Year_Started'] # numeric
model_df = keep[cols + ['Current_Budget']].dropna()
4. Define features (X) and target (y)
y = model_df['Current_Budget'] # USD we aim to predict
X = model_df.drop(columns='Current_Budget')
cat_cols = ['Managing Agency', 'Project Type', 'Borough']
num_cols = [c for c in X.columns if c not in cat_cols]
5. Build an Elastic Net pipeline
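Because one-hot encoding and scaling live inside the pipeline, they are re-fitted on the training portion of every CV fold. This keeps validation-fold statistics out of the preprocessing (no data leakage) and keeps the code tidy.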
preprocess = ColumnTransformer([
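# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2);
# handle_unknown='ignore' encodes categories unseen during fitting as all zeros instead of raising
('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), cat_cols),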
('num', StandardScaler(), num_cols)
])
pipe = Pipeline([
('prep', preprocess),
('enet', ElasticNet(max_iter=20000, random_state=42))
])
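To make the leakage point concrete, here is a minimal sketch in plain Python that mimics what StandardScaler does inside and outside a fold. The duration values are made up for illustration and do not come from the NYC dataset.

```python
from statistics import mean, pstdev

# Made-up planned durations in days; the last row plays the validation fold.
durations = [120.0, 200.0, 340.0, 410.0, 1500.0]
train, valid = durations[:4], durations[4:]

# Leaky: scaling statistics computed on ALL rows, validation row included.
mu_all, sd_all = mean(durations), pstdev(durations)

# Leak-free: statistics computed on the training rows only,
# which is what a Pipeline does within each CV fold.
mu_tr, sd_tr = mean(train), pstdev(train)

z_leaky = (valid[0] - mu_all) / sd_all
z_clean = (valid[0] - mu_tr) / sd_tr

print(f"leaky z-score: {z_leaky:.2f}")
print(f"clean z-score: {z_clean:.2f}")
```

In the leaky version the unusually long project has already pulled the mean and standard deviation toward itself, so it looks far less extreme than it would to a model that had never seen it. That makes cross-validated scores look better than they should.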
6. Train/test split & hyper‑parameter search
- α governs total penalty strength (higher α → more shrinkage).
- l1_ratio slides between Ridge (0 = pure ℓ²) and Lasso (1 = pure ℓ¹).
Scanning all 180 (20 × 9) combinations with 5-fold cross-validation lets the data choose the balance between bias, variance, and sparsity.
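The effect of the two knobs is easiest to see on a single coefficient. The sketch below assumes one standardized feature whose unpenalized coefficient would be z; in that special case the Elastic Net solution has a closed form (soft-thresholding followed by shrinkage). The function names and the numbers are illustrative, not part of scikit-learn.

```python
def soft_threshold(z, t):
    """Move z toward zero by t; return 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def enet_single_coef(z, alpha, l1_ratio):
    """Closed-form Elastic Net coefficient for one standardized feature."""
    l1 = alpha * l1_ratio          # Lasso part: can zero the coefficient
    l2 = alpha * (1.0 - l1_ratio)  # Ridge part: shrinks but never zeroes
    return soft_threshold(z, l1) / (1.0 + l2)


strong, weak = 2.0, 0.3  # hypothetical unpenalized coefficients
for l1_ratio in (0.1, 0.5, 0.9):
    s = enet_single_coef(strong, alpha=0.5, l1_ratio=l1_ratio)
    w = enet_single_coef(weak, alpha=0.5, l1_ratio=l1_ratio)
    print(f"l1_ratio={l1_ratio}: strong -> {s:.3f}, weak -> {w:.3f}")
```

At l1_ratio = 0.9 the weak predictor is dropped entirely while the strong one survives, which is the sparsity behaviour the grid search is trading off against Ridge-style shrinkage.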
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
param_grid = {
'enet__alpha': np.logspace(-3, 1, 20), # 0.001 → 10
'enet__l1_ratio': np.linspace(0.1, 0.9, 9) # Ridge‑heavy → Lasso‑heavy
}
search = GridSearchCV(pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1)
search.fit(X_train, y_train)
print("Best α:", search.best_params_['enet__alpha'])
print("Best l1_ratio:", search.best_params_['enet__l1_ratio'])
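As a rough guide to the compute cost of this search, the grid can be enumerated in plain Python. The lists below mirror np.logspace(-3, 1, 20) and np.linspace(0.1, 0.9, 9); the fit count assumes GridSearchCV's default refit=True, which trains one extra model on the full training set.

```python
from itertools import product

# Mirrors np.logspace(-3, 1, 20): 20 values from 0.001 to 10
alphas = [10 ** (-3 + 4 * i / 19) for i in range(20)]
# Mirrors np.linspace(0.1, 0.9, 9): 0.1, 0.2, ..., 0.9
l1_ratios = [round(0.1 + 0.1 * i, 1) for i in range(9)]

grid = list(product(alphas, l1_ratios))
cv_folds = 5

n_cv_fits = len(grid) * cv_folds  # one fit per combination per fold
n_total_fits = n_cv_fits + 1      # plus the final refit on all training rows

print(f"{len(grid)} combinations, {n_cv_fits} CV fits, {n_total_fits} fits in total")
```

If runtime becomes a problem, a coarser grid or fewer folds cuts the number of fits proportionally.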
7. Evaluate on the hold‑out set
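RMSE is expressed in dollars and gives the typical size of a prediction error, weighting large misses more heavily than a plain average would; R² shows the share of budget variance the model explains.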
y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred)) # squared=False is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
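When reading the RMSE, it helps to compare it with the mean absolute error (MAE) on the same predictions. The following sketch uses four invented projects, one of them a large outlier, to show how a single big miss dominates RMSE; none of these figures come from the NYC data.

```python
import math

# Invented budgets in USD: three small projects and one very large one.
actual = [1_000_000, 2_000_000, 3_000_000, 50_000_000]
predicted = [1_500_000, 1_500_000, 3_500_000, 40_000_000]

errors = [p - a for p, a in zip(predicted, actual)]

mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  ${mae:,.0f}")
print(f"RMSE: ${rmse:,.0f}")
```

Here three of the four errors are $500,000, yet the RMSE lands near $5 million because of the one $10 million miss. Capital budgets are typically heavy-tailed, so reporting both metrics gives planners a fairer picture.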
8. Interpret coefficients
Non-zero bars point to the agencies, boroughs, and schedule lengths associated with higher budgets, each measured relative to the dropped baseline category of its field. This helps decision-makers focus audits on the priciest drivers.
# recover final feature names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
# reverse scaling for numeric columns
scales = search.best_estimator_.named_steps['prep'] \
.named_transformers_['num'].scale_
coef = search.best_estimator_.named_steps['enet'].coef_.copy() # copy so the fitted model is not modified in place
coef[-len(num_cols):] = coef[-len(num_cols):] / scales # USD per original unit (day, year)
imp = (pd.Series(coef, index=feature_names)
.sort_values(key=abs, ascending=False))
plt.figure(figsize=(9,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Capital‑Project Cost (Elastic Net Coefficients)')
plt.xlabel('Δ Budget (USD)'); plt.tight_layout(); plt.show()
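The division by scale_ above converts a coefficient from "USD per standard deviation" into "USD per original unit". The sketch below checks that identity on made-up numbers using an ordinary least-squares slope, which is a simplification: Elastic Net adds a penalty, but the unit conversion works the same way.

```python
from statistics import mean, pstdev


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Made-up data: construction duration in days and budget in USD.
days = [100.0, 200.0, 300.0, 400.0, 500.0]
budget = [1.2e6, 1.9e6, 3.1e6, 4.2e6, 4.9e6]

sd = pstdev(days)  # StandardScaler also uses the population standard deviation
z = [(d - mean(days)) / sd for d in days]

slope_per_sd = ols_slope(z, budget)      # USD per one standard deviation
slope_per_day = ols_slope(days, budget)  # USD per day

print(f"per SD:  ${slope_per_sd:,.0f}")
print(f"per day: ${slope_per_day:,.0f}  (per-SD slope / sd = ${slope_per_sd / sd:,.0f})")
```

One consequence for the bar chart: after un-scaling, numeric bars are in USD per day or per year while categorical bars are in USD per category, so their lengths are not directly comparable.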
Summary
This Elastic Net pipeline converts raw municipal planning data into a dollar‑level cost forecast and a concise rank‑ordering of budget drivers. Planners can:
- Benchmark early estimates against data‑driven predictions to catch under‑budgeting.
- Spot which agencies, boroughs, or schedule durations inflate costs the most.
- Refresh forecasts quarterly: add the latest project rows and call .fit() again to retrain the entire workflow.
The result: fewer surprises, smarter funding decisions, and greater public trust in big‑ticket urban projects.