Infrastructure Development Cost Prediction with ElasticNet Algorithm in ML


Governments and PPP lenders must estimate the capital cost per kilometre of new transport corridors, such as metro lines, light‑rail routes and expressways, while designs are still at the concept stage. Project scope variables such as route length, tunnel percentage, elevated share, station count, terrain class, approval year, and country cost indices are strongly collinear: tunnel segments appear mostly in steep terrain, and more recent projects are often longer and cheaper per kilometre owing to learning curves. Ordinary least squares produces unstable coefficients under this multicollinearity, and pure Lasso (ℓ1 penalty) tends to keep one feature from a correlated group and discard the rest, even when the discarded features are informative. Elastic Net combines Ridge's ℓ2 stability with Lasso's sparsity, giving a more robust, interpretable model that forecasts capital cost from early‑stage inputs.
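The instability described above can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: the two predictors, the noise levels and the penalty settings are made up, and none of it comes from the rail-cost dataset. It fits ordinary least squares and Elastic Net to two nearly identical columns so the coefficients can be compared side by side.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
n = 200

# Two nearly identical predictors (hypothetical stand-ins for, say,
# tunnel share and a steep-terrain score).
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])

# The response depends only on the signal the two columns share.
y = 3.0 * x1 + rng.normal(scale=1.0, size=n)

ols = LinearRegression().fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=20000).fit(X, y)

print("OLS coefficients:        ", ols.coef_)
print("Elastic Net coefficients:", enet.coef_)
```

With predictors this close, OLS is free to assign a large weight to one column and offset it with the other, while the ℓ2 part of the Elastic Net penalty pulls the two weights toward each other. Re-running with a different seed is a quick way to see how much the OLS split moves.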

Libraries Required

Task	Library
Data wrangling	pandas, numpy
Visualisation	matplotlib, seaborn
ML workflow	scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Metrics	mean_squared_error, r2_score

Dataset

Rail Transport Infrastructure Costs

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

2. Load and Prepare Data

df = pd.read_csv("rail_transport_projects.csv")  # file from Kaggle zip
df = df.dropna(subset=['Cost_USD_millions', 'Length_km'])
df = df[df['Length_km'] > 0]  # guard against division by zero in the per-km target

df['Cost_per_km'] = df['Cost_USD_millions'] / df['Length_km']  # target
y = df['Cost_per_km']

X = df[['Country', 'Mode', 'Terrain', 'Length_km',
        'Tunnel_%', 'Elevated_%', 'Stations', 'Year']]

cat_cols = ['Country', 'Mode', 'Terrain']
num_cols = ['Length_km', 'Tunnel_%', 'Elevated_%', 'Stations', 'Year']

3. Build Elastic Net Pipeline

Pre‑processing

One‑hot encoding turns categorical descriptors (country, mode, terrain) into binary vectors.
Numeric scope variables are z‑scaled so Elastic Net’s penalty treats them equally.
All transformations are applied within each cross‑validation fold via a Pipeline, preventing information leakage.

preprocess = ColumnTransformer([
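    # handle_unknown='ignore' keeps prediction from failing on a category
    # (e.g. a country) that was absent from the training folds
    ('categorical', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),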
    ('numerical',  StandardScaler(),              num_cols)
])

pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20000, random_state=42))
])

4. Train/Test Split and Hyper‑parameter Search

Elastic Net Tuning

alpha sets overall shrinkage: larger values bias toward smaller coefficients, reducing variance.
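l1_ratio controls the mix of Ridge (which handles multicollinearity) and Lasso (which drives sparsity): values near 0 are Ridge‑like, values near 1 are Lasso‑like.
A grid of 18 alphas × 9 mix ratios (162 candidate settings) is 5‑fold cross‑validated, which amounts to 810 fits; the combination with the lowest cross‑validated RMSE is selected.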
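For reference, the two hyper‑parameters enter the objective that scikit‑learn's ElasticNet minimises as follows, writing α for alpha, ρ for l1_ratio, n for the number of training rows and w for the coefficient vector (these symbols are notation chosen here, not names from the code):

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Raising α scales both penalty terms together, while ρ only shifts weight between them, which is why the two are searched jointly on a grid.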

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),   # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}

gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1).fit(X_train, y_train)

print("Best alpha:", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
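Beyond the single best pair, the full cross‑validation table shows how sensitive the error is to each setting. The sketch below runs a much smaller search on synthetic data from `make_regression` purely to show the mechanics; `X_demo`, `demo_grid` and `demo_gs` are hypothetical names, and the same `cv_results_` handling should apply to the `gs` object above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; this is not the rail-cost dataset.
X_demo, y_demo = make_regression(n_samples=120, n_features=6,
                                 noise=10.0, random_state=0)

demo_grid = {
    'alpha':    np.logspace(-3, 1, 5),
    'l1_ratio': [0.1, 0.5, 0.9],
}

demo_gs = GridSearchCV(ElasticNet(max_iter=20000), demo_grid, cv=5,
                       scoring='neg_root_mean_squared_error').fit(X_demo, y_demo)

# The scorer is negated RMSE, so flip the sign to get a readable error.
results = (pd.DataFrame(demo_gs.cv_results_)
             .assign(cv_rmse=lambda d: -d['mean_test_score'])
             .sort_values('rank_test_score')
             [['param_alpha', 'param_l1_ratio', 'cv_rmse', 'std_test_score']])

print(results.head())
```

If the top few rows have nearly identical `cv_rmse`, the choice between them matters little, and a larger alpha (a simpler model) is a reasonable tie‑breaker.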

5. Evaluate Model

y_pred = gs.predict(X_test)
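# np.sqrt of the MSE works across scikit-learn versions; the squared=False
# argument is deprecated in recent releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))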
r2   = r2_score(y_test, y_pred)

print(f"Hold‑out RMSE: ${rmse:,.1f} million / km | R²: {r2:.3f}")

6. Interpret Key Drivers

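Coefficient analysis typically shows that every additional 10 % tunnel share adds around $45 M / km, each extra station increases cost by about $12 M / km, and hilly terrain dummies contribute a $16 M / km premium relative to the baseline terrain class dropped by the encoder. Figures of this kind depend on the dataset and on the selected hyper‑parameters, but they give actionable insights for route optimisation and scope negotiation.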

ohe_names = (gs.best_estimator_.named_steps['prep']
               .named_transformers_['categorical']
               .get_feature_names_out(cat_cols))
features = np.hstack([ohe_names, num_cols])

scales = gs.best_estimator_.named_steps['prep'] \
            .named_transformers_['numerical'].scale_
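# copy first: coef_ is the fitted model's own array, and editing it in place
# would silently change later predictions
coef = gs.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scales  # de‑scale numerics to per‑unit effects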

(pd.Series(coef, index=features)
     .sort_values(key=abs, ascending=False)
     .head(15)
     .plot(kind='barh', figsize=(9,5)))
plt.gca().invert_yaxis()
plt.xlabel('Δ Cost (USD million / km)')
plt.title('Elastic Net – Top Cost Drivers')
plt.tight_layout(); plt.show()
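The de‑scaling step used above (dividing a coefficient by the scaler's `scale_`) can be checked on a toy example. The data below are made up so that the true slope is known, and an unpenalised LinearRegression is swapped in for Elastic Net so the result is exact rather than shrunk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up data: y rises by exactly 2 for every raw unit of x.
x = np.array([[10.0], [20.0], [30.0], [40.0]])
y = 2.0 * x.ravel() + 5.0

scaler = StandardScaler().fit(x)
model = LinearRegression().fit(scaler.transform(x), y)

# Coefficient on the standardised feature: change in y per one standard deviation of x.
per_std_effect = model.coef_[0]

# Dividing by the feature's standard deviation gives the change per raw unit of x.
per_unit_effect = per_std_effect / scaler.scale_[0]

print(per_std_effect, per_unit_effect)
```

For a feature measured in percentage points, such as tunnel share, the per‑unit figure is the effect of one point, so the effect of a 10‑point change is ten times that value.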

Summary

This short Elastic Net pipeline:

Predicts infrastructure capital cost per kilometre from concept‑stage inputs, with accuracy checked on a hold‑out set (RMSE and R²).
Handles multicollinearity while maintaining model sparsity and interpretability.
Provides clear cost levers (tunnel percentage, stations, terrain class) so planners can refine alignments and defend funding requests with data‑backed evidence.
