Infrastructure Development Cost Prediction with ElasticNet Algorithm in ML
Governments and PPP lenders must estimate the capital cost per kilometre of new transport corridors (metro lines, light‑rail routes, expressways) while designs are still at the concept stage; this walkthrough measures it in USD million per km. Project scope variables such as route length, tunnel percentage, elevated share, station count, terrain class, approval year, and country cost indices are strongly collinear: tunnel segments appear mostly in steep terrain, and more recent projects are often longer and cheaper per kilometre owing to learning curves. Ordinary least‑squares produces unstable coefficients under this multicollinearity, and pure Lasso (ℓ¹) tends to keep one variable from a correlated group and shrink the rest to zero, which can discard useful features. Elastic Net combines Ridge’s ℓ² stability with Lasso’s sparsity to deliver a robust, interpretable model that forecasts capital cost from early‑stage inputs.
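The instability claim can be checked on toy data. The sketch below is only an illustration under stated assumptions: the two features (a tunnel share and a nearly identical terrain proxy), the cost formula, and the penalty strength are all made up, and a closed-form ℓ² (Ridge) penalty stands in for the stabilising half of Elastic Net rather than the full algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
tunnel = rng.uniform(0, 1, n)                 # hypothetical tunnel share
steep = tunnel + rng.normal(0, 0.02, n)       # nearly collinear terrain proxy
X = np.column_stack([tunnel, steep])
y = 100 * tunnel + 50 * steep + rng.normal(0, 10, n)  # synthetic cost per km

def fit(X, y, l2=0.0):
    """Least squares on centred data with an optional l2 (Ridge) penalty."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + l2 * np.eye(X.shape[1]), Xc.T @ yc)

def bootstrap_std(l2, n_boot=200):
    """Spread of the coefficients across bootstrap resamples of the rows."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        coefs.append(fit(X[idx], y[idx], l2))
    return np.std(coefs, axis=0)

ols_std = bootstrap_std(l2=0.0)
ridge_std = bootstrap_std(l2=1.0)
print("OLS coefficient std:  ", ols_std.round(1))
print("Ridge coefficient std:", ridge_std.round(1))
```

On data like this the unpenalised coefficients swing widely between resamples, while the penalised ones stay close together, which is the behaviour Elastic Net inherits from its ℓ² term.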
Libraries Required
| Task | Library |
| --- | --- |
| Data wrangling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| ML workflow | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Metrics | mean_squared_error, r2_score |
Dataset
Rail Transport Infrastructure Costs
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
2. Load and Prepare Data
df = pd.read_csv("rail_transport_projects.csv")  # file from Kaggle zip
# Keep rows with a known cost and a positive length so the ratio is finite
df = df.dropna(subset=['Cost_USD_millions', 'Length_km'])
df = df[df['Length_km'] > 0]
df['Cost_per_km'] = df['Cost_USD_millions'] / df['Length_km']  # target, USD million / km
y = df['Cost_per_km']
X = df[['Country', 'Mode', 'Terrain', 'Length_km',
        'Tunnel_%', 'Elevated_%', 'Stations', 'Year']]
cat_cols = ['Country', 'Mode', 'Terrain']
num_cols = ['Length_km', 'Tunnel_%', 'Elevated_%', 'Stations', 'Year']
3. Build Elastic Net Pipeline
Pre‑processing
- One‑hot encoding turns categorical descriptors (country, mode, terrain) into binary vectors.
- Numeric scope variables are z‑scaled so Elastic Net’s penalty treats them equally.
- All transformations are applied within each cross‑validation fold via a Pipeline, preventing information leakage.
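The leakage point can be made concrete with a hand-rolled version of what the scaler does inside one fold. This is a minimal sketch on made-up route lengths, not the pipeline's actual internals: the mean and standard deviation come from the training rows only and are then reused on the validation rows.

```python
import numpy as np

rng = np.random.default_rng(1)
length_km = rng.lognormal(mean=3.0, sigma=0.5, size=100)  # hypothetical route lengths
train, valid = length_km[:80], length_km[80:]

# Fit the scaling statistics on the training fold only ...
mu, sigma = train.mean(), train.std()   # population std, as StandardScaler uses

# ... then apply the same statistics to both folds.
train_z = (train - mu) / sigma
valid_z = (valid - mu) / sigma

print(f"train: mean={train_z.mean():.3f}, std={train_z.std():.3f}")
print(f"valid: mean={valid_z.mean():.3f}, std={valid_z.std():.3f}")
```

The training fold comes out with mean 0 and standard deviation 1 by construction; the validation fold generally does not, which is expected, because none of its values were allowed to influence the scaling.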
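preprocess = ColumnTransformer([
    # handle_unknown='ignore' (combined with drop, needs scikit-learn >= 1.0) encodes
    # categories unseen in a training fold as all zeros instead of raising an error
    ('categorical', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),
    ('numerical', StandardScaler(), num_cols)
])
pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20000, random_state=42))
])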
4. Train/Test Split and Hyper‑parameter Search
Elastic Net Tuning
- alpha sets overall shrinkage: larger values bias toward smaller coefficients, reducing variance.
- l1_ratio controls the mix of Ridge (which handles multicollinearity) and Lasso (which drives sparsity).
- A grid of 18 alphas × 9 mix ratios (162 models) is 5‑fold cross‑validated; the lowest RMSE combination is selected.
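To see how the two hyper‑parameters enter the fit, the sketch below writes out the penalised objective in the form documented for scikit‑learn's ElasticNet, with the intercept left out for brevity, and counts the size of the search. The helper name `elastic_net_objective` is made up for this illustration.

```python
import numpy as np

def elastic_net_objective(w, X, y, alpha, l1_ratio):
    """Squared-error loss plus the mixed l1/l2 penalty (no intercept term)."""
    n = len(y)
    resid = y - X @ w
    loss = resid @ resid / (2 * n)
    l1_penalty = alpha * l1_ratio * np.abs(w).sum()           # Lasso part: sparsity
    l2_penalty = 0.5 * alpha * (1 - l1_ratio) * (w @ w)       # Ridge part: stability
    return loss + l1_penalty + l2_penalty

# Tiny check: a perfect fit has zero loss, so only the penalty remains.
X_demo = np.eye(2)
y_demo = np.array([1.0, 1.0])
w_demo = np.array([1.0, 1.0])
demo_value = elastic_net_objective(w_demo, X_demo, y_demo, alpha=1.0, l1_ratio=0.5)

# Size of the search described above.
alphas = np.logspace(-3, 1, 18)
l1_ratios = np.linspace(0.1, 0.9, 9)
n_candidates = len(alphas) * len(l1_ratios)   # hyper-parameter combinations
n_fits = n_candidates * 5                     # one fit per combination per fold
print(demo_value, n_candidates, n_fits)
```

Setting l1_ratio to 1 leaves only the ℓ¹ term and setting it to 0 leaves only the ℓ² term, so the grid from 0.1 to 0.9 deliberately stays between the two extremes.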
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),   # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}
gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1).fit(X_train, y_train)
print("Best alpha:", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
5. Evaluate Model
y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # recent scikit-learn releases no longer accept squared=False
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.1f} million / km | R²: {r2:.3f}")
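For readers who want to see what the two reported metrics measure, here is a small hand computation on four made-up cost values; the numbers are purely illustrative and do not come from the dataset.

```python
import numpy as np

y_true = np.array([120.0, 80.0, 200.0, 150.0])  # made-up costs, USD million / km
y_hat = np.array([110.0, 95.0, 190.0, 160.0])   # made-up predictions

# RMSE: typical size of a prediction error, in the target's own units
rmse_demo = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: share of the variance in y_true that the predictions account for
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_demo:.2f} | R²: {r2_demo:.3f}")
```

Because RMSE is expressed in USD million per km, it can be compared directly with the cost figures planners already work with, whereas R² is unitless.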
6. Interpret Key Drivers
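Coefficient analysis converts the fitted weights back into cost terms. Typical results show that every additional 10 percentage points of tunnel share adds around $45 M / km, each extra station adds about $12 M / km, and hilly terrain dummies carry a premium of roughly $16 M / km relative to the dropped baseline category. The exact figures depend on the dataset and on the penalty selected by the grid search, but they give actionable insights for route optimisation and scope negotiation.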
prep = gs.best_estimator_.named_steps['prep']
ohe_names = prep.named_transformers_['categorical'].get_feature_names_out(cat_cols)
features = np.hstack([ohe_names, num_cols])  # same order as the ColumnTransformer output
scales = prep.named_transformers_['numerical'].scale_
# Copy so the de-scaling below does not overwrite the fitted model's coefficients
coef = gs.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scales  # de‑scale numerics
(pd.Series(coef, index=features)
.sort_values(key=abs, ascending=False)
.head(15)
.plot(kind='barh', figsize=(9,5)))
plt.gca().invert_yaxis()
plt.xlabel('Δ Cost (USD million / km)')
plt.title('Elastic Net – Top Cost Drivers')
plt.tight_layout(); plt.show()
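The de‑scaling step above divides each numeric coefficient by the standard deviation the scaler learned, which turns "cost change per standard deviation" into "cost change per original unit". The sketch below checks that conversion on synthetic data with plain least squares, where the identity is exact; with a penalty the fitted values differ, but the unit conversion works the same way. The feature names and the generating coefficients (4.5 and 12) are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
tunnel_pct = rng.uniform(0, 100, n)               # hypothetical tunnel share, %
stations = rng.integers(2, 30, n).astype(float)   # hypothetical station count
X = np.column_stack([tunnel_pct, stations])
y = 50 + 4.5 * tunnel_pct + 12 * stations + rng.normal(0, 20, n)  # synthetic cost

def ols_slopes(X, y):
    """Least-squares slopes with an intercept column (intercept discarded)."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0][1:]

mu, sigma = X.mean(axis=0), X.std(axis=0)
coef_scaled = ols_slopes((X - mu) / sigma, y)   # effect per standard deviation
coef_raw_units = coef_scaled / sigma            # effect per percentage point / per station
coef_direct = ols_slopes(X, y)                  # fitted on unscaled features

print("de-scaled:", coef_raw_units.round(2))
print("direct:   ", coef_direct.round(2))
print("per 10 points of tunnel share:", round(10 * coef_raw_units[0], 1))
```

Multiplying a de‑scaled coefficient by a scope change, such as 10 percentage points of tunnel share, gives the kind of per‑kilometre cost statement quoted in the interpretation.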
Summary
This short Elastic Net pipeline:
- Predicts infrastructure capital cost per kilometre from concept‑stage inputs, with hold‑out RMSE and R² quantifying how far estimates can be trusted.
- Handles multicollinearity while maintaining model sparsity and interpretability.
- Provides clear cost levers—tunnel percentage, stations, terrain class—so planners can refine alignments and defend funding requests with data‑backed clarity.