Infrastructure Cost Prediction with Ridge & Lasso Mixed Regression in ML

Transport ministries and lenders sign multi-billion-dollar infrastructure contracts years before a single sleeper is laid. Valuing a rail line, highway, or metro extension accurately at an early stage is critical to avoiding cost overruns, bond downgrades, and political backlash. Ordinary least squares becomes unstable under multicollinearity (line length and tunnel percentage, for example, can be strongly correlated), so its coefficients may swing wildly from one sample to the next, while pure Lasso may drop terrain flags that genuinely matter. An Elastic Net model, which combines Ridge's stability with Lasso's automatic feature selection, provides a sparse yet robust estimator of project cost per kilometre (USD million/km).
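To make the multicollinearity point concrete, here is a minimal sketch that fits ordinary least squares and Elastic Net to synthetic data with two nearly identical predictors. The data, the variable names, and the penalty settings (alpha=1.0, l1_ratio=0.5) are illustrative assumptions and are unrelated to the rail dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, ElasticNet

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)      # near-duplicate of x1 (synthetic collinearity)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(size=n)   # true effect comes from x1 only

ols = LinearRegression().fit(X_demo, y_demo)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=20000).fit(X_demo, y_demo)

print("OLS coefficients:        ", ols.coef_)
print("Elastic Net coefficients:", enet.coef_)
```

With predictors this similar, OLS typically splits the effect into large coefficients of opposite sign, whereas the penalised fit keeps them small. The exact numbers depend on the random seed.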

Libraries Required

Purpose	Python package
Data wrangling	pandas, numpy
Visualisation	matplotlib, seaborn
ML workflow	scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Model metrics	mean_squared_error, r2_score

Dataset Link

Rail Transport Infrastructure Costs 

Step by Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

2. Download and load the dataset

The dataset covers 138 rail and metro projects from 14 countries and lists, for each one, the mode (tram/metro/heavy rail), length, % tunnelled, % elevated, terrain, number of stations, and final cost.

# one‑time shell command (Kaggle API key required):
# kaggle datasets download -d sujaykapadnis/rail-transport-infrastructure-costs -p data --unzip

df = pd.read_csv("data/rail_transport_projects.csv")      # adjust filename if needed

Typical columns:

['Project', 'Country', 'Mode', 'Year', 'Length_km', 'Tunnel_%', 'Elevated_%', 'Stations', 'Terrain', 'Cost_USD_millions']

3. Quick EDA & target creation

Target cost is divided by length to produce USD million per km, standardising across short and long lines.

print(df.head())
df = df.dropna(subset=['Cost_USD_millions','Length_km'])     # ensure clean rows

# Predict cost normalised per km
df['Cost_per_km'] = df['Cost_USD_millions'] / df['Length_km']
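The target creation can be checked by eye on a hand-made frame. This is a minimal sketch: the three rows are hypothetical, and the filter on non-positive lengths is an added safeguard against division by zero that the code above does not include.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Project": ["A", "B", "C"],
    "Length_km": [10.0, 25.0, 0.0],              # a zero length would divide by zero
    "Cost_USD_millions": [500.0, np.nan, 80.0],  # project B has no recorded cost
})

clean = toy.dropna(subset=["Cost_USD_millions", "Length_km"])
clean = clean[clean["Length_km"] > 0].copy()     # assumption: drop non-positive lengths
clean["Cost_per_km"] = clean["Cost_USD_millions"] / clean["Length_km"]
print(clean)
```

Only project A survives both filters, with a cost of 500 / 10 = 50 USD million per km.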

4. Define features & target

Categorical fields (mode, country, terrain) capture institutional and site context, while numeric fields (year, length, tunnel %, elevated %, stations) quantify timing and engineering scope.

y = df['Cost_per_km']           # USD million per km

X = df[['Mode','Country','Year','Length_km',
        'Tunnel_%','Elevated_%','Stations','Terrain']]

cat_cols = ['Mode','Country','Terrain']
num_cols = [c for c in X.columns if c not in cat_cols]

5. Pre‑processing & Elastic Net pipeline

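One-hot encoding and scaling are encapsulated in the pipeline, so during cross-validation they are fitted on the training folds only. This prevents leakage and reduces inference to a single predict call on raw columns.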

preprocess = ColumnTransformer([
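        # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2);
        # handle_unknown='ignore' stops CV folds failing on categories unseen in training
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False), cat_cols),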
        ('num', StandardScaler(), num_cols)
    ])

pipe = Pipeline([
        ('prep', preprocess),
        ('enet', ElasticNet(max_iter=20000, random_state=42))
    ])
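As a rough illustration of the "one-line inference" point, the sketch below fits a small pipeline of the same shape and then predicts for a single new project. The six training rows, the `monorail` example, and the fixed alpha and l1_ratio values are hypothetical. The encoder here omits drop='first' and uses handle_unknown='ignore', which is an assumption made to keep the example tolerant of unseen categories.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet

# Hypothetical training rows (not taken from the dataset)
train = pd.DataFrame({
    "Mode": ["metro", "tram", "metro", "heavy rail", "tram", "metro"],
    "Tunnel_%": [80, 0, 60, 10, 5, 90],
    "Stations": [12, 20, 9, 4, 15, 14],
    "Cost_per_km": [210.0, 35.0, 160.0, 40.0, 30.0, 250.0],
})
cat_demo, num_demo = ["Mode"], ["Tunnel_%", "Stations"]

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        # assumption: ignore categories that were not seen during fitting
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_demo),
        ("num", StandardScaler(), num_demo),
    ])),
    ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=20000)),
])
demo_pipe.fit(train[cat_demo + num_demo], train["Cost_per_km"])

# Inference is a single call on raw, unscaled, unencoded columns
new_project = pd.DataFrame([{"Mode": "monorail", "Tunnel_%": 50, "Stations": 10}])
pred = demo_pipe.predict(new_project)
print(np.round(pred, 1))
```

Because the encoder and scaler live inside the pipeline, the new row needs no manual preprocessing before the predict call.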

6. Train/test split & hyper‑parameter grid search

α (alpha) controls total shrinkage.
l1_ratio tilts towards Ridge (0 ≈ keep all) or Lasso (1 ≈ drop many).
A 20×9 grid (180 combos) finds the sweet spot via five‑fold CV.
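For reference, the objective that scikit-learn's ElasticNet minimises, as stated in the library documentation, makes both knobs explicit. Here n is the number of training rows, w the coefficient vector, α is alpha, and ρ is l1_ratio:

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the L1 (Lasso) term, and ρ = 0 leaves only the squared L2 (Ridge) term.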

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

param_grid = {
    'enet__alpha': np.logspace(-3, 1, 20),      # penalty strength 0.001→10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}

search = GridSearchCV(pipe,
                      param_grid,
                      cv=5,
                      scoring='neg_root_mean_squared_error',
                      n_jobs=-1,
                      verbose=1)
search.fit(X_train, y_train)

print("Best α:", search.best_params_['enet__alpha'])
print("Best l1_ratio:", search.best_params_['enet__l1_ratio'])
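The effect of alpha on sparsity can be seen in a small experiment. This is a sketch on synthetic data in which only two of eight features matter. The data, the three alpha values, and l1_ratio=0.9 are illustrative assumptions, not settings from the grid above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
X_syn = rng.normal(size=(200, 8))   # 8 candidate features
# only the first two features drive the target; the rest are noise
y_syn = 5.0 * X_syn[:, 0] - 3.0 * X_syn[:, 1] + rng.normal(size=200)

nonzero = {}
for alpha in (0.001, 0.3, 100.0):
    model = ElasticNet(alpha=alpha, l1_ratio=0.9, max_iter=20000).fit(X_syn, y_syn)
    nonzero[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha:<7} non-zero coefficients: {nonzero[alpha]}")
```

A very large alpha shrinks every coefficient to zero, while a small one keeps at least the two real drivers. Cross-validation searches between these extremes.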

7. Evaluate on the hold‑out set

RMSE gives planners the typical prediction error in USD million per km; R² shows the share of variance explained on unseen projects.

y_pred = search.predict(X_test)
# np.sqrt works across scikit-learn versions (the squared= argument is deprecated in recent releases)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

print(f"RMSE: ${rmse:,.2f} million/km | R²: {r2:.3f}")
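Both metrics are simple enough to verify by hand. The sketch below computes them with plain NumPy on three hypothetical projects, with made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np

y_true_demo = np.array([10.0, 20.0, 30.0])   # hypothetical actual costs, USD million/km
y_hat_demo = np.array([12.0, 18.0, 33.0])    # hypothetical predictions

residuals = y_true_demo - y_hat_demo                      # [-2, 2, -3]
ss_res = np.sum(residuals ** 2)                           # 4 + 4 + 9 = 17
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # 100 + 0 + 100 = 200

rmse_demo = np.sqrt(ss_res / len(residuals))              # sqrt(17 / 3)
r2_demo = 1 - ss_res / ss_tot                             # 1 - 17/200

print(f"RMSE: {rmse_demo:.3f} | R²: {r2_demo:.3f}")       # RMSE: 2.380 | R²: 0.915
```

Because the errors are squared before averaging, one large miss raises RMSE more than several small ones.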

8. Inspect feature importance

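Non-zero coefficients reveal that, e.g., every +10 percentage points of tunnel share adds ≈ $42 million/km, while trams are far cheaper than metros. Coefficients on one-hot columns are measured relative to the category dropped by drop='first', and because the model has no interaction terms, mode and terrain effects are additive rather than combined.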

# Recover full feature names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])

# Reverse scaling for numeric columns
# (copy first so the fitted model's coef_ is not modified in place)
scales = search.best_estimator_.named_steps['prep'] \
            .named_transformers_['num'].scale_
coefs = search.best_estimator_.named_steps['enet'].coef_.copy()
coefs[-len(num_cols):] = coefs[-len(num_cols):] / scales

imp = (pd.Series(coefs, index=feature_names)
         .sort_values(key=abs, ascending=False))

plt.figure(figsize=(9,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Infrastructure Cost (Elastic Net Coefficients)')
plt.xlabel('Δ Cost ($ million/km)')
plt.tight_layout()
plt.show()
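The reason for dividing numeric coefficients by the scaler's scale_ can be shown on a single feature. In this sketch the data are hypothetical, and LinearRegression is swapped in for Elastic Net so that no shrinkage blurs the arithmetic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

tunnel_pct = np.array([[0.0], [20.0], [40.0], [60.0], [80.0]])  # hypothetical tunnel shares
cost = 4.0 * tunnel_pct.ravel() + 30.0                          # exact slope: 4 per point

scaler = StandardScaler().fit(tunnel_pct)
lin = LinearRegression().fit(scaler.transform(tunnel_pct), cost)

per_std = lin.coef_[0]                  # effect of one standard deviation of tunnel share
per_unit = per_std / scaler.scale_[0]   # effect of one percentage point
print(per_std, per_unit)
```

The model trained on standardised inputs reports the effect per standard deviation. Dividing by scale_ recovers the slope of 4 in the feature's original units, which is what the x-axis label of the chart assumes.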

Summary

This end‑to‑end Elastic Net workflow converts raw rail‑project sheets into a defensible per‑kilometre cost forecast and a clear ranking of budget drivers. Infrastructure banks and municipalities can now:

Stress‑test feasibility studies against data‑driven benchmarks.
Pinpoint which scope factors (tunnels, stations, terrain) hurt budgets most.
Update predictions when new projects complete: append the fresh rows and re-run .fit().

With fewer than 150 lines of Python, urban‑rail cost estimation moves from gut feel to transparent, evidence‑based planning.
