Infrastructure Cost Prediction with Ridge & Lasso Mixed Regression in ML
Transport ministries and lenders sign multi-billion-dollar infrastructure contracts years before a single sleeper is laid. Valuing a rail line, highway, or metro extension accurately at that early stage is critical to avoiding cost overruns, bond downgrades, and political backlash. Ordinary least-squares regression yields unstable, high-variance coefficients when predictors are strongly correlated (line length and tunnel percentage, for example), while pure Lasso may drop terrain flags that matter. An Elastic Net model combines Ridge's stability with Lasso's automatic feature selection, giving a sparse yet robust estimator of project cost per kilometre (USD million/km).
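To make the multicollinearity point concrete, here is a minimal sketch on synthetic data (not the rail dataset). The two predictors, the noise levels, and the penalty values are illustrative assumptions. With two nearly identical columns, ordinary least squares can split the effect between them in large, offsetting ways, whereas Elastic Net tends to share it more evenly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, ElasticNet

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)                    # stand-in for a standardised predictor
x2 = x1 + rng.normal(scale=0.01, size=n)   # near-duplicate of x1 (strong collinearity)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(size=n)     # true combined effect is 3

ols = LinearRegression().fit(X_demo, y_demo)
enet_demo = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)

print("OLS coefficients:        ", ols.coef_)
print("Elastic Net coefficients:", enet_demo.coef_)
```

Comparing the two printed vectors across different random seeds gives a feel for how much each estimator's coefficients move when the data change slightly.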
Libraries Required
| Purpose | Python package |
| Data wrangling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| ML workflow | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Model metrics | mean_squared_error, r2_score |
Dataset Link
Rail Transport Infrastructure Costs
Step by Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
2. Download and load the dataset
The dataset covers 138 rail and metro projects from 14 countries and lists each project's mode (tram/metro/heavy rail), length, % tunnelled, elevated share, terrain, stations, and final cost.
# one‑time shell command (Kaggle API key required):
# kaggle datasets download -d sujaykapadnis/rail-transport-infrastructure-costs -p data --unzip
df = pd.read_csv("data/rail_transport_projects.csv") # adjust filename if needed
Typical columns:
['Project', 'Country', 'Mode', 'Year', 'Length_km', 'Tunnel_%', 'Elevated_%', 'Stations', 'Terrain', 'Cost_USD_millions']
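Because the column names above are only typical, it can help to check the loaded frame before going further. The sketch below is one way to do that; `missing_columns` and `EXPECTED_COLUMNS` are hypothetical names introduced here, and the toy frame stands in for the real CSV.

```python
import pandas as pd

# Column names as listed above; the real CSV may use different headers.
EXPECTED_COLUMNS = ['Project', 'Country', 'Mode', 'Year', 'Length_km',
                    'Tunnel_%', 'Elevated_%', 'Stations', 'Terrain',
                    'Cost_USD_millions']

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the frame."""
    present = {str(c).strip() for c in frame.columns}  # tolerate stray spaces in headers
    return [c for c in expected if c not in present]

# Toy frame standing in for the real file (note the stray space in ' Mode')
toy = pd.DataFrame(columns=['Project', 'Country', ' Mode', 'Length_km'])
print(missing_columns(toy))
```

An empty list means every expected column is present; otherwise, rename the CSV headers or adjust the list before building features.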
3. Quick EDA & target creation
The target is total cost divided by length, giving USD million per km so that short and long lines are comparable.
print(df.head())
df = df.dropna(subset=['Cost_USD_millions', 'Length_km'])  # ensure clean rows
# Predict cost normalised per km
df['Cost_per_km'] = df['Cost_USD_millions'] / df['Length_km']
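The per-kilometre division breaks down if a row has a zero length. As a hedged sketch, the toy frame below (made-up projects, not rows from the dataset) shows the same target creation with an extra guard that drops non-positive lengths first.

```python
import pandas as pd

toy = pd.DataFrame({
    'Project': ['A', 'B', 'C', 'D'],             # made-up projects
    'Length_km': [10.0, 25.0, 0.0, 8.0],
    'Cost_USD_millions': [1500.0, 2000.0, 300.0, None],
})

clean = toy.dropna(subset=['Cost_USD_millions', 'Length_km'])
clean = clean[clean['Length_km'] > 0].copy()     # a zero length would give an infinite rate
clean['Cost_per_km'] = clean['Cost_USD_millions'] / clean['Length_km']
print(clean[['Project', 'Cost_per_km']])
```

Project C is removed by the length guard and project D by the missing cost, leaving only rows with a finite cost per km.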
4. Define features & target
Categorical fields capture institutional effects (mode, country, terrain), and numeric fields quantify engineering scope (tunnel%, elevated%, stations).
y = df['Cost_per_km'] # USD million per km
X = df[['Mode','Country','Year','Length_km',
'Tunnel_%','Elevated_%','Stations','Terrain']]
cat_cols = ['Mode','Country','Terrain']
num_cols = [c for c in X.columns if c not in cat_cols]
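If the column list changes often, the categorical/numeric split can also be derived from dtypes instead of being typed by hand. This is an optional alternative, sketched on a toy frame with made-up values; `auto_cat` and `auto_num` are names introduced here.

```python
import pandas as pd

toy = pd.DataFrame({
    'Mode': ['Tram', 'Metro'], 'Country': ['FR', 'JP'], 'Terrain': ['Flat', 'Hilly'],
    'Year': [2015, 2019], 'Length_km': [12.5, 30.0], 'Stations': [18, 22],
})

auto_num = toy.select_dtypes(include='number').columns.tolist()  # ints and floats
auto_cat = toy.select_dtypes(exclude='number').columns.tolist()  # everything else
print(auto_cat, auto_num)
```

This only works as intended when numeric columns were parsed as numbers; a percentage stored as text such as "12%" would land in the categorical list.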
5. Pre‑processing & Elastic Net pipeline
One‑hot encoding and scaling are encapsulated, preventing leakage and keeping inference code one‑line simple.
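preprocess = ColumnTransformer([
# sparse_output replaces the `sparse` argument in scikit-learn >= 1.2;
# handle_unknown='ignore' stops a category missing from a training fold from raising at predict time
('cat', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False), cat_cols),
('num', StandardScaler(), num_cols)
])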
pipe = Pipeline([
('prep', preprocess),
('enet', ElasticNet(max_iter=20000, random_state=42))
])
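To illustrate the one-line inference mentioned above, here is a self-contained sketch that fits a reduced version of the pipeline on synthetic data and scores a single new project. The three columns, the data-generating formula, and the penalty values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(1)
n = 60
toy = pd.DataFrame({                                   # synthetic stand-in data
    'Mode': rng.choice(['Tram', 'Metro', 'Heavy rail'], size=n),
    'Tunnel_%': rng.uniform(0, 100, size=n),
    'Stations': rng.integers(2, 30, size=n),
})
toy_y = 50 + 2.0 * toy['Tunnel_%'] + rng.normal(scale=5, size=n)

toy_pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Mode']),
        ('num', StandardScaler(), ['Tunnel_%', 'Stations']),
    ])),
    ('enet', ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=20000)),
])
toy_pipe.fit(toy, toy_y)   # encoder and scaler statistics come from the fitted rows only

new_project = pd.DataFrame({'Mode': ['Metro'], 'Tunnel_%': [40.0], 'Stations': [12]})
toy_pred = toy_pipe.predict(new_project)   # one call: encode, scale, predict
print(toy_pred)
```

Because the preprocessing lives inside the pipeline, cross-validation refits the encoder and scaler on each training fold, which is what keeps validation rows from leaking into the scaling statistics.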
6. Train/test split & hyper‑parameter grid search
- α (alpha) controls total shrinkage.
- l1_ratio tilts towards Ridge (0 ≈ keep all) or Lasso (1 ≈ drop many).
- A 20×9 grid (180 combinations) is scored with five-fold CV, i.e. 900 model fits, and the combination with the lowest cross-validated RMSE is kept.
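For reference, the two hyper-parameters in the list above enter the objective that scikit-learn's ElasticNet minimises as follows, where n is the number of training rows, w the coefficient vector, and ρ stands for l1_ratio (the symbol ρ is notation introduced here):

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  \;+\; \alpha \, \rho \, \lVert w \rVert_1
  \;+\; \frac{\alpha \, (1 - \rho)}{2} \, \lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the L1 (Lasso) term and ρ = 0 leaves only the L2 (Ridge) term, while α scales both penalties together.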
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
param_grid = {
'enet__alpha': np.logspace(-3, 1, 20), # penalty strength 0.001→10
'enet__l1_ratio': np.linspace(0.1, 0.9, 9) # Ridge‑heavy → Lasso‑heavy
}
search = GridSearchCV(pipe,
param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1,
verbose=1)
search.fit(X_train, y_train)
print("Best α:", search.best_params_['enet__alpha'])
print("Best l1_ratio:", search.best_params_['enet__l1_ratio'])
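Beyond the single best pair, the fitted search object also stores the score of every grid point in `cv_results_`. The sketch below shows one way to rank them; it runs on synthetic data with a deliberately tiny grid, and the names `toy_search` and `cv_rmse` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(80, 4))                        # synthetic features
y_toy = X_toy @ np.array([5.0, 0.0, -3.0, 0.0]) + rng.normal(size=80)

toy_grid = {'alpha': [0.01, 0.1, 1.0], 'l1_ratio': [0.2, 0.8]}
toy_search = GridSearchCV(ElasticNet(max_iter=20000), toy_grid, cv=5,
                          scoring='neg_root_mean_squared_error')
toy_search.fit(X_toy, y_toy)

results = pd.DataFrame(toy_search.cv_results_)
results['cv_rmse'] = -results['mean_test_score']        # scorer is negated RMSE, so flip the sign
cols = ['param_alpha', 'param_l1_ratio', 'cv_rmse', 'std_test_score']
print(results.sort_values('cv_rmse')[cols].head())
```

If several combinations sit within one standard deviation of the best score, preferring the more heavily regularised one is a common and defensible choice for a dataset this small.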
7. Evaluate on the hold‑out set
RMSE gives planners the typical prediction error in USD million per km; R² shows the share of variance explained on unseen projects.
y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # the squared=False shortcut was removed in recent scikit-learn releases
r2 = r2_score(y_test, y_pred)
print(f"RMSE: ${rmse:,.2f} million/km | R²: {r2:.3f}")
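As a worked example of what the two metrics measure, the sketch below computes them by hand with NumPy on three made-up values (not results from the rail dataset).

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])    # made-up costs, USD million/km
y_hat = np.array([12.0, 18.0, 33.0])     # made-up predictions

residuals = y_true - y_hat                              # -2, 2, -3
rmse_manual = np.sqrt(np.mean(residuals ** 2))          # sqrt(17 / 3)
ss_res = np.sum(residuals ** 2)                         # 4 + 4 + 9 = 17
ss_tot = np.sum((y_true - y_true.mean()) ** 2)          # 100 + 0 + 100 = 200
r2_manual = 1 - ss_res / ss_tot                         # 1 - 17/200

print(f"RMSE: {rmse_manual:.3f} | R²: {r2_manual:.3f}")
```

RMSE is in the same units as the target and penalises large misses more than small ones, while R² compares the model against always predicting the mean.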
8. Inspect feature importance
Non‑zero coefficients reveal that, e.g., every +10 % tunnel share adds ≈ $42 million/km, while trams (baseline) are far cheaper than metros in steep terrain.
# Recover full feature names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
# Reverse scaling for numeric columns
scales = search.best_estimator_.named_steps['prep'] \
.named_transformers_['num'].scale_
coefs = search.best_estimator_.named_steps['enet'].coef_.copy()  # copy, so rescaling below does not alter the fitted model
coefs[-len(num_cols):] = coefs[-len(num_cols):] / scales
imp = (pd.Series(coefs, index=feature_names)
.sort_values(key=abs, ascending=False))
plt.figure(figsize=(9,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Infrastructure Cost (Elastic Net Coefficients)')
plt.xlabel('Δ Cost ($ million/km)')
plt.tight_layout()
plt.show()
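The division by `scale_` in the code above converts a coefficient from "per standard deviation" to "per original unit". The sketch below checks that arithmetic on made-up numbers; it uses an unpenalised LinearRegression rather than Elastic Net so that the recovered slope is exact.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

tunnel_pct = np.array([[0.0], [10.0], [20.0], [40.0], [80.0]])   # made-up values
cost = 4.2 * tunnel_pct.ravel() + 100.0                          # exact line with slope 4.2

scaler = StandardScaler().fit(tunnel_pct)
model = LinearRegression().fit(scaler.transform(tunnel_pct), cost)

per_sd = model.coef_[0]                  # effect of a one-standard-deviation change
per_unit = per_sd / scaler.scale_[0]     # effect of a one-percentage-point change
print(per_sd, per_unit)
```

With a penalised model the same conversion applies, but the recovered per-unit effect is shrunk relative to the unpenalised slope.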
Summary
This end‑to‑end Elastic Net workflow converts raw rail‑project sheets into a defensible per‑kilometre cost forecast and a clear ranking of budget drivers. Infrastructure banks and municipalities can now:
- Stress‑test feasibility studies against data‑driven benchmarks.
- Pinpoint which scope factors (tunnels, stations, terrain) hurt budgets most.
- Update predictions quickly when new projects complete: append the new rows to the dataset and rerun .fit().
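Refreshing the model with newly completed projects could look like the following sketch. The rows are made up and a bare ElasticNet on one feature stands in for the full pipeline; in the workflow above the equivalent step would be refitting the grid search on the extended data.

```python
import pandas as pd
from sklearn.linear_model import ElasticNet

history = pd.DataFrame({'Tunnel_%': [0.0, 20.0, 50.0, 80.0],         # made-up past projects
                        'Cost_per_km': [60.0, 110.0, 200.0, 290.0]})
fresh = pd.DataFrame({'Tunnel_%': [35.0], 'Cost_per_km': [150.0]})   # newly completed project

combined = pd.concat([history, fresh], ignore_index=True)            # append, do not overwrite
refit = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=20000)
refit.fit(combined[['Tunnel_%']], combined['Cost_per_km'])
print(len(combined), refit.coef_)
```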
With fewer than 150 lines of Python, urban‑rail cost estimation moves from gut feel to transparent, evidence‑based planning.