Environmental Cleanup Cost Prediction using Quantile Regression
Environmental agencies and remediation contractors must budget for a wide range of site cleanup costs—from relatively minor soil‑removal jobs (25th percentile) to large‑scale, complex Superfund projects (75th percentile).
A single average cost estimate obscures tail risk and can lead to under‑ or over‑provisioning of funds. Here, we’ll predict the 25th, 50th, and 75th percentiles of cleanup cost (USD) for New York State remediation sites based on site attributes—such as contaminant concentration, contamination area, depth to groundwater, proximity to water bodies, contaminant type, and land‑use classification—by fitting separate quantile regression models.
This distribution‑aware approach equips decision‑makers to allocate budgets conservatively, plan around typical expenditures, and reserve contingency funds for high‑cost cleanups.
Libraries Required
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf  # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts
Dataset
NYS Environmental Remediation Sites
Step-by-Step Code Implementation
Load & Inspect Data
We import the NYS remediation sites data—containing site metrics and EstimatedCleanupCost—and inspect its structure, types, and cost distribution (.describe()) to gauge skew and range.
# Load the NYS Environmental Remediation Sites dataset
# Source: Kaggle
df = pd.read_csv("NYS_Environmental_Remediation_Sites.csv")
# Quick inspection
print(df.head())
print(df.info())
print(df['EstimatedCleanupCost'].describe())
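To see why the skew check matters, here is a minimal sketch on synthetic data (a log-normal sample standing in for right-skewed costs; the parameters are illustrative assumptions, not values from the NYS dataset). It compares the mean with the quartiles and reports how many observations fall below the mean.

```python
import numpy as np

# Synthetic, right-skewed "costs" (log-normal); illustrative only, not NYS data
rng = np.random.default_rng(0)
costs = rng.lognormal(mean=12.0, sigma=1.0, size=10_000)

# Empirical quartiles and the mean of the same sample
q25, q50, q75 = np.quantile(costs, [0.25, 0.50, 0.75])
mean_cost = costs.mean()

# Share of observations that fall below the mean
share_below_mean = (costs < mean_cost).mean()

print(f"25th percentile: {q25:,.0f}")
print(f"median         : {q50:,.0f}")
print(f"75th percentile: {q75:,.0f}")
print(f"mean           : {mean_cost:,.0f}")
print(f"share of sample below the mean: {share_below_mean:.2f}")
```

In a right-skewed sample the mean sits above the median, so a budget set at the average describes neither the typical site nor the expensive tail.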
Preprocessing & Feature Engineering
- We drop records missing any of our seven core variables to ensure a clean modelling set.
- Categorical features (Contaminant_Type and Land_Use) are one-hot encoded, dropping the first level of each to avoid collinearity with the intercept, so the models can capture contaminant-type and land-use effects.
- We assemble a predictor list that combines four numeric site attributes and the generated dummy variables, and then rename the target to Cost.
# Keep only records with cost and core features
df = df.dropna(subset=[
'EstimatedCleanupCost',
'Max_Concentration_mg_L',
'Contamination_Area_m2',
'Depth_to_Groundwater_m',
'Proximity_to_Water_m',
'Contaminant_Type',
'Land_Use'
])
# One-hot encode categorical features
# dtype=float keeps the dummies numeric (recent pandas versions default to bool)
df_enc = pd.get_dummies(
df,
columns=['Contaminant_Type','Land_Use'],
drop_first=True,
dtype=float
)
# Define predictors and response
features = [
'Max_Concentration_mg_L',
'Contamination_Area_m2',
'Depth_to_Groundwater_m',
'Proximity_to_Water_m'
] + [c for c in df_enc.columns
if c.startswith('Contaminant_Type_') or c.startswith('Land_Use_')]
data = df_enc[features + ['EstimatedCleanupCost']].rename(
columns={'EstimatedCleanupCost':'Cost'}
)
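The encoding step can be illustrated on a toy frame. The category values below (Metals, Petroleum, Industrial, and so on) are hypothetical and chosen only to show what drop_first=True does; the real dataset's levels may differ.

```python
import pandas as pd

# Hypothetical category values, for illustration only
toy = pd.DataFrame({
    "Contaminant_Type": ["Metals", "Petroleum", "Solvents", "Metals"],
    "Land_Use": ["Industrial", "Residential", "Industrial", "Commercial"],
})

# drop_first=True removes the alphabetically first level of each column,
# which then acts as the reference category absorbed by the intercept
encoded = pd.get_dummies(
    toy,
    columns=["Contaminant_Type", "Land_Use"],
    drop_first=True,
    dtype=float,
)

print(encoded.columns.tolist())
print(encoded)
```

Here "Metals" and "Commercial" become the reference levels, so the last row (Metals, Commercial) is encoded as all zeros and every dummy coefficient is read relative to that baseline.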
Train/Test Split
We randomly reserve 20% of sites for out-of-sample evaluation, so we can check how well the quantile forecasts generalise to unseen remediation cases.
# 80/20 split for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)
Fit Quantile Regression Models
For each percentile (25th, 50th, 75th):
- We define a regression formula linking Cost to all predictors.
- We fit a QuantReg model on the training data at that quantile.
- We print the coefficient table, which shows how each feature's marginal effect on cleanup cost changes between the lower quartile, the median, and the upper quartile (e.g., depth-to-groundwater may drive costs more in extreme cases).
quantiles = [0.25, 0.50, 0.75]
models = {}
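# Q('...') quotes each column name so dummy names with spaces or punctuation stay valid in the formula
formula = "Cost ~ " + " + ".join(f"Q('{c}')" for c in features)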
for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])  # show coefficient table only
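To make the idea of quantile-dependent coefficients concrete, the following sketch fits the same three quantiles on synthetic data. The single predictor Area and the data-generating process are assumptions made for illustration: the noise scale grows with Area, so the upper quantiles of cost should rise faster than the lower ones.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, illustrative only: noise scale grows with the predictor
rng = np.random.default_rng(42)
n = 2_000
area = rng.uniform(1.0, 10.0, size=n)  # hypothetical predictor
cost = 10.0 + 2.0 * area + area * rng.normal(size=n)
toy = pd.DataFrame({"Cost": cost, "Area": area})

# Fit one quantile regression per level and collect the slope on Area
slopes = {}
for q in (0.25, 0.50, 0.75):
    res = smf.quantreg("Cost ~ Area", toy).fit(q=q)
    slopes[q] = res.params["Area"]
    print(f"q={q:.2f}  slope on Area = {slopes[q]:.2f}")
```

Under this setup the fitted slope increases with the quantile level, which is the pattern to look for in the real coefficient tables: a feature whose coefficient grows from the 25th to the 75th percentile matters more for expensive sites than for cheap ones.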
Evaluation with Pinball Loss
- We predict quantile‑specific costs on the held‑out test set.
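- We compute pinball loss for each quantile. This loss is tailored to quantile forecasts: at quantile q it weights under-prediction by q and over-prediction by 1 - q, so the penalty is asymmetric everywhere except the median. Lower pinball loss indicates more accurate and better-calibrated quantile estimates, but values are best compared between models at the same quantile rather than across quantiles.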
for q, res in models.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Cost'], preds, alpha=q)
    print(f"{int(q*100)}th percentile pinball loss: {loss:.2f}")
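The asymmetric penalty is easier to see when computed by hand. The sketch below is a plain NumPy version of the pinball loss plus a simple coverage check; the helper names pinball_loss and coverage and the four toy values are assumptions for illustration, not part of the pipeline above.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss at level q."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def coverage(y_true, y_pred):
    """Share of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))

# Toy values, illustrative only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([150.0, 150.0, 250.0, 350.0])

loss_25 = pinball_loss(y_true, y_pred, q=0.25)
loss_75 = pinball_loss(y_true, y_pred, q=0.75)
cov = coverage(y_true, y_pred)

print(f"pinball loss at q=0.25: {loss_25:.2f}")
print(f"pinball loss at q=0.75: {loss_75:.2f}")
print(f"empirical coverage    : {cov:.2f}")
```

These predictions sit below three of the four actual values, so they behave like 25th-percentile forecasts: coverage is 0.25 and the loss is lower when scored at q=0.25 than at q=0.75. Applying the same coverage check to the test-set predictions of each fitted model gives a quick calibration read alongside the pinball loss.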
Summary
Quantile regression provides distribution‑aware cost forecasts for environmental remediation:
- The 25th-percentile model gives a low-end estimate suited to relatively straightforward cleanups; about three in four comparable sites are expected to cost more.
- The median (50th-percentile) model forecasts typical remediation costs for standard planning.
- The 75th-percentile model anticipates high-cost, complex cleanups and helps size contingency reserves, though roughly one in four comparable sites may still exceed it.
By modeling multiple quantiles, environmental managers and finance teams gain robust insights into cost variability—optimizing budget allocation, mitigating financial risk, and ensuring sufficient resources across the full spectrum of remediation challenges.