Environmental Cleanup Cost Prediction using Quantile Regression

Environmental agencies and remediation contractors must budget for a wide range of site cleanup costs—from relatively minor soil‑removal jobs (25th percentile) to large‑scale, complex Superfund projects (75th percentile).

A single average cost estimate obscures tail risk and can lead to under‑ or over‑provisioning of funds. Here, we’ll predict the 25th, 50th, and 75th percentiles of cleanup cost (USD) for New York State remediation sites based on site attributes—such as contaminant concentration, contamination area, depth to groundwater, proximity to water bodies, contaminant type, and land‑use classification—by fitting separate quantile regression models.

This distribution‑aware approach equips decision‑makers to allocate budgets conservatively, plan around typical expenditures, and reserve contingency funds for high‑cost cleanups.

Libraries Required

import pandas as pd  
import numpy as np  
import statsmodels.formula.api as smf       # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts

Dataset

NYS Environmental Remediation Sites

Step-by-Step Code Implementation

Load & Inspect Data

We import the NYS remediation sites data—containing site metrics and EstimatedCleanupCost—and inspect its structure, types, and cost distribution (.describe()) to gauge skew and range.

# Load the NYS Environmental Remediation Sites dataset  
# Source: Kaggle
df = pd.read_csv("NYS_Environmental_Remediation_Sites.csv")

# Quick inspection
print(df.head())
df.info()  # info() prints its report directly and returns None, so no print() wrapper
print(df['EstimatedCleanupCost'].describe())
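To make "gauging skew" concrete, here is a minimal sketch of the checks one might run on the cost column. It uses synthetic log-normal values as a stand-in for EstimatedCleanupCost (an assumption for illustration, not the real data); the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed values standing in for EstimatedCleanupCost
# (illustrative only; not drawn from the NYS dataset).
rng = np.random.default_rng(0)
cost = pd.Series(
    rng.lognormal(mean=12.0, sigma=1.0, size=5_000),
    name="EstimatedCleanupCost",
)

cost_quartiles = cost.quantile([0.25, 0.50, 0.75])  # the three targets modelled later
cost_skewness = cost.skew()                         # > 0 indicates a long right tail
mean_over_median = cost.mean() / cost.median()      # > 1 when the tail pulls the mean up

print(cost_quartiles)
print(f"skewness: {cost_skewness:.2f}, mean/median: {mean_over_median:.2f}")
```

A mean well above the median is the pattern that makes a single average a poor planning figure and motivates modelling the quartiles separately.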

Preprocessing & Feature Engineering

We drop records missing any of our seven core variables to ensure a clean modelling set.
Categorical features—Contaminant_Type and Land_Use—are transformed via one‑hot encoding (dropping the first level) to capture programmatic and land‑use effects.
We assemble a predictor list that combines four numeric site attributes and the generated dummy variables, and then rename the target to Cost.

# Keep only records with cost and core features
df = df.dropna(subset=[
    'EstimatedCleanupCost',
    'Max_Concentration_mg_L',
    'Contamination_Area_m2',
    'Depth_to_Groundwater_m',
    'Proximity_to_Water_m',
    'Contaminant_Type',
    'Land_Use'
])

# One‑hot encode categorical features
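df_enc = pd.get_dummies(
    df,
    columns=['Contaminant_Type','Land_Use'],
    drop_first=True,
    dtype=float   # numeric 0/1 dummies; recent pandas defaults to bool, which the formula API treats as categorical
)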

# Define predictors and response
features = [
    'Max_Concentration_mg_L',
    'Contamination_Area_m2',
    'Depth_to_Groundwater_m',
    'Proximity_to_Water_m'
] + [c for c in df_enc.columns
     if c.startswith('Contaminant_Type_') or c.startswith('Land_Use_')]

data = df_enc[features + ['EstimatedCleanupCost']].rename(
    columns={'EstimatedCleanupCost':'Cost'}
)
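As a small illustration of what dropping the first level does, the sketch below encodes a three-row toy frame. The category values ("Commercial", "Industrial", "Residential") are made up for the example and may not match the levels in the real dataset.

```python
import pandas as pd

# Toy frame with hypothetical Land_Use levels (not taken from the NYS data).
toy = pd.DataFrame({
    "Contamination_Area_m2": [120.0, 450.0, 80.0],
    "Land_Use": ["Industrial", "Residential", "Commercial"],
})

# Levels are sorted alphabetically, so "Commercial" is the dropped baseline.
toy_enc = pd.get_dummies(toy, columns=["Land_Use"], drop_first=True, dtype=float)
dummy_cols = [c for c in toy_enc.columns if c.startswith("Land_Use_")]

print(toy_enc)
print(dummy_cols)
```

The "Commercial" row has zeros in every dummy column, so each Land_Use coefficient in the fitted models is read as a cost difference relative to that baseline level.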

Train/Test Split

We randomly reserve 20% of sites for out‑of‑sample evaluation, ensuring our quantile forecasts generalise to unseen remediation cases.

# 80/20 split for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)

Fit Quantile Regression Models

For each percentile (25th, 50th, 75th):

We define a regression formula linking Cost to all predictors.
We fit a QuantReg model on the training data at that quantile.
We print the coefficient table—revealing how each feature’s marginal impact on cleanup cost shifts across the lower, median, and upper tails (e.g., depth-to-groundwater may drive costs more in extreme cases).

quantiles = [0.25, 0.50, 0.75]
models    = {}
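# Q("...") quotes each column name so dummy names containing spaces or punctuation still parse
formula   = "Cost ~ " + " + ".join(f'Q("{c}")' for c in features)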

for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])   # show coefficient table only
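To compare coefficients across quantiles side by side, one option is to collect each fit's params into a single table. The sketch below does this on synthetic data in which the noise grows with contamination area, so the slope should rise from the 25th to the 75th percentile. The data-generating numbers are assumptions chosen for illustration, and the single-predictor formula is a simplification of the model above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (illustrative): noise spread grows with area,
# so higher quantiles should have a steeper area slope.
rng = np.random.default_rng(42)
n = 2_000
area = rng.uniform(100, 1_000, size=n)
cost = 5_000 + 50 * area + rng.normal(0, 1, size=n) * 20 * area
synthetic = pd.DataFrame({"Cost": cost, "Contamination_Area_m2": area})

quantiles = [0.25, 0.50, 0.75]
fits = {
    q: smf.quantreg("Cost ~ Contamination_Area_m2", synthetic).fit(q=q)
    for q in quantiles
}

# One column per quantile, one row per coefficient.
coef_table = pd.DataFrame({f"q={q:.2f}": res.params for q, res in fits.items()})
print(coef_table.round(2))

# Separately fitted quantile lines can cross; this reports how often they do in-sample.
preds = pd.DataFrame({q: res.predict(synthetic) for q, res in fits.items()})
crossing_rate = ((preds[0.25] > preds[0.50]) | (preds[0.50] > preds[0.75])).mean()
print(f"share of rows with crossed quantiles: {crossing_rate:.3f}")
```

Because each quantile is fitted independently, nothing forces the 25th-percentile prediction to stay below the median for every site, which is why a crossing check is worth running before the forecasts are used for budgeting.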

Evaluation with Pinball Loss

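We predict quantile‑specific costs on the held‑out test set.
We compute the pinball loss for each quantile, a loss function tailored to quantile forecasts that penalises under‑ and over‑prediction asymmetrically: at quantile q, an under‑prediction is weighted by q and an over‑prediction by 1 - q. Lower pinball loss indicates a more accurate quantile estimate. Because the loss is in the units of Cost (USD) and its weighting depends on q, compare values between models at the same quantile rather than across quantiles.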

for q, res in models.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['Cost'], preds, alpha=q)
    print(f"{int(q*100)}th percentile pinball loss: {loss:.2f}")

Summary

Quantile regression provides distribution‑aware cost forecasts for environmental remediation:

The 25th‑percentile model gives a low‑end estimate suited to relatively straightforward cleanups; roughly three in four comparable sites are expected to cost more.
The median (50th‑percentile) model forecasts typical remediation costs for standard planning.
The 75th‑percentile model anticipates high‑cost, complex cleanups and supports contingency reserves. It is not a worst‑case bound, since about one in four comparable sites is still expected to exceed it.

By modeling multiple quantiles, environmental managers and finance teams gain robust insights into cost variability—optimizing budget allocation, mitigating financial risk, and ensuring sufficient resources across the full spectrum of remediation challenges.
