Environmental Cleanup Cost Prediction in ML

Environmental agencies and contractors must predict the cleanup cost for contaminated sites, covering soil removal, groundwater treatment, and disposal, so that they can allocate budgets effectively and prioritize interventions.

In this project, we will predict the EstimatedCleanupCost (USD) for remediation sites based on site attributes such as contaminant type, contaminant concentration, area of contamination, depth to groundwater, land use classification, and proximity to water bodies.

By applying stepwise regression, we’ll identify the most significant cost drivers and build an interpretable linear model—helping decision‑makers target resources to the highest‑impact sites.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Ordinary Least Squares regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

NYS Environmental Remediation Sites

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We import the NYS remediation‑sites dataset, examining data types, missingness, and summary statistics to ensure critical fields (including EstimatedCleanupCost) are present.

# Block 1: Load NYS Environmental Remediation Sites dataset
# Source: Kaggle
df = pd.read_csv("NYS_Environmental_Remediation_Sites.csv")

# Inspect top rows and schema
print(df.head())
df.info()  # info() prints its report directly and returns None
print(df.describe())
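The missingness check mentioned above can be made explicit. The following is a minimal sketch on an invented toy frame (the values and the `required` list are assumptions for illustration, not taken from the real dataset); it reports the share of missing values per column and flags expected columns that are absent.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real CSV; column names mirror the tutorial,
# values are invented for illustration only.
toy = pd.DataFrame({
    "EstimatedCleanupCost": [120000.0, np.nan, 450000.0, 80000.0],
    "Contaminant_Type": ["Petroleum", "Metals", None, "Solvents"],
    "Contamination_Area_m2": [500.0, 1200.0, 3000.0, 250.0],
})

# Fraction of missing values per column, worst first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns the model expects but the frame lacks (hypothetical required list)
required = ["EstimatedCleanupCost", "Contaminant_Type", "Land_Use"]
absent = [c for c in required if c not in toy.columns]
print("Absent columns:", absent)
```

Running the same two checks on the loaded `df` before preprocessing shows how many rows the later `dropna` call is likely to discard.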

Data Preprocessing

Records missing any core variables are dropped. Categorical features (Contaminant_Type, Land_Use) are transformed via one‑hot encoding. We assemble a feature matrix X of numeric and dummy variables and set y to the cleanup‐cost target.

# Block 2: Clean & encode features
# Drop rows missing critical fields
df = df.dropna(subset=[
    'EstimatedCleanupCost',
    'Contaminant_Type',
    'Max_Concentration_mg_L',
    'Contamination_Area_m2',
    'Depth_to_Groundwater_m',
    'Land_Use',
    'Proximity_to_Water_m'
])

# One‑hot encode categorical variables
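# dtype=float keeps the dummies numeric; recent pandas versions default to bool,
# which statsmodels cannot fit when mixed with float columns
df_enc = pd.get_dummies(df,
                        columns=['Contaminant_Type','Land_Use'],
                        drop_first=True,
                        dtype=float)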

# Define predictors X and target y
feature_cols = [
    'Max_Concentration_mg_L',
    'Contamination_Area_m2',
    'Depth_to_Groundwater_m',
    'Proximity_to_Water_m'
] + [c for c in df_enc.columns if c.startswith('Contaminant_Type_') 
     or c.startswith('Land_Use_')]

X = df_enc[feature_cols]
y = df_enc['EstimatedCleanupCost']
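To see what the encoding step produces, here is a small sketch on an invented toy frame (the category values are assumptions). It shows that `drop_first=True` drops the alphabetically first level, which becomes the baseline that the remaining dummy coefficients are measured against. Passing `dtype=float` is a choice made here so the dummies stay numeric.

```python
import pandas as pd

# Toy data; the Land_Use categories are invented for illustration
toy = pd.DataFrame({
    "Land_Use": ["Industrial", "Residential", "Commercial", "Industrial"],
    "Contamination_Area_m2": [500.0, 1200.0, 3000.0, 250.0],
})

# "Commercial" sorts first, so it is dropped and acts as the reference level
encoded = pd.get_dummies(toy, columns=["Land_Use"], drop_first=True, dtype=float)
print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in every `Land_Use_*` column therefore represents the dropped baseline category, not a missing value.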

Train/Test Split

An 80/20 split ensures we can evaluate model generalization on held‑out data.

# Block 3: Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Stepwise Regression Function

Our custom function alternates:

Forward inclusion: adds the excluded predictor with the lowest p-value (< 0.01).
Backward elimination: removes the included predictor with the highest p-value (> 0.05).

Iteration stops when no further features qualify, yielding a concise set of statistically significant drivers.

# Block 4: Forward–backward stepwise selection
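def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting features so the caller's list is never mutated
    included = list(initial_list) if initial_list is not None else []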
    while True:
        changed = False

        # Forward step: assess each excluded feature
        excluded = [col for col in X.columns if col not in included]
        pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best_pval = pvals.min()
        if best_pval < threshold_in:
            best_feat = pvals.idxmin()
            included.append(best_feat)
            changed = True
            if verbose:
                print(f"Add  {best_feat:30} p-value {best_pval:.4f}")

        # Backward step: remove least significant among included
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals_incl = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals_incl.max()
        if worst_pval > threshold_out:
            worst_feat = pvals_incl.idxmax()
            included.remove(worst_feat)
            changed = True
            if verbose:
                print(f"Drop {worst_feat:30} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

We fit an Ordinary Least Squares regression on the selected features using statsmodels. The output’s coefficients quantify the marginal USD impact of each feature; p-values assess significance; R² and the F‑statistic gauge fit quality.
Predictions on the test set yield Test R² (explained variance) and RMSE (average error scale), quantifying model accuracy on unseen sites.

# Block 5: Perform stepwise selection
selected_features = stepwise_selection(X_train, y_train)

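# Fit final OLS model on selected features
# has_constant='add' forces an intercept column even if a feature happens
# to be constant in one split, so train and test designs always match
X_train_sel = sm.add_constant(X_train[selected_features], has_constant='add')
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set and compute metrics
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')
y_pred = model.predict(X_test_sel)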

print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
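For reference, both metrics can be computed by hand. The sketch below uses four invented cost values (not taken from the dataset) to show that R² is one minus the ratio of residual to total sum of squares, and that RMSE is the square root of the mean squared residual.

```python
import numpy as np

# Invented actual and predicted costs, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)          # 4 * 10^2 = 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 50000
r2 = 1 - ss_res / ss_tot

# RMSE = sqrt(mean of squared residuals), in the same units as the target
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(r2, rmse)  # 0.992 and 10.0
```

Because RMSE is expressed in USD here, it can be compared directly with typical cleanup budgets, whereas R² is unitless.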

Residual Diagnostics

A residual vs. predicted plot checks for non‑random patterns or heteroscedasticity, validating key OLS assumptions and identifying potential outliers.

# Block 6: Residual vs. predicted plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel("Predicted Cleanup Cost (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()

Summary

Applying stepwise regression to environmental remediation data isolates the key cost drivers, such as contaminant concentration and contaminated area, alongside encoded categorical factors like contaminant type and land use, while pruning less informative variables.

The resulting linear model balances interpretability (clear coefficients and p-values) with predictive accuracy, as measured by test‑set R² and RMSE, giving environmental managers a transparent tool to forecast cleanup budgets and prioritize remediation efforts.
