Environmental Cleanup Cost Prediction in ML
Environmental agencies and contractors must predict the cleanup cost for contaminated sites, covering soil removal, groundwater treatment, and disposal, so that they can allocate budgets effectively and prioritize interventions.
In this project, we predict the EstimatedCleanupCost (USD) of remediation sites from site attributes such as contaminant type, contaminant concentration, area of contamination, depth to groundwater, land use classification, and proximity to water bodies.
By applying stepwise regression, we’ll identify the most significant cost drivers and build an interpretable linear model—helping decision‑makers target resources to the highest‑impact sites.
Libraries Required
import pandas as pd                                          # Data loading & manipulation
import numpy as np                                           # Numerical operations
import statsmodels.api as sm                                 # Ordinary Least Squares regression
from sklearn.model_selection import train_test_split         # Train/test split
from sklearn.metrics import r2_score, mean_squared_error     # Evaluation metrics
import matplotlib.pyplot as plt                              # Visualization
Dataset
NYS Environmental Remediation Sites
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We import the NYS remediation‑sites dataset, examining data types, missingness, and summary statistics to ensure critical fields (including EstimatedCleanupCost) are present.
# Block 1: Load NYS Environmental Remediation Sites dataset
# Source: Kaggle
df = pd.read_csv("NYS_Environmental_Remediation_Sites.csv")
# Inspect top rows and schema
print(df.head())
print(df.info())
print(df.describe())
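To put a number on missingness before any rows are dropped, one option is to compute the share of missing values per column. The sketch below runs on a small made-up frame (the values are illustrative only and not taken from the dataset); the column names simply mirror the ones this tutorial assumes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up)
toy = pd.DataFrame({
    "EstimatedCleanupCost": [120000.0, np.nan, 450000.0, 80000.0],
    "Contaminant_Type": ["Petroleum", "Metals", None, "Petroleum"],
    "Contamination_Area_m2": [500.0, 1200.0, 300.0, np.nan],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows that have no missing value in any column
complete_rows = toy.dropna().shape[0]
print("Complete rows:", complete_rows)
```

Running the same two statements on df shows how many records the dropna call in the next block would discard.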
Data Preprocessing
Records missing any core variables are dropped. Categorical features (Contaminant_Type, Land_Use) are transformed via one‑hot encoding. We assemble a feature matrix X of numeric and dummy variables and set y to the cleanup‐cost target.
# Block 2: Clean & encode features
# Drop rows missing critical fields
df = df.dropna(subset=[
    'EstimatedCleanupCost',
    'Contaminant_Type',
    'Max_Concentration_mg_L',
    'Contamination_Area_m2',
    'Depth_to_Groundwater_m',
    'Land_Use',
    'Proximity_to_Water_m'
])
# One‑hot encode categorical variables
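# dtype=float keeps the dummies numeric; recent pandas versions default to bool,
# which statsmodels OLS rejects when mixed with float columns
df_enc = pd.get_dummies(df,
                        columns=['Contaminant_Type', 'Land_Use'],
                        drop_first=True,
                        dtype=float)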
# Define predictors X and target y
feature_cols = [
    'Max_Concentration_mg_L',
    'Contamination_Area_m2',
    'Depth_to_Groundwater_m',
    'Proximity_to_Water_m'
] + [c for c in df_enc.columns
     if c.startswith('Contaminant_Type_') or c.startswith('Land_Use_')]
X = df_enc[feature_cols]
y = df_enc['EstimatedCleanupCost']
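The effect of drop_first=True is easiest to see on a tiny example. The sketch below uses a made-up frame with three land-use categories (the category labels are assumptions for illustration, not values confirmed from the dataset). The alphabetically first category is dropped and becomes the baseline against which the remaining dummy coefficients are read.

```python
import pandas as pd

# Toy frame with three hypothetical land-use categories
toy = pd.DataFrame({
    "Land_Use": ["Industrial", "Residential", "Commercial", "Industrial"],
    "Contamination_Area_m2": [500.0, 1200.0, 300.0, 800.0],
})

# drop_first=True removes the first category ("Commercial") so the dummies
# are not perfectly collinear with the regression intercept
encoded = pd.get_dummies(toy, columns=["Land_Use"], drop_first=True, dtype=float)
print(encoded.columns.tolist())

# The baseline row (index 2, "Commercial") has zeros in every dummy column
dummy_cols = [c for c in encoded.columns if c.startswith("Land_Use_")]
baseline_sum = encoded.loc[2, dummy_cols].sum()
print("Dummy sum for the baseline row:", baseline_sum)
```

In the fitted model, a coefficient on Land_Use_Industrial would then be interpreted as the cost difference relative to the dropped baseline category, not as an absolute cost.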
Train/Test Split
An 80/20 split ensures we can evaluate model generalization on held‑out data.
# Block 3: Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Stepwise Regression Function
Our custom function alternates:
- Forward inclusion: adds the excluded predictor with the lowest p-value, provided that value is below 0.01.
- Backward elimination: removes the included predictor with the highest p-value, provided that value is above 0.05.
Iteration stops when no further features qualify, yielding a concise set of statistically significant drivers.
# Block 4: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting list (None avoids a shared mutable default argument)
    included = list(initial_list) if initial_list else []
    while True:
        changed = False
        # Forward step: assess each excluded feature
        excluded = [col for col in X.columns if col not in included]
        pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best_pval = pvals.min()  # NaN when nothing is left to add
        if best_pval < threshold_in:
            best_feat = pvals.idxmin()
            included.append(best_feat)
            changed = True
            if verbose:
                print(f"Add {best_feat:30} p-value {best_pval:.4f}")
        # Backward step: remove least significant among included
        if included:  # skip when no feature has been selected yet
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals_incl = model.pvalues.iloc[1:]  # exclude intercept
            worst_pval = pvals_incl.max()
            if worst_pval > threshold_out:
                worst_feat = pvals_incl.idxmax()
                included.remove(worst_feat)
                changed = True
                if verbose:
                    print(f"Drop {worst_feat:30} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
- We fit an Ordinary Least Squares regression on the selected features using statsmodels. Each coefficient estimates the change in cleanup cost (USD) per unit change in that feature, holding the other features fixed; p-values assess significance; R² and the F-statistic gauge overall fit.
- Predictions on the test set yield Test R² (share of variance explained) and RMSE (typical error magnitude, in USD), quantifying model accuracy on unseen sites.
# Block 5: Perform stepwise selection
selected_features = stepwise_selection(X_train, y_train)
# Fit final OLS model on selected features
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set and compute metrics
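# has_constant='add' forces the intercept column even if a test column happens to be constant
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')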
y_pred = model.predict(X_test_sel)
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
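For readers who want to see what the two metrics measure, the sketch below computes R² and RMSE by hand with NumPy on four made-up cost values (the numbers are illustrative only). R² is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual.

```python
import numpy as np

# Made-up actual and predicted costs (USD), for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

# Residual and total sums of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R²: share of variance in y_true explained by the predictions
r2 = 1.0 - ss_res / ss_tot

# RMSE: typical size of a prediction error, in the units of y
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print("R²:", r2)
print("RMSE:", rmse)
```

Here every prediction is off by exactly 10, so the RMSE is 10, while R² is 0.992 because the errors are small relative to the spread of the actual values.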
Residual Diagnostics
A residual vs. predicted plot checks for non‑random patterns or heteroscedasticity, validating key OLS assumptions and identifying potential outliers.
# Block 6: Residual vs. predicted plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel("Predicted Cleanup Cost (USD)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
Summary
Applying stepwise regression to environmental remediation data isolates the key cost drivers, such as contaminant concentration and contaminated area, alongside one-hot encoded categorical factors like contaminant type and land use, while pruning less informative variables.
The resulting linear model balances interpretability (clear coefficients and p-values) with predictive accuracy, which should be judged from the test-set R² and RMSE. This gives environmental managers a transparent tool to forecast cleanup budgets and prioritize remediation efforts.