Pollution Control Cost Prediction with Ridge Regression in ML


Municipal utilities and industrial plants spend significant sums each year on pollution‑control activities—installing scrubbers, operating electrostatic precipitators, running wastewater treatment units, and monitoring stack emissions. Finance directors and environmental‐compliance teams need an early, data‑backed forecast of next year’s pollution‑control costs so they can:

set realistic capital and operating budgets,
negotiate allowance purchases or technology upgrades, and
brief investors on the financial impact of tightening regulations.

Using the open “Toxics Release Inventory (Section 8) Cost Data” dataset (self‑reported by U.S. industrial facilities under Section 8 of the EPA Toxics Release Inventory), we will build a Ridge‑regression model that predicts each facility’s annual pollution‑control cost in USD from variables the plant already knows:

Feature block	Example columns in the dataset
Facility meta	State, industry (NAICS code), parent company size
Process scale	Total production (lbs), on‑site releases (lbs)
Abatement mix	Capital spent on recycling, treatment, and energy recovery
Historical trend	Prior‑year cost, 3‑year rolling reduction rate
Calendar cues	Reporting year

Ridge regression (a linear model with an L2 penalty) keeps coefficients interpretable, since every weight is expressed in dollars, while damping the coefficient instability that multicollinearity between overlapping abatement categories would otherwise cause.
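To make the multicollinearity point concrete, here is a minimal sketch on synthetic data (not the TRI dataset). It assumes two nearly identical "capex" columns that each truly contribute 1.0 to the cost, then compares ordinary least squares with Ridge. The variable names and the choice of alpha=1.0 are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two almost identical predictors (synthetic, for illustration only)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X_demo = np.column_stack([x1, x2])

# Assumed true relationship: each column adds 1.0 to the target
y_demo = x1 + x2 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

print("OLS coefficients  :", ols.coef_)
print("Ridge coefficients:", ridge_demo.coef_)
```

With predictors this correlated, the OLS weights are poorly determined individually and can land far from 1.0 with opposite signs, while the Ridge weights stay close to each other and their sum stays near 2.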

Libraries Required

pandas # load / tidy the CSV
numpy # numeric helpers
matplotlib.pyplot # optional diagnostics
scikit‑learn # preprocessing, RidgeCV, metrics
joblib # persist the trained pipeline

Dataset Link

Toxics Release Inventory (Section 8) Cost Data

Step-by-Step Code Implementation

1. Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the Kaggle dataset

df = pd.read_csv("Toxics_Release_Inventory.csv")   # adjust path
print(df[['facility_name', 'reporting_year',
          'total_cost_usd']].head())

Key fields (abbreviated)

column	description
total_cost_usd	target – reported pollution‑control spend
prev_year_cost_usd	previous‑year cost
production_lbs	total production weight
onsite_release_lbs	TRI on‑site releases
recycle_capex_usd	Capital cost for recycling equipment
treatment_capex_usd	Capital cost for treatment equipment
energy_recovery_capex_usd	Capital cost for energy‑recovery equipment
state	two‑letter code
naics3	three‑digit NAICS industry code
reporting_year	integer
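If the CSV is not at hand, the rest of the walkthrough can still be exercised on made-up data. The sketch below builds a synthetic DataFrame with the column names listed above. The helper `make_synthetic_tri`, the distributions, and the cost formula are all hypothetical and say nothing about the real TRI figures.

```python
import numpy as np
import pandas as pd

def make_synthetic_tri(n=500, seed=0):
    """Return a made-up DataFrame that uses the tutorial's column names."""
    rng = np.random.default_rng(seed)
    prev = rng.lognormal(mean=13, sigma=1, size=n)
    prod = rng.lognormal(mean=15, sigma=1, size=n)
    recycle = rng.lognormal(mean=11, sigma=1, size=n)
    # Treatment capex is deliberately correlated with recycling capex
    treatment = recycle * rng.uniform(0.5, 1.5, size=n)
    energy = rng.lognormal(mean=10, sigma=1, size=n)
    # Arbitrary cost formula, chosen only so the target depends on the inputs
    cost = 0.8 * prev + 0.01 * prod + 0.1 * treatment + rng.normal(0, 5e4, size=n)
    return pd.DataFrame({
        'facility_name': [f"Facility {i}" for i in range(n)],
        'total_cost_usd': cost,
        'prev_year_cost_usd': prev,
        'production_lbs': prod,
        'onsite_release_lbs': prod * rng.uniform(0.001, 0.01, size=n),
        'recycle_capex_usd': recycle,
        'treatment_capex_usd': treatment,
        'energy_recovery_capex_usd': energy,
        'state': rng.choice(['TX', 'OH', 'CA', 'PA'], size=n),
        'naics3': rng.choice([325, 331, 332], size=n),
        'reporting_year': rng.choice([2019, 2020, 2021], size=n),
    })

demo_df = make_synthetic_tri(n=50)
print(demo_df.head())
```

Assigning `df = make_synthetic_tri()` in place of the `pd.read_csv` call lets the later steps run unchanged, though any scores obtained this way are meaningless.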

3. Basic cleaning & feature lists

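This step drops rows with missing values in the core columns and splits the predictors into numeric and categorical lists for the pipeline built in step 4.
Categorical columns (state, naics3, reporting_year) will be one‑hot encoded into binary columns so the model can learn a separate cost offset for each region, industry, and program year.
Numeric columns (all monetary and mass variables) will be standardised to the same statistical scale, so Ridge’s L2 penalty shrinks their coefficients on a comparable basis instead of depending on each column’s units.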

core_cols = ['total_cost_usd','prev_year_cost_usd','production_lbs',
             'onsite_release_lbs','recycle_capex_usd',
             'treatment_capex_usd','energy_recovery_capex_usd',
             'state','naics3','reporting_year']
df = df.dropna(subset=core_cols).copy()

num_cols = ['prev_year_cost_usd','production_lbs','onsite_release_lbs',
            'recycle_capex_usd','treatment_capex_usd',
            'energy_recovery_capex_usd']
cat_cols = ['state','naics3','reporting_year']
target   = 'total_cost_usd'

X = df[num_cols + cat_cols]
y = df[target]
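Why scaling matters for Ridge can be checked with a small experiment. The sketch below uses made-up production figures, expresses them once in pounds and once in short tons, and fits the same Ridge model to both. The data, the alpha value, and the helper `max_prediction_gap` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Made-up production figures in pounds, and the same figures in short tons
X_lbs = rng.normal(loc=50_000, scale=10_000, size=(100, 1))
X_tons = X_lbs / 2_000
y_cost = 3.0 * X_lbs[:, 0] + rng.normal(scale=5_000, size=100)

def max_prediction_gap(make_model):
    """Largest difference between predictions fitted on lbs vs. tons."""
    pred_lbs = make_model().fit(X_lbs, y_cost).predict(X_lbs)
    pred_tons = make_model().fit(X_tons, y_cost).predict(X_tons)
    return np.abs(pred_lbs - pred_tons).max()

# Without scaling, the penalty bites harder when the unit makes values small
raw_gap = max_prediction_gap(lambda: Ridge(alpha=100))
# With z-scoring first, the unit no longer matters
scaled_gap = max_prediction_gap(
    lambda: make_pipeline(StandardScaler(), Ridge(alpha=100)))

print(f"Gap without scaling: {raw_gap:,.2f}")
print(f"Gap with scaling   : {scaled_gap:,.2e}")
```

The unscaled model gives different predictions depending on the unit, while the scaled pipeline gives the same predictions up to floating-point error, which is the reason the numeric columns pass through StandardScaler before the Ridge step.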

4. Pre‑processing + Ridge‑CV pipeline

RidgeCV runs five‑fold cross‑validation over a grid of α values and keeps the one with the best average validation score; the chosen value is then available as alpha_.

preprocess = ColumnTransformer([
        ('cats', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('nums', StandardScaler(),                      num_cols)
])

ridge = RidgeCV(alphas=[0.1, 1, 10, 50, 100, 300], cv=5)

pipe = Pipeline([
        ('prep',  preprocess),
        ('model', ridge)
])
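The α grid above is hand-picked. A log-spaced grid is a common alternative when the right order of magnitude is unknown. The sketch below shows the idea on a synthetic regression problem; the grid bounds and the data are assumptions, not tuned values for the TRI dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=20.0, random_state=0)

# Twelve candidate penalties spread evenly on a log scale from 0.01 to 1000
alpha_grid = np.logspace(-2, 3, 12)

cv_model = RidgeCV(alphas=alpha_grid, cv=5).fit(X_demo, y_demo)
print("Candidate alphas:", np.round(alpha_grid, 3))
print("Selected alpha  :", cv_model.alpha_)
```

If the selected value sits at either end of the grid, that is usually a hint to widen the range and refit.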

5. Train–test split & fit

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)

6. Evaluate hold‑out accuracy

pred = pipe.predict(X_test)

print(f"Selected α (L2 strength): {pipe.named_steps['model'].alpha_}")
print(f"R² on test set          : {r2_score(y_test, pred):.3f}")
print(f"MAE on test set         : ${mean_absolute_error(y_test, pred):,.0f}")
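R² and MAE are easier to judge against a naive reference. One simple option, sketched below, is to compare the model with a forecast that just repeats last year's cost. The helper `compare_to_naive` is hypothetical and the numbers are toy values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def compare_to_naive(y_true, model_pred, prev_year_cost):
    """Return (model MAE, naive MAE); the naive forecast repeats last year's cost."""
    model_mae = mean_absolute_error(y_true, model_pred)
    naive_mae = mean_absolute_error(y_true, prev_year_cost)
    return model_mae, naive_mae

# Toy numbers, for illustration only
y_true_demo = np.array([100.0, 200.0, 300.0])
model_pred_demo = np.array([110.0, 190.0, 310.0])
prev_cost_demo = np.array([80.0, 230.0, 300.0])

model_mae, naive_mae = compare_to_naive(y_true_demo, model_pred_demo, prev_cost_demo)
print(f"Model MAE: {model_mae:.2f}")
print(f"Naive MAE: {naive_mae:.2f}")
```

In this walkthrough the call would be `compare_to_naive(y_test, pred, X_test['prev_year_cost_usd'])`. A Ridge model that cannot beat the naive forecast adds little for budgeting.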

7. Coefficient insight

Coefficients remain in dollar units. As a hypothetical illustration, a +$2.1 M coefficient on production_lbs (per σ) would quantify the added cost associated with high‑volume operations, while a −$180 k coefficient on recycle_capex_usd would suggest that capital spending on recycling is associated with lower reported costs.

# rebuild full feature list
ohe = pipe.named_steps['prep'].named_transformers_['cats']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])

coefs = (pd.Series(pipe.named_steps['model'].coef_, index=feature_names)
         .sort_values())

print("\nLargest cost *reducers* (negative weights):")
print(coefs.head(8))

print("\nLargest cost *drivers* (positive weights):")
print(coefs.tail(8))

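Numeric features were z‑scored, so each numeric coefficient is the USD change for a one‑standard‑deviation increase in that metric. Because the encoder drops no category, each one‑hot coefficient is a dollar offset relative to the model intercept (shrunk toward zero by the L2 penalty) rather than relative to a single reference group.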
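A per-σ coefficient can be converted back to dollars per original unit by dividing it by the standard deviation the scaler learned. The sketch below shows the arithmetic on synthetic data with assumed true slopes of 0.5 and 2.0; the step names 'scale' and 'model' belong to this toy pipeline only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Two made-up numeric features on very different scales
X_demo = rng.normal(loc=[1e6, 5e4], scale=[2e5, 1e4], size=(300, 2))
# Assumed true slopes: 0.5 and 2.0 dollars per unit
y_demo = 0.5 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=1e4, size=300)

demo = Pipeline([
    ('scale', StandardScaler()),
    ('model', Ridge(alpha=1.0)),
]).fit(X_demo, y_demo)

per_sigma = demo.named_steps['model'].coef_   # dollars per one standard deviation
sigma = demo.named_steps['scale'].scale_      # standard deviation of each feature
per_unit = per_sigma / sigma                  # dollars per original unit

print("Per-sigma coefficients:", per_sigma)
print("Per-unit coefficients :", per_unit)
```

In the tutorial pipeline the fitted scaler is `pipe.named_steps['prep'].named_transformers_['nums']`, and its `scale_` values line up with the order of num_cols.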

8. Persist the model

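Why Ridge rather than ordinary least squares: recycling, treatment and energy‑recovery investments are correlated, and OLS can inflate their weights in opposite directions. Ridge stabilises the solution, tends to improve generalisation and still keeps the model linear. Saving the whole pipeline with joblib keeps the preprocessing and the fitted model together in one file, so new data can be scored without repeating the encoding and scaling by hand.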

joblib.dump(pipe, "ridge_pollution_cost_model.pkl")
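Reloading and scoring a new facility might look like the sketch below. To stay self-contained it trains a tiny two-column pipeline on made-up rows and saves it to a temporary file; the file name, the rows, and the unseen state code 'NY' are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up training set
train = pd.DataFrame({
    'prev_year_cost_usd': [1.0e6, 2.0e6, 1.5e6, 3.0e6],
    'state': ['TX', 'OH', 'TX', 'CA'],
})
train_target = pd.Series([1.1e6, 2.1e6, 1.6e6, 3.2e6])

demo_pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('cats', OneHotEncoder(handle_unknown='ignore'), ['state']),
        ('nums', StandardScaler(), ['prev_year_cost_usd']),
    ])),
    ('model', Ridge(alpha=1.0)),
]).fit(train, train_target)

# Save to a temporary location, then load it back as a separate object
model_path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
joblib.dump(demo_pipe, model_path)
loaded = joblib.load(model_path)

# New rows must use the training column names; 'NY' was never seen in training,
# so handle_unknown='ignore' encodes it as all zeros
new_facility = pd.DataFrame([{'prev_year_cost_usd': 2.5e6, 'state': 'NY'}])
forecast = loaded.predict(new_facility)[0]
print(f"Forecast cost: ${forecast:,.0f}")
```

Pickled pipelines should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.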

Summary

In well under 100 lines of code, we transformed EPA TRI cost reports into an explainable pollution‑control cost forecaster:

Practical benefit: budget officers can preview next‑year abatement expenses months before purchasing allowances or chemicals.
Transparent levers: every variable’s dollar impact is clear, helping sustainability teams justify process‑efficiency projects.
Benchmark: any future gradient‑boosted or Bayesian model must beat this Ridge MAE and remain just as defensible to auditors and regulators.
