Pollution Control Cost Prediction with Ridge Regression in ML


Municipal utilities and industrial plants spend significant sums each year on pollution‑control activities—installing scrubbers, operating electrostatic precipitators, running wastewater treatment units, and monitoring stack emissions. Finance directors and environmental‐compliance teams need an early, data‑backed forecast of next year’s pollution‑control costs so they can:

  • set realistic capital and operating budgets,
  • negotiate allowance purchases or technology upgrades, and
  • brief investors on the financial impact of tightening regulations.

Using the open “Toxics Release Inventory (Section 8) Cost Data” dataset (self‑reported by U.S. industrial facilities under Section 8 of the EPA Toxics Release Inventory), we will build a Ridge‑regression model that predicts each facility’s annual pollution‑control cost in USD from variables the plant already knows:

| Feature block | Example columns in the dataset |
| --- | --- |
| Facility meta | State, industry (NAICS code), parent company size |
| Process scale | Total production (lbs), on‑site releases (lbs) |
| Abatement mix | Capital spent on recycling, treatment, and energy recovery |
| Historical trend | Prior‑year cost, 3‑year rolling reduction rate |
| Calendar cues | Reporting year |

Ridge regression (a linear model with an L2 penalty) keeps the coefficients interpretable, since every weight is a dollar amount, while stabilising the estimates when overlapping abatement categories are strongly correlated with one another.
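To make the multicollinearity point concrete, here is a minimal sketch on synthetic data (not TRI data). It builds two nearly identical columns and compares ordinary least squares with Ridge. The variable names and the alpha value are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two almost identical synthetic "abatement spend" columns
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X_demo = np.column_stack([x1, x2])

# True relationship: each column adds +1 to the target, plus noise
y_demo = x1 + x2 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)

# OLS tends to split the shared effect erratically between the twins;
# Ridge pulls the two weights toward each other.
print("OLS coefficients  :", ols.coef_)
print("Ridge coefficients:", ridge_demo.coef_)
```

In this setup the two Ridge weights should land close to each other and close to the true value of 1, whereas the OLS weights can drift far apart while still summing to roughly 2.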

Libraries Required

  • pandas # load / tidy the CSV
  • numpy # numeric helpers
  • matplotlib.pyplot # optional diagnostics
  • scikit‑learn # preprocessing, RidgeCV, metrics
  • joblib # persist the trained pipeline

Dataset Link

Toxics Release Inventory (Section 8) Cost Data

Step-by-Step Code Implementation

1. Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the Kaggle dataset

df = pd.read_csv("Toxics_Release_Inventory.csv")   # adjust path
print(df[['facility_name', 'reporting_year',
          'total_cost_usd']].head())
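A downloaded file may not use exactly the column names listed in this article, so a quick check before going further can save a confusing KeyError later. The helper below is a hypothetical convenience function, not part of pandas, and the REQUIRED list simply repeats the names used in this tutorial.

```python
import pandas as pd

# Column names as used in this tutorial (your CSV headers may differ)
REQUIRED = ['total_cost_usd', 'prev_year_cost_usd', 'production_lbs',
            'onsite_release_lbs', 'recycle_capex_usd',
            'treatment_capex_usd', 'energy_recovery_capex_usd',
            'state', 'naics3', 'reporting_year']

def missing_columns(frame, required=REQUIRED):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Tiny stand-in frame that has only two of the expected columns
demo_frame = pd.DataFrame({'total_cost_usd': [1.0], 'state': ['TX']})
print("Missing:", missing_columns(demo_frame))
```

With the real data you would call missing_columns(df) and rename or map headers until the list comes back empty.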

Key fields (abbreviated)

| column | description |
| --- | --- |
| total_cost_usd | target: reported pollution‑control spend |
| prev_year_cost_usd | previous‑year cost |
| production_lbs | total production weight |
| onsite_release_lbs | TRI on‑site releases |
| recycle_capex_usd | capital cost for recycling equipment |
| treatment_capex_usd | capital cost for treatment equipment |
| energy_recovery_capex_usd | capital cost for energy‑recovery equipment |
| state | two‑letter code |
| naics3 | three‑digit NAICS industry code |
| reporting_year | integer |

3. Basic cleaning & feature lists

This step drops incomplete rows and defines the feature lists. The preprocessing built in step 4 then uses those lists as follows:

  • state, naics3, and reporting_year are converted into binary (one‑hot) columns so the model can learn separate cost offsets for each region, industry, and program year.
  • All monetary and mass variables are put on the same statistical scale, so Ridge’s L2 penalty shrinks their coefficients on comparable terms rather than according to each column’s units.

core_cols = ['total_cost_usd','prev_year_cost_usd','production_lbs',
             'onsite_release_lbs','recycle_capex_usd',
             'treatment_capex_usd','energy_recovery_capex_usd',
             'state','naics3','reporting_year']
df = df.dropna(subset=core_cols).copy()

num_cols = ['prev_year_cost_usd','production_lbs','onsite_release_lbs',
            'recycle_capex_usd','treatment_capex_usd',
            'energy_recovery_capex_usd']
cat_cols = ['state','naics3','reporting_year']
target   = 'total_cost_usd'

X = df[num_cols + cat_cols]
y = df[target]

4. Pre‑processing + Ridge‑CV pipeline

RidgeCV runs a five‑fold cross‑validation over a grid of α values and stores the one with the best cross‑validated score in its alpha_ attribute.

preprocess = ColumnTransformer([
        ('cats', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('nums', StandardScaler(),                      num_cols)
])

ridge = RidgeCV(alphas=[0.1, 1, 10, 50, 100, 300], cv=5)

pipe = Pipeline([
        ('prep',  preprocess),
        ('model', ridge)
])
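If the CSV is not at hand yet, the same pipeline layout can be smoke-tested on made-up data. The sketch below uses a synthetic frame with a reduced set of columns; the values, the coefficients in the data-generating line, and the toy_ names are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in for the real dataset (values are made up)
toy = pd.DataFrame({
    'prev_year_cost_usd': rng.uniform(1e5, 5e6, n),
    'production_lbs': rng.uniform(1e4, 1e7, n),
    'state': rng.choice(['TX', 'OH', 'CA'], n),
    'reporting_year': rng.choice([2019, 2020, 2021], n),
})
toy['total_cost_usd'] = (0.9 * toy['prev_year_cost_usd']
                         + 0.05 * toy['production_lbs']
                         + rng.normal(0, 5e4, n))

toy_num = ['prev_year_cost_usd', 'production_lbs']
toy_cat = ['state', 'reporting_year']
toy_alphas = [0.1, 1, 10, 50, 100, 300]

toy_pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('cats', OneHotEncoder(handle_unknown='ignore'), toy_cat),
        ('nums', StandardScaler(), toy_num),
    ])),
    ('model', RidgeCV(alphas=toy_alphas, cv=5)),
])

toy_X = toy[toy_num + toy_cat]
toy_pipe.fit(toy_X, toy['total_cost_usd'])

print("Chosen alpha:", toy_pipe.named_steps['model'].alpha_)
print("First predictions:", toy_pipe.predict(toy_X.head(3)))
```

If this runs cleanly, any later error on the real file is more likely a data issue (missing columns, wrong dtypes) than a pipeline issue.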

5. Train–test split & fit

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
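Because the stated goal is to forecast next year's cost, a chronological hold-out may give a more honest picture than a shuffled split, which mixes past and future years. The helper below is a hypothetical alternative, not what the article's code does; it assumes reporting_year is an integer column.

```python
import pandas as pd

def split_by_year(frame, year_col='reporting_year', n_test_years=1):
    """Hold out the most recent `n_test_years` distinct years as the test set."""
    years = sorted(frame[year_col].unique())
    if len(years) <= n_test_years:
        raise ValueError("need more distinct years than n_test_years")
    test_years = years[-n_test_years:]
    is_test = frame[year_col].isin(test_years)
    return frame[~is_test], frame[is_test]

# Small illustrative frame (values are made up)
history = pd.DataFrame({
    'reporting_year': [2019, 2019, 2020, 2020, 2021, 2021],
    'total_cost_usd': [1.0e6, 2.0e6, 1.1e6, 2.2e6, 1.3e6, 2.1e6],
})
train_part, test_part = split_by_year(history)
print(train_part['reporting_year'].unique(), test_part['reporting_year'].unique())
```

One caveat with this approach: reporting_year is one-hot encoded with handle_unknown='ignore', so a year that never appears in training gets no year offset at prediction time. Treating the year as a numeric feature is one possible workaround.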

6. Evaluate hold‑out accuracy

pred = pipe.predict(X_test)

print(f"Selected α (L2 strength): {pipe.named_steps['model'].alpha_}")
print(f"R² on test set          : {r2_score(y_test, pred):.3f}")
print(f"MAE on test set         : ${mean_absolute_error(y_test, pred):,.0f}")
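An MAE figure is easier to judge next to a naive reference. One simple reference, sketched here as an assumption rather than something the article computes, is the persistence baseline: predict that this year's cost equals last year's. The numbers in the demo arrays are invented.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def persistence_mae(y_true, prev_year_cost):
    """MAE of the naive forecast 'this year's cost = last year's cost'."""
    return mean_absolute_error(y_true, prev_year_cost)

# Invented example values in USD
actual_cost = np.array([1_200_000, 800_000, 2_500_000])
last_year_cost = np.array([1_000_000, 900_000, 2_400_000])

print(f"Persistence baseline MAE: ${persistence_mae(actual_cost, last_year_cost):,.0f}")
```

With the article's variables this would be persistence_mae(y_test, X_test['prev_year_cost_usd']). If the Ridge MAE is not clearly below that value, the extra features are adding little beyond the prior-year cost.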

7. Coefficient insight

Coefficients remain in dollar units. As a hypothetical illustration, a +$2.1 M coefficient on production_lbs (per σ) would quantify the added cost associated with high‑volume operations, while a −$180 k coefficient on recycle_capex_usd would suggest that capital spending on recycling tends to go with lower expenses. These weights describe associations in the data, not proven causal effects.

# rebuild full feature list
ohe = pipe.named_steps['prep'].named_transformers_['cats']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])

coefs = (pd.Series(pipe.named_steps['model'].coef_, index=feature_names)
         .sort_values())

print("\nLargest cost *reducers* (negative weights):")
print(coefs.head(8))

print("\nLargest cost *drivers* (positive weights):")
print(coefs.tail(8))

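Numeric features were z‑scored, so each numeric coefficient is the USD change for a one‑standard‑deviation increase in that metric. Each one‑hot weight is a dollar offset relative to the model intercept; because the encoder keeps every category rather than dropping a reference level, Ridge shrinks these offsets toward zero instead of measuring them against a baseline group.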
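A per-σ weight can be turned into a per-original-unit weight (for example dollars per pound) by dividing it by the standard deviation the scaler learned. The function below is a hypothetical helper, and the synthetic data with a true slope of 3 exists only to show that the conversion recovers the original units.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def per_unit_coefficients(per_sigma_coefs, scaler, names):
    """Divide per-sigma weights by the fitted standard deviations."""
    return pd.Series(np.asarray(per_sigma_coefs) / scaler.scale_, index=names)

# Synthetic check: cost rises by exactly 3 USD per lb, plus small noise
rng = np.random.default_rng(1)
X_units = pd.DataFrame({'production_lbs': rng.uniform(0, 1e6, 500)})
y_units = 3.0 * X_units['production_lbs'] + rng.normal(0, 1e3, 500)

unit_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', Ridge(alpha=1e-6)),   # near-zero penalty so the slope is not shrunk
]).fit(X_units, y_units)

per_unit = per_unit_coefficients(unit_pipe.named_steps['model'].coef_,
                                 unit_pipe.named_steps['scale'],
                                 X_units.columns)
print(per_unit)
```

For the article's pipeline, the fitted scaler should be reachable as pipe.named_steps['prep'].named_transformers_['nums'], and the matching weights as coefs[num_cols].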

8. Persist the model

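The saved file contains the whole pipeline (encoder, scaler and fitted Ridge model), so new data can be scored later without repeating the preprocessing by hand. As for why Ridge is the model worth keeping: recycling, treatment and energy‑recovery investments are correlated, and OLS can inflate their weights in opposite directions. Ridge stabilises the solution, tends to improve generalisation and still keeps the model linear.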

joblib.dump(pipe, "ridge_pollution_cost_model.pkl")
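Reloading works the same way in reverse. The sketch below trains a deliberately tiny stand-in pipeline on invented numbers, saves it to a temporary folder, loads it back and scores one new row; the file name and data are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in model trained on invented values
train_rows = pd.DataFrame({'prev_year_cost_usd': [1e5, 2e5, 3e5, 4e5]})
train_cost = np.array([1.1e5, 2.1e5, 3.2e5, 4.1e5])
small_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', Ridge(alpha=1.0)),
]).fit(train_rows, train_cost)

# Save, then load again from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(small_pipe, model_path)
    restored = joblib.load(model_path)

# New data must use the same column names the pipeline was fitted on
new_facility = pd.DataFrame({'prev_year_cost_usd': [2.5e5]})
print(restored.predict(new_facility))
```

Pickled pipelines are generally only safe to load with the same scikit-learn version that created them, and joblib files should only be loaded from sources you trust.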

Summary

In roughly 50 lines of code, we transformed EPA TRI cost reports into an explainable pollution‑control cost forecaster:

  • Practical benefit: budget officers can preview next‑year abatement expenses months before purchasing allowances or chemicals.
  • Transparent levers: every variable’s dollar impact is clear, helping sustainability teams justify process‑efficiency projects.
  • Benchmark: any future gradient‑boosted or Bayesian model must beat this Ridge MAE and remain just as defensible to auditors and regulators.


ProjectGurukul Team
