Pollution Control Cost Prediction with Ridge Regression in ML
Municipal utilities and industrial plants spend significant sums each year on pollution‑control activities—installing scrubbers, operating electrostatic precipitators, running wastewater treatment units, and monitoring stack emissions. Finance directors and environmental‐compliance teams need an early, data‑backed forecast of next year’s pollution‑control costs so they can:
- set realistic capital and operating budgets,
- negotiate allowance purchases or technology upgrades, and
- brief investors on the financial impact of tightening regulations.
Using the open “Toxics Release Inventory (Section 8) Cost Data” dataset (self‑reported by U.S. industrial facilities under Section 8 of the EPA Toxics Release Inventory), we will build a Ridge‑regression model that predicts each facility’s annual pollution‑control cost in USD from variables the plant already knows:
| Feature block | Example columns in the dataset |
| Facility meta | State, industry (NAICS code), parent company size |
| Process scale | Total production (lbs), on‑site releases (lbs) |
| Abatement mix | Capital spent on recycling, treatment, and energy recovery |
| Historical trend | Prior‑year cost, 3‑year rolling reduction rate |
| Calendar cues | Reporting year |
Ridge regression (linear model + L2 penalty) preserves coefficient interpretability—every weight is a dollar number—while damping multicollinearity between overlapping abatement categories.
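To make the multicollinearity point concrete, here is a minimal NumPy sketch on synthetic data (the two features, the true weights, and the penalty value are all assumptions for illustration, not taken from the TRI dataset). It fits the same two nearly collinear features with and without an L2 penalty, using the closed-form ridge solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two nearly collinear synthetic features, standing in for overlapping
# abatement categories (illustrative data only).
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=n)


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution without intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


ols_coef = ridge_fit(X, y, alpha=0.0)     # alpha = 0 reduces to ordinary least squares
ridge_coef = ridge_fit(X, y, alpha=10.0)  # alpha > 0 adds the L2 penalty

print("OLS coefficients  :", ols_coef)
print("Ridge coefficients:", ridge_coef)
```

With features this correlated, the unpenalised weights can swing far from the true values; the penalised weights always have a smaller or equal overall norm and tend to sit closer together.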
Libraries Required
- pandas # load / tidy the CSV
- numpy # numeric helpers
- matplotlib.pyplot # optional diagnostics
- scikit‑learn # preprocessing, RidgeCV, metrics
- joblib # persist the trained pipeline
Dataset Link
Toxics Release Inventory (Section 8) Cost Data
Step-by-Step Code Implementation
1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the Kaggle dataset
df = pd.read_csv("Toxics_Release_Inventory.csv") # adjust path
print(df[['facility_name', 'reporting_year',
'total_cost_usd']].head())
Key fields (abbreviated)
| column | description |
| total_cost_usd | target – reported pollution‑control spend |
| prev_year_cost_usd | previous‑year cost |
| production_lbs | total production weight |
| onsite_release_lbs | TRI on‑site releases |
| recycle_capex_usd | Capital cost for recycling equipment |
| treatment_capex_usd | Capital cost for treatment equipment |
| energy_recovery_capex_usd | Capital cost for energy‑recovery equipment |
| state | two‑letter code |
| naics3 | three‑digit NAICS industry code |
| reporting_year | integer |
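If the CSV is not at hand, a synthetic frame with the same column names lets you run the remaining steps end to end. The sketch below is only a stand-in: the helper name `make_synthetic_tri`, the distributions, the category values, and the cost formula are all invented for illustration and say nothing about the real data.

```python
import numpy as np
import pandas as pd


def make_synthetic_tri(n_rows=500, seed=0):
    """Build a fake frame with the tutorial's column names (hypothetical values)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "facility_name": [f"Facility {i}" for i in range(n_rows)],
        "reporting_year": rng.integers(2015, 2023, size=n_rows),
        "state": rng.choice(["TX", "OH", "CA", "PA"], size=n_rows),
        "naics3": rng.choice([325, 331, 332], size=n_rows),
        "production_lbs": rng.lognormal(mean=14, sigma=1, size=n_rows),
        "onsite_release_lbs": rng.lognormal(mean=10, sigma=1, size=n_rows),
        "recycle_capex_usd": rng.lognormal(mean=11, sigma=1, size=n_rows),
        "treatment_capex_usd": rng.lognormal(mean=11, sigma=1, size=n_rows),
        "energy_recovery_capex_usd": rng.lognormal(mean=10, sigma=1, size=n_rows),
        "prev_year_cost_usd": rng.lognormal(mean=13, sigma=0.5, size=n_rows),
    })
    # Arbitrary made-up relationship so the target is not pure noise.
    noise = rng.normal(0, 50_000, size=n_rows)
    df["total_cost_usd"] = (0.9 * df["prev_year_cost_usd"]
                            + 0.01 * df["production_lbs"]
                            + noise)
    return df


synthetic_df = make_synthetic_tri()
print(synthetic_df[["facility_name", "reporting_year", "total_cost_usd"]].head())
```

Any metric obtained on this frame reflects the invented formula only, so treat it as a smoke test of the code rather than a result.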
3. Basic cleaning & feature lists
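The preprocessing defined in step 4 does two things with these feature lists:
- One-hot encodes state, naics3, and reporting_year into binary columns so the model can learn a separate cost offset for each region, industry, and program year.
- Standardises all monetary and mass variables to zero mean and unit variance. Ridge's L2 penalty depends on coefficient size, so without scaling a feature measured in small units would need a large coefficient and would be shrunk hardest; after scaling, the penalty treats every numeric feature alike.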
core_cols = ['total_cost_usd','prev_year_cost_usd','production_lbs',
'onsite_release_lbs','recycle_capex_usd',
'treatment_capex_usd','energy_recovery_capex_usd',
'state','naics3','reporting_year']
df = df.dropna(subset=core_cols).copy()
num_cols = ['prev_year_cost_usd','production_lbs','onsite_release_lbs',
'recycle_capex_usd','treatment_capex_usd',
'energy_recovery_capex_usd']
cat_cols = ['state','naics3','reporting_year']
target = 'total_cost_usd'
X = df[num_cols + cat_cols]
y = df[target]
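The two transformations described in this step can be sketched by hand on a toy example. The values below are made up; the point is only to show what a z-scored column and a one-hot block look like before they reach the model.

```python
import numpy as np

# Toy numeric column (hypothetical production weights in lbs).
production_lbs = np.array([1_000.0, 2_000.0, 3_000.0, 4_000.0])

# Z-score: subtract the mean, divide by the population standard deviation.
z = (production_lbs - production_lbs.mean()) / production_lbs.std()

# Toy categorical column (hypothetical state codes).
states = np.array(["TX", "OH", "TX", "CA"])
categories = np.unique(states)  # sorted unique values: CA, OH, TX

# One-hot: one binary column per category, exactly one 1 per row.
one_hot = (states[:, None] == categories[None, :]).astype(float)

print("z-scores  :", z)
print("categories:", categories)
print(one_hot)
```

After this step every numeric column has mean 0 and standard deviation 1, and every row has a single 1 in its category block.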
4. Pre‑processing + Ridge‑CV pipeline
RidgeCV runs five‑fold cross‑validation over a grid of α values on the training data, keeps the α with the best average validation score, and refits on the full training set with it.
preprocess = ColumnTransformer([
('cats', OneHotEncoder(handle_unknown='ignore'), cat_cols),
('nums', StandardScaler(), num_cols)
])
ridge = RidgeCV(alphas=[0.1, 1, 10, 50, 100, 300], cv=5)
pipe = Pipeline([
('prep', preprocess),
('model', ridge)
])
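For readers who want to see what the α search amounts to, the sketch below spells it out as an explicit loop on synthetic data. It is an approximation of the idea, not RidgeCV's internals: the data, the shuffled `KFold`, and the choice of mean squared error as the validation measure are assumptions made here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression problem (illustrative only).
X_demo = rng.normal(size=(200, 5))
true_weights = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y_demo = X_demo @ true_weights + rng.normal(size=200)

alphas = [0.1, 1, 10, 50, 100, 300]
cv = KFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for alpha in alphas:
    # Each call fits Ridge on 4 folds and scores on the held-out fold, 5 times.
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo,
                             cv=cv, scoring="neg_mean_squared_error")
    mean_scores[alpha] = scores.mean()

# Scores are negated errors, so the largest value is the lowest error.
best_alpha = max(mean_scores, key=mean_scores.get)
print("Mean negative MSE per alpha:", mean_scores)
print("Best alpha:", best_alpha)
```

If the best α sits at either end of the grid, it is usually worth widening the grid in that direction.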
5. Train–test split & fit
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
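Because the goal is to forecast next year's cost, a random split can look more favourable than real use, since rows from the final year end up in training. One alternative, sketched here on a made-up frame, is to hold out the most recent reporting year. The frame and the name `cutoff_year` are hypothetical.

```python
import pandas as pd

# Tiny made-up frame with the tutorial's column names.
demo = pd.DataFrame({
    "reporting_year": [2018, 2019, 2020, 2021, 2021, 2022, 2022],
    "prev_year_cost_usd": [1.0e6, 1.1e6, 1.2e6, 1.3e6, 2.0e6, 1.4e6, 2.1e6],
    "total_cost_usd": [1.1e6, 1.2e6, 1.3e6, 1.4e6, 2.1e6, 1.5e6, 2.2e6],
})

# Hold out the latest year; train on everything before it.
cutoff_year = demo["reporting_year"].max()
train_part = demo[demo["reporting_year"] < cutoff_year]
test_part = demo[demo["reporting_year"] == cutoff_year]

print(f"Train years: {sorted(train_part['reporting_year'].unique())}")
print(f"Test year  : {cutoff_year} ({len(test_part)} rows)")
```

One caveat if you try this with the pipeline above: reporting_year is one-hot encoded, so the held-out year is an unseen category and, with handle_unknown='ignore', contributes no year offset to the prediction.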
6. Evaluate hold‑out accuracy
pred = pipe.predict(X_test)
print(f"Selected α (L2 strength): {pipe.named_steps['model'].alpha_}")
print(f"R² on test set : {r2_score(y_test, pred):.3f}")
print(f"MAE on test set : ${mean_absolute_error(y_test, pred):,.0f}")
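The two metrics printed here are simple to compute by hand, which helps when explaining them to a non-technical audience. The sketch below uses four made-up cost values and reproduces the standard definitions that `mean_absolute_error` and `r2_score` implement.

```python
import numpy as np

# Made-up actual and predicted costs in USD.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

# MAE: average absolute miss, in the same units as the target.
mae = np.mean(np.abs(y_true - y_hat))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE: {mae:.1f}")  # MAE: 10.0
print(f"R^2: {r2:.3f}")   # R^2: 0.992
```

MAE is usually the easier figure for budgeting conversations because it reads directly as "typical miss in dollars".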
7. Coefficient insight
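Coefficients remain in dollar units. As an illustrative example, a +$2.1 M coefficient on production_lbs would mean that a one‑standard‑deviation increase in production is associated with $2.1 M more in predicted cost, while a −$180 k coefficient on recycle_capex_usd would suggest that capital spending on recycling goes with lower reported cost. These are associations in the data, not proof of cause and effect.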
# rebuild full feature list
ohe = pipe.named_steps['prep'].named_transformers_['cats']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])
coefs = (pd.Series(pipe.named_steps['model'].coef_, index=feature_names)
.sort_values())
print("\nLargest cost *reducers* (negative weights):")
print(coefs.head(8))
print("\nLargest cost *drivers* (positive weights):")
print(coefs.tail(8))
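Numeric features were z‑scored, so each numeric coefficient is the USD change in predicted cost for a one‑standard‑deviation increase in that metric. The encoder keeps every category (no level is dropped), so each one‑hot coefficient is a dollar offset relative to the model intercept, shrunk toward zero by the penalty, rather than a contrast against a single reference group.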
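A per-standard-deviation coefficient can be converted back to dollars per original unit by dividing by that feature's standard deviation. The sketch below checks this on synthetic data with a plain least-squares line (no penalty, for simplicity); the data and the 0.2 USD per lb slope are invented. For the fitted pipeline, the corresponding standard deviations would be the `scale_` attribute of the StandardScaler step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic production weights and costs (illustrative only).
production_lbs = rng.uniform(1e5, 1e7, size=300)
cost_usd = 50_000 + 0.2 * production_lbs + rng.normal(0, 1e4, size=300)

# Z-score the feature the same way StandardScaler does (population std).
sigma = production_lbs.std()
z = (production_lbs - production_lbs.mean()) / sigma

# Slope of a straight-line fit; polyfit returns the highest degree first.
slope_per_sigma = np.polyfit(z, cost_usd, deg=1)[0]
slope_per_lb = np.polyfit(production_lbs, cost_usd, deg=1)[0]

print(f"USD per one-sigma increase: {slope_per_sigma:,.0f}")
print(f"USD per lb (direct fit)   : {slope_per_lb:.4f}")
print(f"USD per lb (converted)    : {slope_per_sigma / sigma:.4f}")
```

The last two printed values agree, which is the conversion to use when someone asks what one more pound of production costs rather than one more standard deviation.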
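Why Ridge rather than OLS: recycling, treatment and energy‑recovery investments are correlated, so OLS can give them large, unstable weights of opposite sign. Ridge stabilises the solution, usually improves generalisation, and still keeps the model linear.
8. Persist the model
Saving the whole pipeline stores the fitted encoder, scaler and model together, so raw rows can be passed straight to the reloaded object.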
joblib.dump(pipe, "ridge_pollution_cost_model.pkl")
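A quick round-trip check confirms that a saved pipeline predicts the same after reloading. The sketch below is self-contained and deliberately tiny: the four training rows, the fixed `alpha=1.0`, the file name `demo_model.pkl`, and the new facility are all made up, and plain `Ridge` stands in for `RidgeCV` to keep the example small.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Four made-up training rows using two of the tutorial's column names.
train = pd.DataFrame({
    "prev_year_cost_usd": [1.0e6, 2.0e6, 1.5e6, 3.0e6],
    "state": ["TX", "OH", "TX", "CA"],
    "total_cost_usd": [1.1e6, 2.1e6, 1.4e6, 3.2e6],
})

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cats", OneHotEncoder(handle_unknown="ignore"), ["state"]),
        ("nums", StandardScaler(), ["prev_year_cost_usd"]),
    ])),
    ("model", Ridge(alpha=1.0)),
])
demo_pipe.fit(train[["prev_year_cost_usd", "state"]], train["total_cost_usd"])

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")
    joblib.dump(demo_pipe, path)
    loaded = joblib.load(path)

# Raw, untransformed input; "NY" was never seen and is encoded as all zeros.
new_facility = pd.DataFrame({"prev_year_cost_usd": [2.5e6], "state": ["NY"]})
pred_original = demo_pipe.predict(new_facility)
pred_loaded = loaded.predict(new_facility)
print(f"Original: {pred_original[0]:,.0f}  Reloaded: {pred_loaded[0]:,.0f}")
```

Pickled scikit-learn objects are generally only safe to reload with the same library version that wrote them, so it is worth recording the scikit-learn version alongside the file.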
Summary
In roughly 50 lines of code, we turned EPA TRI cost reports into an explainable pollution‑control cost forecaster:
- Practical benefit: budget officers can preview next‑year abatement expenses months before purchasing allowances or chemicals.
- Transparent levers: every variable’s dollar impact is clear, helping sustainability teams justify process‑efficiency projects.
- Benchmark: any future gradient‑boosted or Bayesian model must beat this Ridge MAE and remain just as defensible to auditors and regulators.