Environmental Compliance Cost Prediction with ElasticNet Algorithm in ML


Firms in regulated industries must estimate their annual environmental‑compliance cost in USD (capital plus operating spend devoted to air, water, and solid‑waste controls) before budgets are approved. Historical survey data show that cost is driven by facility size, industry sector, production output, pollution‑intensity indices, year, and geographic region. Many of these variables are tightly collinear (larger plants tend to have higher output and greater pollution intensity), so ordinary least squares produces unstable coefficients. At the same time, a pure Lasso (ℓ1) model can over‑shrink and tends to keep only one variable from a correlated group, discarding relevant factors. Elastic Net combines the Ridge (ℓ2) and Lasso (ℓ1) penalties to balance stability and sparsity, yielding a transparent model that forecasts compliance cost from routinely collected facility metrics.
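The instability claim can be checked on synthetic data. The sketch below is only an illustration under stated assumptions: the two features ("size" and "output") are made up, the true coefficients and the alpha/l1_ratio values are arbitrary, and none of it comes from the survey data. It refits each model on bootstrap resamples and compares how much the coefficients move.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, ElasticNet

rng = np.random.default_rng(0)
n = 200

# Two hypothetical, almost perfectly collinear features
size = rng.normal(size=n)                          # stand-in for facility size
output = size + rng.normal(scale=0.01, size=n)     # nearly a copy of size
X = np.column_stack([size, output])
y = 3.0 * size + 3.0 * output + rng.normal(scale=1.0, size=n)

def coef_spread(model, n_boot=200):
    """Standard deviation of the fitted coefficients across bootstrap resamples."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # resample rows with replacement
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    return np.std(coefs, axis=0)

ols_spread = coef_spread(LinearRegression())
enet_spread = coef_spread(ElasticNet(alpha=0.1, l1_ratio=0.5))

print("OLS coefficient spread:        ", ols_spread)
print("Elastic Net coefficient spread:", enet_spread)
```

On data like this, the least‑squares coefficients should swing widely from resample to resample while the Elastic Net coefficients stay comparatively stable, which is the behaviour the paragraph above describes.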

Libraries Required

Purpose	Library
Data handling	pandas, numpy
Visualisation	matplotlib, seaborn
ML workflow	scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Metrics	mean_squared_error, r2_score

Dataset

Pollution Abatement Costs and Expenditures (PACE) survey. The code below assumes a CSV extract saved as pace_2005.csv.

Step-by-Step Code Implementation

1. Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

2. Load and explore data

df = pd.read_csv("pace_2005.csv")

# Example column names: ['Facility_ID','Industry_Sector','Region',
#                        'Employment','Value_Added_USD','Pollution_Intensity',
#                        'Abatement_Capital_USD','Abatement_Operating_USD','Year']
# Target: total compliance cost = capital spend + operating spend
df['Compliance_Cost_USD'] = (df['Abatement_Capital_USD']
                              + df['Abatement_Operating_USD'])
print(df[['Industry_Sector','Compliance_Cost_USD']].head())

sns.histplot(df['Compliance_Cost_USD']/1e6, kde=True)
plt.xlabel('Compliance Cost (million USD)')
plt.title('Distribution of Environmental‑Compliance Cost')
plt.show()

3. Define target and features

y = df['Compliance_Cost_USD']              # target

X = df[['Industry_Sector','Region','Employment',
        'Value_Added_USD','Pollution_Intensity','Year']]

cat_cols = ['Industry_Sector','Region']
num_cols = ['Employment','Value_Added_USD','Pollution_Intensity','Year']

4. Build an Elastic Net pipeline

ColumnTransformer one‑hot‑encodes categorical fields (sector, region) and z‑scales numeric variables (employment, value‑added, intensity, year).
Because these steps live inside the Pipeline, the encoder and scaler are re‑fitted on the training portion of each cross‑validation fold, which avoids information leakage.

preprocess = ColumnTransformer([
        # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
        ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
        ('num', StandardScaler(),                                 num_cols)
    ])

pipe = Pipeline([
        ('prep', preprocess),
        ('enet', ElasticNet(max_iter=20_000, random_state=42))
    ])

5. Train/test split and hyper‑parameter tuning

alpha controls global shrinkage; larger values reduce variance but increase bias.
l1_ratio shifts the penalty between Ridge (0 = pure ℓ2, good for multicollinearity) and Lasso (1 = pure ℓ1, induces sparsity).
An 18 × 9 grid (162 hyper‑parameter combinations, 810 fits with 5‑fold CV) selects the mix with the lowest cross‑validated root‑mean‑squared error.
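For reference, the objective that scikit‑learn's ElasticNet minimises can be written as follows, where w is the coefficient vector, n the number of training samples, α is alpha, and ρ is l1_ratio (the symbols n and ρ are shorthand introduced here):

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the ℓ1 term (Lasso), and ρ = 0 leaves only the ℓ2 term (Ridge). The penalty acts directly on the size of w, which is why the numeric features are standardised before fitting.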

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=df['Industry_Sector'])

param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),   # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}

gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)

print("Best alpha:", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
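Beyond the single best pair, it can help to look at how the cross‑validated RMSE varies across the whole grid. The following is a minimal sketch on synthetic data from make_regression with a deliberately small 3 × 3 grid. The names X_demo, y_demo, demo_grid, and demo_gs are placeholders, not part of the tutorial's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the real data
X_demo, y_demo = make_regression(n_samples=120, n_features=8,
                                 noise=10.0, random_state=0)

demo_grid = {'alpha': [0.01, 0.1, 1.0], 'l1_ratio': [0.2, 0.5, 0.8]}
demo_gs = GridSearchCV(ElasticNet(max_iter=20_000), demo_grid, cv=5,
                       scoring='neg_root_mean_squared_error')
demo_gs.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination
results = pd.DataFrame(demo_gs.cv_results_)

# Scores are negated RMSE, so flip the sign to get RMSE back
rmse_table = -results.pivot(index='param_alpha',
                            columns='param_l1_ratio',
                            values='mean_test_score')
print(rmse_table.round(2))
```

The same pivot applied to gs.cv_results_ (using the columns param_enet__alpha and param_enet__l1_ratio) would show whether the selected optimum sits on the edge of the grid, which is a hint that the search range should be widened.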

6. Evaluate Model

y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # the squared=False argument is deprecated in recent scikit-learn releases
r2   = r2_score(y_test, y_pred)

print(f"Hold‑out RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
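As a cross‑check on what the two metrics mean, the sketch below computes RMSE and R² by hand on four hypothetical cost values (invented for illustration) and compares them with scikit‑learn's results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted compliance costs in USD
y_true = np.array([1.2e6, 0.8e6, 2.5e6, 0.4e6])
y_hat  = np.array([1.0e6, 0.9e6, 2.1e6, 0.6e6])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual, in the same units as the target
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"Manual RMSE: ${rmse_manual:,.0f} | Manual R²: {r2_manual:.3f}")
print(f"sklearn RMSE: ${np.sqrt(mean_squared_error(y_true, y_hat)):,.0f} "
      f"| sklearn R²: {r2_score(y_true, y_hat):.3f}")
```

For these four values the RMSE works out to $250,000, i.e. a typical prediction error of a quarter of a million dollars. Because RMSE is in dollars, it is worth comparing against the typical size of the target when judging the model.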

7. Interpret influential features

Coefficient analysis might show, for example, that every extra 100 employees adds roughly $80 k to compliance spending, that high‑intensity regions impose $1.2 M premiums, and that each additional $10 M in value‑added boosts costs by $150 k. These figures are illustrative rather than results from this dataset, but they indicate the kind of insight procurement and EHS managers can use for benchmarking.

ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
feature_names = np.hstack([ohe.get_feature_names_out(cat_cols), num_cols])

# De‑scale numeric coefficients for real‑world units
scales = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
coef   = gs.best_estimator_.named_steps['enet'].coef_.copy()  # copy so the fitted model's coef_ is not overwritten
coef[-len(num_cols):] = coef[-len(num_cols):] / scales

(pd.Series(coef, index=feature_names)
   .sort_values(key=abs, ascending=False)
   .head(15)
   .plot(kind='barh', figsize=(9,5)))
plt.gca().invert_yaxis()
plt.xlabel('Δ Compliance Cost (USD)')
plt.title('Elastic Net – Key Cost Drivers')
plt.tight_layout()
plt.show()
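The de‑scaling step divides each numeric coefficient by the scaler's standard deviation, turning "dollars per standard deviation" into "dollars per original unit". The sketch below checks that arithmetic on synthetic data. It swaps in plain LinearRegression (an assumption made here, not the tutorial's model) because without a penalty the de‑scaled coefficient must match a fit on the raw data exactly. The 800 USD per employee slope is invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical data: cost rises by 800 USD per employee, plus noise
employment = rng.uniform(50, 2000, size=300)
cost = 50_000 + 800 * employment + rng.normal(scale=20_000, size=300)
X_raw = employment.reshape(-1, 1)

# Fit on standardised input, as the tutorial's pipeline does
scaled_fit = make_pipeline(StandardScaler(), LinearRegression()).fit(X_raw, cost)
scaler = scaled_fit.named_steps['standardscaler']
coef_per_sd = scaled_fit.named_steps['linearregression'].coef_.copy()

# Convert "USD per standard deviation" into "USD per employee"
coef_per_unit = coef_per_sd / scaler.scale_

# Reference: the same model fitted directly on the raw column
raw_fit = LinearRegression().fit(X_raw, cost)

print("De-scaled coefficient:", coef_per_unit)
print("Raw-data coefficient: ", raw_fit.coef_)
```

With Elastic Net the two fits would not coincide, because the penalty depends on the feature scale, but the unit conversion itself is the same division by scale_.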

Summary

With fewer than 100 lines of Python, this Elastic Net workflow:

Predicts environmental‑compliance cost early, giving finance teams defensible numbers.
Handles multicollinearity and achieves sparsity, maintaining stability while highlighting the most critical features.
Quantifies dollar impacts of scale, sector, pollution intensity, and geography, guiding strategic investment in abatement technologies and regulatory negotiations.
