Environmental Compliance Cost Prediction with ElasticNet Algorithm in ML
Firms in regulated industries must estimate their annual environmental‑compliance cost (USD)—capital plus operating spend devoted to air, water, and solid‑waste controls—before budgets are approved. Historical survey data show that cost is driven by facility size, industry sector, production output, pollution‑intensity indices, year, and geographic region. Many of these variables are tightly collinear (larger plants ↔ higher output ↔ greater pollution intensity), so ordinary least‑squares produces unstable coefficients. At the same time, a pure Lasso (ℓ¹) model can over‑shrink and discard relevant factors. Elastic Net (a Ridge ℓ² + Lasso ℓ¹ penalty) balances stability and sparsity, yielding a transparent model that forecasts compliance cost from routinely collected facility metrics.
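The instability that collinearity causes can be seen in a small simulation. The sketch below is not based on the survey data: `size` and `output` are hypothetical, standardized stand-ins for two nearly identical drivers, and the penalty settings (`alpha=0.1`, `l1_ratio=0.5`) are arbitrary illustrative choices. It refits ordinary least squares and Elastic Net on bootstrap resamples and compares how much each coefficient moves.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
n = 200

# Hypothetical, standardized stand-ins for two collinear cost drivers.
size = rng.normal(size=n)
output = size + rng.normal(scale=0.01, size=n)   # almost a copy of `size`
X = np.column_stack([size, output])
y = 2.0 * size + 2.0 * output + rng.normal(scale=1.0, size=n)


def coef_spread(model, n_boot=100):
    """Standard deviation of each coefficient across bootstrap resamples."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    return np.std(coefs, axis=0)


ols_spread = coef_spread(LinearRegression())
enet_spread = coef_spread(ElasticNet(alpha=0.1, l1_ratio=0.5))

print("OLS coefficient spread:        ", ols_spread.round(2))
print("Elastic Net coefficient spread:", enet_spread.round(2))
```

In this toy setting the least-squares coefficients swing widely from resample to resample because the data cannot tell the two features apart, while the penalized fit keeps both coefficients close to a shared value.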
Libraries Required
| Purpose | Library |
| --- | --- |
| Data handling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| ML workflow | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Metrics | mean_squared_error, r2_score |
Dataset
Pollution Abatement Costs and Expenditures (PACE) survey
Step-by-Step Code Implementation
1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
2. Load and explore data
df = pd.read_csv("pace_2005.csv")
# Example column names: ['Facility_ID','Industry_Sector','Region',
#   'Employment','Value_Added_USD','Pollution_Intensity',
#   'Abatement_Capital_USD','Abatement_Operating_USD','Year']
# Target: total compliance cost = capital + operating abatement spend
df['Compliance_Cost_USD'] = (df['Abatement_Capital_USD']
                             + df['Abatement_Operating_USD'])
print(df[['Industry_Sector','Compliance_Cost_USD']].head())
sns.histplot(df['Compliance_Cost_USD']/1e6, kde=True)
plt.xlabel('Compliance Cost (million USD)')
plt.title('Distribution of Environmental‑Compliance Cost')
plt.show()
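If the CSV file is not at hand, the remaining steps can still be tried on a synthetic stand-in. The sketch below invents a frame with the same example column names; the helper `make_toy_facilities`, the sector and region labels, and every numeric range are assumptions made purely for illustration and carry no information about the real survey.

```python
import numpy as np
import pandas as pd


def make_toy_facilities(n=500, seed=0):
    """Build a synthetic stand-in for the survey extract (NOT real PACE data)."""
    rng = np.random.default_rng(seed)
    employment = rng.integers(20, 5000, size=n)
    # Larger plants produce more value added, which builds in collinearity.
    value_added = employment * rng.normal(150_000, 20_000, size=n)
    intensity = rng.uniform(0.1, 5.0, size=n)
    capital = 200.0 * employment * intensity * rng.uniform(0.5, 1.5, size=n)
    operating = 600.0 * employment * intensity * rng.uniform(0.5, 1.5, size=n)
    return pd.DataFrame({
        "Facility_ID": np.arange(1, n + 1),
        "Industry_Sector": rng.choice(["Chemicals", "Paper", "Metals", "Food"], size=n),
        "Region": rng.choice(["Northeast", "Midwest", "South", "West"], size=n),
        "Employment": employment,
        "Value_Added_USD": value_added,
        "Pollution_Intensity": intensity,
        "Abatement_Capital_USD": capital,
        "Abatement_Operating_USD": operating,
        "Year": rng.integers(2000, 2006, size=n),
    })


df = make_toy_facilities()
df["Compliance_Cost_USD"] = df["Abatement_Capital_USD"] + df["Abatement_Operating_USD"]
print(df[["Industry_Sector", "Compliance_Cost_USD"]].head())
```

Because the toy costs are generated from employment and intensity, any model fitted to this frame only recovers the made-up relationship; it is useful for checking that the code runs, not for drawing conclusions.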
3. Define target and features
y = df['Compliance_Cost_USD'] # target
X = df[['Industry_Sector','Region','Employment',
        'Value_Added_USD','Pollution_Intensity','Year']]
cat_cols = ['Industry_Sector','Region']
num_cols = ['Employment','Value_Added_USD','Pollution_Intensity','Year']
4. Build an Elastic Net pipeline
- ColumnTransformer one‑hot‑encodes categorical fields (sector, region) and z‑scales numeric variables (employment, value‑added, intensity, year).
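- Because both steps sit inside the Pipeline, they are re‑fitted on the training portion of every cross‑validation fold, which keeps statistics from the validation rows out of the fit and avoids information leakage.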
preprocess = ColumnTransformer([
    # sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
    ('num', StandardScaler(), num_cols)
])
pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20_000, random_state=42))
])
5. Train/test split and hyper‑parameter tuning
- alpha controls global shrinkage; larger values reduce variance but increase bias.
- l1_ratio shifts the penalty between Ridge (0 = pure ℓ², good for multicollinearity) and Lasso (1 = pure ℓ¹, induces sparsity).
- An 18 × 9 grid (162 candidate models, 810 fits under 5‑fold CV) selects the combination with the lowest cross‑validated root‑mean‑squared error.
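How the two hyper-parameters enter the fit can be written out directly. The sketch below codes the objective as given in the scikit-learn ElasticNet documentation, (1 / 2n)·‖y − Xw‖² + alpha·l1_ratio·‖w‖₁ + 0.5·alpha·(1 − l1_ratio)·‖w‖², and checks on synthetic data that the fitted coefficients score lower than slightly nudged ones. The helper `enet_objective`, the data, and the settings are illustrative assumptions; the intercept is switched off to keep the formula short.

```python
import numpy as np
from sklearn.linear_model import ElasticNet


def enet_objective(w, X, y, alpha, l1_ratio):
    """Elastic Net objective as documented by scikit-learn (no intercept)."""
    n = X.shape[0]
    resid = y - X @ w
    loss = resid @ resid / (2 * n)                   # squared-error term
    l1 = alpha * l1_ratio * np.abs(w).sum()          # Lasso part
    l2 = 0.5 * alpha * (1 - l1_ratio) * (w @ w)      # Ridge part
    return loss + l1 + l2


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, 0.0, -2.0, 0.0, 0.5])        # made-up coefficients
y = X @ true_w + rng.normal(scale=0.5, size=100)

alpha, l1_ratio = 0.1, 0.5
model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                   tol=1e-10, max_iter=100_000).fit(X, y)
best = enet_objective(model.coef_, X, y, alpha, l1_ratio)

# Nudging any single coefficient away from the fitted value should cost more.
nudged = []
for j in range(X.shape[1]):
    for delta in (-0.05, 0.05):
        w = model.coef_.copy()
        w[j] += delta
        nudged.append(enet_objective(w, X, y, alpha, l1_ratio))

print(f"fitted objective: {best:.4f} | smallest nudged objective: {min(nudged):.4f}")
```

Raising `alpha` scales both penalty terms together, while `l1_ratio` only changes how that total is divided between the absolute-value and squared terms.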
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=df['Industry_Sector'])
param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),   # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}
gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)
print("Best alpha:", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
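The size of the search can be checked before launching it. This short sketch rebuilds the same grid and counts the candidate combinations with scikit-learn's `ParameterGrid`; the variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "enet__alpha": np.logspace(-3, 1, 18),      # 18 values from 0.001 to 10
    "enet__l1_ratio": np.linspace(0.1, 0.9, 9),  # 9 values from 0.1 to 0.9
}

n_candidates = len(ParameterGrid(param_grid))   # every alpha paired with every l1_ratio
n_fits = n_candidates * 5                       # one fit per candidate per CV fold

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

After fitting, `pd.DataFrame(gs.cv_results_).sort_values('rank_test_score')` lists every candidate with its mean score, which helps to see whether the best values sit at an edge of the grid and the range should be widened.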
6. Evaluate Model
y_pred = gs.predict(X_test)
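# Take the square root explicitly; the squared=False argument is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))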
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
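For readers who want to see what the two metrics measure, the sketch below computes them by hand on four made-up values and adds a naive baseline that always predicts the mean. The numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])   # hypothetical costs (thousand USD)
y_pred = np.array([110.0, 190.0, 320.0, 380.0])   # hypothetical predictions

# RMSE: typical size of an error, in the same units as the target.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R²: share of the target's variance explained, relative to predicting the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Baseline: RMSE of a model that always predicts the mean.
baseline_rmse = np.sqrt(np.mean((y_true - y_true.mean()) ** 2))

print(f"RMSE: {rmse:.2f} | baseline RMSE: {baseline_rmse:.2f} | R²: {r2:.3f}")
# prints: RMSE: 15.81 | baseline RMSE: 111.80 | R²: 0.980
```

Comparing the hold-out RMSE with the mean-only baseline gives the dollar figure some context, since compliance costs are often heavily skewed and a few large facilities can dominate the error.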
7. Interpret influential features
A coefficient analysis might show, for example, that every extra 100 employees adds roughly $80k to compliance spending, that high‑intensity regions carry premiums of about $1.2M, and that each additional $10M in value added raises costs by about $150k. These figures are illustrative; the fitted values are what procurement and EHS managers would use for benchmarking.
ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
feature_names = np.hstack([ohe.get_feature_names_out(cat_cols), num_cols])
# De‑scale numeric coefficients for real‑world units
scales = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
# Copy first so the fitted model's coef_ array is not modified in place
coef = gs.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scales
(pd.Series(coef, index=feature_names)
   .sort_values(key=abs, ascending=False)
   .head(15)
   .plot(kind='barh', figsize=(9,5)))
plt.gca().invert_yaxis()
plt.xlabel('Δ Compliance Cost (USD)')
plt.title('Elastic Net – Key Cost Drivers')
plt.tight_layout()
plt.show()
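The de-scaling step rests on a simple identity: a coefficient fitted on a z-scored feature, divided by that feature's standard deviation, equals the effect per raw unit. The sketch below checks this on synthetic data with an unpenalized straight-line fit, where the identity is exact; the $800-per-employee slope and the noise level are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: cost rises by about $800 per employee, plus noise.
employment = rng.uniform(50, 5000, size=300)
cost = 800.0 * employment + rng.normal(scale=1e5, size=300)

# Standardize the way StandardScaler does (population standard deviation).
mu, sd = employment.mean(), employment.std()
z = (employment - mu) / sd

slope_z = np.polyfit(z, cost, 1)[0]             # USD per one standard deviation
slope_raw = np.polyfit(employment, cost, 1)[0]  # USD per employee

per_employee = slope_z / sd                     # de-scaled coefficient
per_100_employees = per_employee * 100          # restated for a 100-employee change

print(f"per employee: ${per_employee:,.0f} | per 100 employees: ${per_100_employees:,.0f}")
```

With an Elastic Net fit the same division converts units, but the coefficients themselves remain shrunk by the penalty, so they are best read as conservative estimates of each effect.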
Summary
With fewer than 100 lines of Python, this Elastic Net workflow:
- Predicts environmental‑compliance cost early, giving finance teams defensible numbers.
- Handles multicollinearity and achieves sparsity, maintaining stability while highlighting the most critical features.
- Quantifies dollar impacts of scale, sector, pollution intensity, and geography, guiding strategic investment in abatement technologies and regulatory negotiations.