Factory Emission Trend Prediction with Polynomial Regression in ML
Environmental planners and industrial compliance teams need to forecast annual CO₂ emissions (million metric tons) from the Industry sector, using time and correlated sectoral emission trends, to guide mitigation strategies and investment in cleaner technologies. Historical data show that industrial emissions are driven not only by time (reflecting technology shifts and regulation) but also by patterns in related sectors such as Energy and Transport. These relationships are nonlinear and may involve plateauing or acceleration phases. A plain linear model underfits these dynamics, while an unregularised high-degree polynomial fits the noise. By fitting a Polynomial Regression on engineered multi-sector features with Ridge (ℓ2) regularisation, we can learn a smooth, interpretable emission-trend model that is more likely to generalise to future years.
Dataset
Step-by-Step Code Implementation
1. Libraries Required
import pandas as pd              # data loading & handling
import numpy as np               # numerical operations
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # enhanced visualization
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
2. Load Data & Inspect
import pandas as pd
df = pd.read_csv("data/co2_emissions_by_sector.csv")
# Keep only the relevant columns and drop missing
df = df[['Year','Industry','Energy','Transport']].dropna()
df.head()
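If the CSV is not at hand, a synthetic stand-in with the same column names lets the remaining steps run end to end. The sketch below is purely illustrative: the function name make_synthetic_emissions, the curve shapes and the noise levels are all assumptions, and the numbers are not real emissions data.

```python
import numpy as np
import pandas as pd

def make_synthetic_emissions(n_years=60, start_year=1960, seed=0):
    """Build a made-up frame with the tutorial's column names (not real data)."""
    rng = np.random.default_rng(seed)
    year = np.arange(start_year, start_year + n_years)
    t = (year - year.min()) / (year.max() - year.min())  # time rescaled to 0..1
    energy = 200 + 600 * t + rng.normal(0, 10, n_years)
    transport = 100 + 300 * t**2 + rng.normal(0, 8, n_years)
    # Industry rises and then flattens, with a mild Energy x Transport interaction
    industry = (150 + 400 * t - 250 * t**2
                + 1e-4 * energy * transport
                + rng.normal(0, 8, n_years))
    return pd.DataFrame({
        "Year": year,
        "Industry": industry,
        "Energy": energy,
        "Transport": transport,
    })

df_demo = make_synthetic_emissions()
print(df_demo.head())
```

Assigning the result to df instead of df_demo would let the later steps run unchanged, though any scores obtained this way say nothing about real sector data.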
3. Exploratory Data Analysis
- Year captures time trends (technology, regulation).
- Energy & Transport emissions reflect correlated industrial activity.
import seaborn as sns
import matplotlib.pyplot as plt
# Plot Industry emissions over time
sns.lineplot(x='Year', y='Industry', data=df)
plt.title("Industry Sector CO₂ Emissions Over Time")
plt.ylabel("Emissions (Mt CO₂)")
plt.show()
# Check pairwise trends
sns.pairplot(df, vars=['Industry','Energy','Transport'], kind='scatter', diag_kind='kde')
plt.suptitle("Sector Emission Relationships", y=1.02)
plt.show()
4. Define Features & Target
The polynomial step of the pipeline built in the next section expands these inputs into powers and interactions (e.g. Year², Energy×Transport) to model curvature and cross-sector interaction in emission trends.
# Use Year, Energy, and Transport emissions to predict Industry emissions
X = df[['Year','Energy','Transport']]
y = df['Industry']
5. Build a Polynomial Regression Pipeline
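StandardScaler standardises each input to zero mean and unit variance before the polynomial expansion, so the Ridge penalty is not dominated by features with large raw values such as Year.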
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipe = Pipeline([
('scale', StandardScaler()),
('poly', PolynomialFeatures(include_bias=False)),
('ridge', Ridge(random_state=42))
])
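The effect of scaling before expansion can be checked with a small experiment. The sketch below uses made-up inputs (calendar years plus two normally distributed columns standing in for Energy and Transport) and compares the largest expanded value with and without the scaler. The variable names are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# Made-up inputs: Year in calendar units, two sector columns in Mt CO2
X_demo = np.column_stack([
    np.arange(1990, 2020),
    rng.normal(500, 50, 30),
    rng.normal(200, 20, 30),
])

# Expansion on raw inputs: Year^2 alone is in the millions
raw = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_demo)

# Expansion after standardising: all terms stay within a few units of zero
scaled = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
).fit_transform(X_demo)

print("largest |value| without scaling:", np.abs(raw).max())
print("largest |value| with scaling   :", np.abs(scaled).max())
```

The expanded columns are not themselves rescaled to unit variance, so the penalty is only approximately even across terms. That is usually acceptable here, but it is worth keeping in mind when comparing coefficient sizes.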
6. Train/Test Split & Hyperparameter Search
- Tunes the polynomial degree (1–3) and regularisation strength (10⁻³–10³) via 5-fold CV on the training set, selecting the combination with the lowest average RMSE across validation folds.
- Applies an ℓ2 penalty (controlled by alpha) to shrink noisy high-order coefficients, mitigating overfitting.
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
'poly__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best polynomial degree:", gs.best_params_['poly__degree'])
print("Best Ridge alpha :", gs.best_params_['ridge__alpha'])
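Because the stated goal is to forecast future years, one alternative worth considering is to keep validation data strictly later than training data. The sketch below is a variant under stated assumptions, not the method used above: it replaces the random split with a chronological one and uses scikit-learn's TimeSeriesSplit for the folds. The data frame is synthetic, and the names demo, pipe_ts and gs_ts are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in with the tutorial's column names (not real data)
rng = np.random.default_rng(0)
years = np.arange(1960, 2020)
t = (years - years.min()) / (years.max() - years.min())
demo = pd.DataFrame({
    "Year": years,
    "Energy": 200 + 600 * t + rng.normal(0, 10, years.size),
    "Transport": 100 + 300 * t**2 + rng.normal(0, 8, years.size),
})
demo["Industry"] = 150 + 400 * t - 250 * t**2 + rng.normal(0, 8, years.size)
demo = demo.sort_values("Year")

# Hold out the most recent 20% of years rather than a random 20%
cut = int(len(demo) * 0.8)
train, test = demo.iloc[:cut], demo.iloc[cut:]
features = ["Year", "Energy", "Transport"]

pipe_ts = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])

# Each fold trains on earlier years and validates on the years that follow
gs_ts = GridSearchCV(
    pipe_ts,
    {"poly__degree": [1, 2, 3], "ridge__alpha": np.logspace(-3, 3, 7)},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
gs_ts.fit(train[features], train["Industry"])
print(gs_ts.best_params_)
```

A chronological hold-out tends to give a more sobering error estimate than a random split, since polynomial terms can behave poorly outside the range of years they were fitted on.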
7. Evaluate Model
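from sklearn.metrics import mean_squared_error, r2_score
y_pred = gs.predict(X_test)
# Take the square root explicitly: the squared=False argument is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))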
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f} Mt CO₂")
print(f"Test R² : {r2:.3f}")
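An RMSE figure is easier to judge next to a naive baseline. The sketch below computes RMSE by hand and compares a set of predictions with a baseline that always predicts the training mean. All numbers are made up for illustration and the helper name rmse is an assumption.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values in Mt CO2, standing in for real targets and predictions
y_train_demo = np.array([100.0, 110.0, 120.0, 130.0])
y_test_demo = np.array([140.0, 150.0])
y_pred_demo = np.array([138.0, 153.0])

# Baseline: always predict the mean of the training targets
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())

model_rmse = rmse(y_test_demo, y_pred_demo)
baseline_rmse = rmse(y_test_demo, baseline_pred)
print(f"Model RMSE   : {model_rmse:.2f} Mt CO2")
print(f"Baseline RMSE: {baseline_rmse:.2f} Mt CO2")
```

If the tuned model does not clearly beat this kind of baseline on the held-out years, the polynomial terms are probably not adding predictive value.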
8. Inspect Key Polynomial Coefficients
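The largest-magnitude coefficients (for example Energy² or Year×Transport) indicate which nonlinear and interaction terms the fitted model relies on most. Because the inputs are standardised before expansion, the magnitudes are roughly comparable, but correlated sectors can trade weight between terms, so treat the ranking as a guide to model behaviour rather than proof of what causes industrial emissions.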
# Retrieve feature names after polynomial expansion
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=['Year','Energy','Transport'])
coefs = gs.best_estimator_.named_steps['ridge'].coef_
import pandas as pd
imp = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False).head(10)
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
imp.plot(kind='barh')
plt.xlabel("Coefficient Magnitude")
plt.title("Top Polynomial Features Influencing Industry Emissions")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Summary
This Polynomial Regression pipeline with Ridge regularisation provides a robust, interpretable framework to forecast industry CO₂ emissions:
- Captures nonlinear trends over time and cross-sector interactions while limiting overfitting through the Ridge penalty.
- Balances complexity and generalisation via hyperparameter tuning of degree and α.
- Yields actionable insights: key polynomial features inform which sectoral and temporal factors most influence industrial emissions, supporting targeted mitigation strategies.