Factory Emission Trend Prediction with Polynomial Regression in ML


Environmental planners and industrial compliance teams need to forecast annual CO₂ emissions (million metric tons) from the Industry sector, using time and correlated emission trends in other sectors, to guide mitigation strategies and investment in cleaner technologies. Historical data show that industrial emissions are driven not only by time (reflecting technology shifts and regulation) but also by patterns in related sectors such as Energy and Transport. These relationships are nonlinear and may involve plateauing or acceleration phases. A plain linear model underfits these dynamics, while a naïve high‑degree polynomial overfits noise. By fitting a Polynomial Regression on engineered multi‑sector features with Ridge (ℓ₂) regularisation, we can learn a smooth, interpretable emission‑trend model that generalises better to future years.

Dataset

CO₂ Emissions by Sector

Step-by-Step Code Implementation

1. Libraries Required

import pandas as pd                       # data loading & handling  
import numpy as np                        # numerical operations  

import matplotlib.pyplot as plt           # plotting  
import seaborn as sns                     # enhanced visualization  

from sklearn.model_selection import train_test_split, GridSearchCV  
from sklearn.preprocessing import StandardScaler, PolynomialFeatures  
from sklearn.linear_model import Ridge  
from sklearn.pipeline import Pipeline  
from sklearn.metrics import mean_squared_error, r2_score

2. Load Data & Inspect

import pandas as pd

df = pd.read_csv("data/co2_emissions_by_sector.csv")
# Keep only the relevant columns and drop missing
df = df[['Year','Industry','Energy','Transport']].dropna()
df.head()
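
Before modelling, it can help to confirm that the loaded frame looks as expected. The sketch below does not read the real CSV; it uses a small made-up frame (the values are invented for illustration) with the same column names assumed above, applies the same cleaning, sorts chronologically, and checks for duplicate years and non-numeric columns.

```python
import pandas as pd

# Hypothetical stand-in for the CSV; values are invented for illustration only
demo = pd.DataFrame({
    "Year": [2002, 2000, 2001, 2003],
    "Industry": [5.4, 5.0, 5.2, None],   # one missing value to show dropna()
    "Energy": [12.9, 12.0, 12.5, 13.1],
    "Transport": [6.3, 6.0, 6.1, 6.4],
})

# Same cleaning as above, plus a chronological sort
demo = demo[["Year", "Industry", "Energy", "Transport"]].dropna()
demo = demo.sort_values("Year").reset_index(drop=True)

# Expect one row per year and numeric columns throughout
has_duplicate_years = bool(demo["Year"].duplicated().any())
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in demo.dtypes)

print(demo)
print("Duplicate years:", has_duplicate_years)
print("All columns numeric:", all_numeric)
```

If either check fails on the real file, it is usually worth fixing the data before moving on to the plots.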

3. Exploratory Data Analysis

  • Year captures time trends (technology, regulation).
  • Energy & Transport emissions reflect correlated industrial activity.

import seaborn as sns
import matplotlib.pyplot as plt

# Plot Industry emissions over time
sns.lineplot(x='Year', y='Industry', data=df)
plt.title("Industry Sector CO₂ Emissions Over Time")
plt.ylabel("Emissions (Mt CO₂)")
plt.show()

# Check pairwise trends
sns.pairplot(df, vars=['Industry','Energy','Transport'], kind='scatter', diag_kind='kde')
plt.suptitle("Sector Emission Relationships", y=1.02)
plt.show()

4. Define Features & Target

The three inputs selected here are later expanded by PolynomialFeatures (step 5) into powers and interactions (e.g. Year², Energy×Transport) to model curvature and synergy in emission trends.

# Use Year, Energy, and Transport emissions to predict Industry emissions
X = df[['Year','Energy','Transport']]
y = df['Industry']

5. Build a Polynomial Regression Pipeline

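StandardScaler standardises each feature to zero mean and unit variance before the polynomial expansion, so the Ridge penalty is not dominated by differences in the original units.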

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

pipe = Pipeline([
    ('scale', StandardScaler()),  
    ('poly', PolynomialFeatures(include_bias=False)),  
    ('ridge', Ridge(random_state=42))
])
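
One way to see why the scaler sits before the expansion is to compare Year² with and without it. This is an illustrative sketch on a made-up range of years, not the dataset: raw years produce squared values in the millions that are almost perfectly correlated with Year itself, while standardised years give a squared term of order one that is far less collinear.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Ten consecutive years, chosen arbitrarily for illustration
years = np.arange(2000, 2010, dtype=float).reshape(-1, 1)

# Columns of each result are [Year, Year^2]
raw_sq = PolynomialFeatures(degree=2, include_bias=False).fit_transform(years)
scaled = StandardScaler().fit_transform(years)
scaled_sq = PolynomialFeatures(degree=2, include_bias=False).fit_transform(scaled)

# Correlation between the linear and squared column in each case
corr_raw = np.corrcoef(raw_sq[:, 0], raw_sq[:, 1])[0, 1]
corr_scaled = np.corrcoef(scaled_sq[:, 0], scaled_sq[:, 1])[0, 1]

print("Largest raw Year^2   :", raw_sq[:, 1].max())
print("Largest scaled Year^2:", scaled_sq[:, 1].max())
print("corr(Year, Year^2) raw   :", corr_raw)
print("corr(Year, Year^2) scaled:", corr_scaled)
```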

6. Train/Test Split & Hyperparameter Search

  • Tunes the polynomial degree (1–3) and regularisation strength (10⁻³–10³) via 5‑fold CV, selecting the combination with the lowest RMSE on the validation folds.
  • Applies an ℓ₂ penalty (controlled by alpha) to shrink noisy high‑order coefficients, mitigating overfitting.

from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7)
}

gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print("Best polynomial degree:", gs.best_params_['poly__degree'])
print("Best Ridge alpha      :", gs.best_params_['ridge__alpha'])
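
Because the observations are yearly, a random split lets the model train on later years while being validated on earlier ones. One alternative, which is not what the code above does, is to hold out the most recent years and use scikit-learn's TimeSeriesSplit for cross-validation. The sketch below shows the idea on synthetic data invented for illustration; the names `demo`, `pipe_ts`, and `gs_ts` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic yearly data, invented purely for illustration
rng = np.random.default_rng(0)
years = np.arange(1990, 2020)
demo = pd.DataFrame({
    "Year": years,
    "Energy": 10 + 0.20 * (years - 1990) + rng.normal(0, 0.2, years.size),
    "Transport": 5 + 0.10 * (years - 1990) + rng.normal(0, 0.1, years.size),
})
demo["Industry"] = (2 + 0.3 * demo["Energy"] + 0.02 * demo["Transport"] ** 2
                    + rng.normal(0, 0.1, years.size))

X_demo = demo[["Year", "Energy", "Transport"]]
y_demo = demo["Industry"]

# Chronological hold-out: the last 20% of years form the test set
n_train = int(len(demo) * 0.8)
X_tr, X_te = X_demo.iloc[:n_train], X_demo.iloc[n_train:]
y_tr, y_te = y_demo.iloc[:n_train], y_demo.iloc[n_train:]

pipe_ts = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])

# Each fold trains on earlier years and validates on the years that follow
gs_ts = GridSearchCV(
    pipe_ts,
    {"poly__degree": [1, 2, 3], "ridge__alpha": np.logspace(-3, 3, 7)},
    cv=TimeSeriesSplit(n_splits=4),
    scoring="neg_root_mean_squared_error",
)
gs_ts.fit(X_tr, y_tr)
print(gs_ts.best_params_)
```

Scores from a chronological split are often less flattering than those from a shuffled one, but they are closer to how a forecast would actually be used.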

7. Evaluate Model

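import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = gs.predict(X_test)
# The squared=False argument was deprecated in scikit-learn 1.4 and dropped in later
# releases, so take the square root of the MSE explicitly
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)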

print(f"Test RMSE: {rmse:.2f} Mt CO₂")
print(f"Test R²  : {r2:.3f}")
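
A test RMSE is easier to judge next to simple baselines. The following is a hedged sketch on synthetic data (invented for illustration, not the emissions dataset) that compares a mean-only predictor, a plain linear model, and a degree-2 polynomial with Ridge. The fixed `degree=2` and `alpha=1.0` are assumptions made for the example; in the tutorial they come from the grid search.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data, invented for illustration only
rng = np.random.default_rng(42)
t = np.arange(40, dtype=float)  # years since the start of the series
energy = 10 + 0.20 * t + rng.normal(0, 0.2, t.size)
transport = 5 + 0.10 * t + rng.normal(0, 0.1, t.size)
industry = 2 + 0.3 * energy + 0.02 * transport**2 + rng.normal(0, 0.1, t.size)

X_demo = np.column_stack([1980 + t, energy, transport])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, industry, test_size=0.2, random_state=42
)

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear": Pipeline([("scale", StandardScaler()), ("lin", LinearRegression())]),
    "poly2 + ridge": Pipeline([
        ("scale", StandardScaler()),
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
        ("ridge", Ridge(alpha=1.0)),
    ]),
}

# Fit each model on the same split and record its test RMSE
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
    print(f"{name:>14}: RMSE = {scores[name]:.3f}")
```

If the tuned polynomial model does not clearly beat the linear baseline on the real data, the extra terms may not be earning their keep.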

8. Inspect Key Polynomial Coefficients

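The largest‑magnitude coefficients (for example Energy² or Year×Transport, depending on the fit) show which nonlinear and interaction terms the model relies on most. Because the inputs are standardised before expansion, the magnitudes are roughly comparable, but correlated terms share weight between them, so treat the ranking as a guide for policy discussion rather than a causal statement.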

# Retrieve feature names after polynomial expansion
poly   = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=['Year','Energy','Transport'])
coefs = gs.best_estimator_.named_steps['ridge'].coef_

import pandas as pd
imp = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False).head(10)

import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
imp.plot(kind='barh')
plt.xlabel("Coefficient Magnitude")
plt.title("Top Polynomial Features Influencing Industry Emissions")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

Summary

This Polynomial Regression pipeline with Ridge regularisation provides a robust, interpretable framework to forecast industry CO₂ emissions:

  1. Captures nonlinear trends over time and cross‑sector interactions, with the Ridge penalty limiting overfitting.
  2. Balances complexity and generalisation via hyperparameter tuning of degree and α.
  3. Yields actionable insights: key polynomial features inform which sectoral and temporal factors most influence industrial emissions, supporting targeted mitigation strategies.
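
Since the model takes Energy and Transport as inputs, forecasting a future year also requires values for those sectors. A minimal sketch, assuming scenario values are supplied by the analyst: the history below is synthetic, the 2025 scenario numbers are hypothetical, and `degree=2` and `alpha=1.0` stand in for whatever the grid search selects.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic history, invented for illustration only
rng = np.random.default_rng(1)
years = np.arange(1990, 2021)
hist = pd.DataFrame({
    "Year": years,
    "Energy": 10 + 0.20 * (years - 1990) + rng.normal(0, 0.2, years.size),
    "Transport": 5 + 0.10 * (years - 1990) + rng.normal(0, 0.1, years.size),
})
hist["Industry"] = 2 + 0.3 * hist["Energy"] + 0.02 * hist["Transport"] ** 2

features = ["Year", "Energy", "Transport"]

# Degree and alpha are fixed here as assumptions for the example
model = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("ridge", Ridge(alpha=1.0)),
])
model.fit(hist[features], hist["Industry"])

# Hypothetical 2025 scenarios: the sector values are assumptions, not forecasts
scenarios = pd.DataFrame(
    {"Year": [2025, 2025], "Energy": [17.0, 15.5], "Transport": [8.5, 8.0]},
    index=["business_as_usual", "lower_energy"],
)
scenarios["Industry_pred"] = model.predict(scenarios[features])
print(scenarios)
```

Polynomial terms can grow quickly outside the range of the training data, so predictions far beyond the last observed year deserve extra caution.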


ProjectGurukul Team

ProjectGurukul Team specializes in creating project-based learning resources for programming, Java, Python, Android, AI, web development and machine learning. Our mission is to help learners build practical skills through engaging, hands-on projects. We also offer free major and minor projects with source code for engineering students.
