Air Pollution Cost Prediction using Stepwise Regression in ML


Air pollution harms public health and the environment, and it also raises spending on healthcare, lost productivity, and energy (for example, HVAC usage). In this project, we predict daily electricity cost, used as a proxy for pollution‑related energy demand, from air quality indicators (PM₂.₅, PM₁₀, AQI), meteorological variables (temperature, humidity), and a temporal feature (weekday/weekend). Stepwise regression isolates the key drivers of cost and yields a concise linear model that balances interpretability with predictive accuracy, helping policymakers and utilities understand and anticipate pollution‑induced cost fluctuations.

Libraries Required

import pandas as pd               # Data manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Statistical modeling  
from sklearn.model_selection import train_test_split   # Data splitting  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation  
import matplotlib.pyplot as plt   # Visualization

Dataset

Electricity Cost Prediction Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We read daily records containing electricity_cost, air quality measures (AQI, PM2.5, PM10), and weather variables. Initial exploration (.info(), .describe()) reveals the data types and basic statistics.

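# Block 1: Load dataset
# Kaggle download links require a logged-in session, so pd.read_csv(url) will not
# return the CSV. Download the file manually from the link below, then read the local copy.
url = "https://www.kaggle.com/datasets/shalmamuji/electricity-cost-prediction-dataset/download"
csv_path = "electricity_cost.csv"  # placeholder: path to your downloaded CSV
df = pd.read_csv(csv_path)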

# Inspect data
print(df.head())
df.info()  # info() prints its report itself and returns None
print(df.describe())

The dataset includes daily records of electricity cost alongside AQI, PM₂.₅, PM₁₀, temperature, and humidity.

Data Preprocessing

We parse the date column to flag weekends (when electricity patterns differ), drop the original timestamp, and remove rows with missing values. Features (X) are all remaining columns (pollution, weather, and the weekend flag); target (y) is electricity_cost. All feature columns must be numeric for the OLS fits that follow. We then perform an 80/20 train‑test split.

# Block 2: Feature engineering
# Convert date column to datetime and extract weekday/weekend flag
df['date'] = pd.to_datetime(df['date'])
df['is_weekend'] = df['date'].dt.weekday.isin([5,6]).astype(int)

# Drop original date column and any NA rows
df = df.drop(columns=['date']).dropna()

# Define predictors and target
X = df.drop('electricity_cost', axis=1)
y = df['electricity_cost']

# Train–test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
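
As a quick sanity check on the weekend flag, the sketch below applies the same dt.weekday rule to three hand-picked dates. The frame named check and its dates are illustrative only and are not part of the dataset.

```python
import pandas as pd

# Illustrative dates (not from the dataset): a Friday, a Saturday, and a Sunday
check = pd.DataFrame({"date": ["2024-01-05", "2024-01-06", "2024-01-07"]})
check["date"] = pd.to_datetime(check["date"])

# pandas numbers Monday as 0, so Saturday is 5 and Sunday is 6
check["is_weekend"] = check["date"].dt.weekday.isin([5, 6]).astype(int)
weekend_flags = check["is_weekend"].tolist()
print(weekend_flags)
```

Only the Saturday and Sunday rows should be flagged with 1.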

Stepwise Regression Function

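The stepwise_selection function alternates between two steps. The forward step adds the excluded predictor with the lowest p‑value, provided it is below threshold_in (0.01). The backward step drops the included predictor with the highest p‑value if it exceeds threshold_out (0.05). The loop stops when neither step changes the feature set. The result is a greedy selection, not a guaranteed optimum, and p‑values computed after selection tend to be optimistic.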

# Block 3: Forward–backward stepwise selection
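def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Start from the given features, or from an empty model if none are supplied
    included = list(initial_list) if initial_list is not None else []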
    while True:
        changed = False

        # Forward step: test adding each excluded variable
        # Keep the original column order so runs are reproducible
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.6f}")

        # Backward step: test removing each included variable
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.drop("const", errors="ignore")  # exclude intercept by name
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.6f}")

        if not changed:
            break

    return included

Model Building & Evaluation

Model Fitting: Using the selected features, we fit an Ordinary Least Squares regression via statsmodels. The summary provides coefficient estimates, p‑values, R², and F‑statistics, enabling interpretation of each predictor’s impact.
Evaluation: We generate predictions on the held‑out test set and compute R² (variance explained) and RMSE (root-mean-square error) to gauge model performance.

# Block 4: Select features
selected_features = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_features])
y_pred = model.predict(X_test_sel)

# Performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
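
For readers who want to see what the two metrics measure, the sketch below recomputes them from their definitions with NumPy on four made-up values (y_true and y_hat are illustrative, not dataset values). R² is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual.

```python
import numpy as np

# Made-up actual and predicted values for illustration
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 12.0, 13.0, 17.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean

r2_manual = 1.0 - ss_res / ss_tot
rmse_manual = np.sqrt(ss_res / len(y_true))
print(r2_manual, rmse_manual)
```

Here the residual sum of squares is 3 and the total sum of squares is 20, giving R² = 0.85 and RMSE ≈ 0.866, in the same units as the target.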

Residual Diagnostics

Plotting residuals versus predicted costs can detect non‑random patterns or heteroscedasticity, thereby validating linear regression assumptions.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Electricity Cost")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()

Summary

By applying stepwise regression, we distil the most significant drivers of pollution‑related electricity cost into a compact linear model. Which predictors survive (for example AQI, PM₂.₅, temperature, or the weekend flag) depends on the data, so read them from the selection log and the model summary. How well the model balances interpretability and predictive accuracy should be judged from the test‑set R² and RMSE printed in Block 4.

Utilities and policymakers can use these insights to anticipate cost fluctuations related to air quality, plan demand response, and design interventions for peak-pollution events.
