Biomass Energy Cost Prediction using Stepwise Regression in ML

Biomass‐based power plants and bioenergy facilities incur varying generation costs depending on feedstock price, plant capacity, conversion efficiency, and regional socio‑economic factors. Accurate cost forecasts—measured in USD per MWh—help operators negotiate feedstock contracts, optimize plant dispatch, and assess project viability.

In the Biomass Energy Cost Prediction ML project, we will predict the generation cost per MWh of biomass energy using features such as feedstock price, plant capacity, thermal efficiency, capacity factor, and local labor cost index.

We’ll also identify the most influential cost drivers and, using stepwise regression, build an interpretable linear model that helps stakeholders make informed investment decisions.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization  

Dataset

Global Renewable Energy and Indicators Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load a comprehensive renewables dataset that includes biomass generation costs and relevant plant‐ and region‐level indicators.

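# Block 1: Load dataset
# We’ll use a global renewables dataset that includes biomass cost and indicators.
# Kaggle downloads require authentication, so pd.read_csv cannot read the download URL directly.
# Download the CSV manually from:
# https://www.kaggle.com/datasets/anishvijay/global-renewable-energy-and-indicators-dataset
# then point csv_path at the downloaded file (the file name below is a placeholder).
csv_path = "global_renewable_energy_dataset.csv"
df = pd.read_csv(csv_path)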

# Inspect
print(df.head())
df.info()  # prints its own report and returns None, so no print() is needed
print(df.describe())
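
The preprocessing step assumes specific column names, so it can help to confirm they exist right after loading. A minimal sketch, using a hypothetical helper `missing_columns` (not part of pandas or of this tutorial's code) and a made-up toy frame standing in for the real dataset:

```python
import pandas as pd

# Column names this tutorial assumes the dataset contains
REQUIRED_COLUMNS = [
    "Biomass_Generation_Cost_USD_per_MWh",
    "Feedstock_Price_USD_per_tonne",
    "Plant_Capacity_MW",
    "Thermal_Efficiency_pct",
    "Capacity_Factor_pct",
    "Labor_Cost_Index",
    "Region",
]

def missing_columns(frame, required):
    """Hypothetical helper: return the required column names absent from frame."""
    return [col for col in required if col not in frame.columns]

# Toy frame standing in for the real dataset (values are made up)
toy_df = pd.DataFrame({
    "Biomass_Generation_Cost_USD_per_MWh": [85.0],
    "Feedstock_Price_USD_per_tonne": [40.0],
    "Region": ["Asia"],
})

absent = missing_columns(toy_df, REQUIRED_COLUMNS)
print("Missing columns:", absent)
```

On the real data, you would call `missing_columns(df, REQUIRED_COLUMNS)` and rename or derive columns until the list comes back empty.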

Data Preprocessing

We drop records with missing values in the key columns, one‑hot encode the categorical Region variable, and define our features (X) and target (y).

We then hold out 20% of the records as a test set (an 80/20 train/test split).

# Block 2: Clean & select relevant columns
# Assume the dataset contains columns: 'Biomass_Generation_Cost_USD_per_MWh',
# 'Feedstock_Price_USD_per_tonne', 'Plant_Capacity_MW', 'Thermal_Efficiency_pct',
# 'Capacity_Factor_pct', 'Labor_Cost_Index', 'Region', 'Year'

# Drop rows with missing key data
df = df.dropna(subset=[
    'Biomass_Generation_Cost_USD_per_MWh',
    'Feedstock_Price_USD_per_tonne',
    'Plant_Capacity_MW',
    'Thermal_Efficiency_pct',
    'Capacity_Factor_pct',
    'Labor_Cost_Index'
])

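# One‑hot encode region (dtype=float keeps the dummy columns numeric for statsmodels)
df_enc = pd.get_dummies(df, columns=['Region'], drop_first=True, dtype=float)

# Define predictors and target (keep numeric columns only, since OLS cannot use text columns)
X = df_enc.drop(columns='Biomass_Generation_Cost_USD_per_MWh').select_dtypes(include='number')
y = df_enc['Biomass_Generation_Cost_USD_per_MWh']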

# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
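
To make the one‑hot encoding step concrete, here is a small sketch on a made-up toy frame (the region names and capacities are illustrative, not taken from the dataset). With `drop_first=True`, the alphabetically first category becomes the baseline and gets no column of its own:

```python
import pandas as pd

# Toy data: three regions, values are made up for illustration
toy = pd.DataFrame({
    "Region": ["Asia", "Europe", "Africa", "Asia"],
    "Plant_Capacity_MW": [20.0, 35.0, 12.0, 50.0],
})

# drop_first=True drops the first category ("Africa"), which becomes the baseline
encoded = pd.get_dummies(toy, columns=["Region"], drop_first=True, dtype=float)

print(encoded.columns.tolist())
print(encoded)
```

The "Africa" row has zeros in both dummy columns, so each region coefficient in the later regression is read as a cost difference relative to that baseline region.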

Stepwise Regression Function

Our custom function repeatedly adds the excluded predictor with the lowest p‑value (if it is below 0.01) and removes the included predictor with the highest p‑value (if it is above 0.05). It stops when a full pass makes no change, yielding a parsimonious set of cost drivers.

# Block 3: Forward–backward stepwise selection
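def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Avoid a mutable default argument; start from an empty model unless told otherwise
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False

        # Forward step: test each excluded feature (column order kept for reproducibility)
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)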
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.4f}")

        # Backward step: test each included feature (skipped while no feature is included)
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.iloc[1:]  # exclude intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_var = pvals.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:30} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

  • Using the selected features, we fit an Ordinary Least Squares regression via statsmodels.
  • The .summary() output reports coefficients, p‑values, R², adjusted R², and diagnostic metrics—offering transparent insight into each variable’s impact on cost.
  • On the held‑out test set, we compute R² (explained variance) and RMSE (prediction error) to quantify model performance and generalization.

# Block 4: Feature selection
selected = stepwise_selection(X_train, y_train)

# Fit final model
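X_train_sel = sm.add_constant(X_train[selected], has_constant='add')
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set
# has_constant='add' forces the intercept column even if a test-set feature happens to be constant
X_test_sel = sm.add_constant(X_test[selected], has_constant='add')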
y_pred = model.predict(X_test_sel)

# Performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
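
For readers who want to see what the two test metrics measure, here is a short sketch that computes them by hand and compares the results with scikit-learn. The cost values are made up for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Made-up observed and predicted costs (USD/MWh), for illustration only
y_true = np.array([52.0, 61.5, 48.0, 70.2, 55.3])
y_hat = np.array([50.5, 63.0, 49.2, 68.0, 57.0])

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE = square root of the mean squared error, in the target's own units
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print("R² (manual): ", r2_manual)
print("R² (sklearn):", r2_score(y_true, y_hat))
print("RMSE (manual): ", rmse_manual)
print("RMSE (sklearn):", np.sqrt(mean_squared_error(y_true, y_hat)))
```

Because RMSE is expressed in USD/MWh, it can be compared directly with typical generation costs, whereas R² is unitless.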

Residual Diagnostics

A residuals‑vs‑predicted plot helps reveal heteroscedasticity or other non‑random patterns that would violate OLS assumptions and undermine the model’s reliability.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost (USD/MWh)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Biomass Cost")
plt.show()

Summary

Applying stepwise regression to biomass energy data distills the key drivers of generation cost (likely candidates include feedstock price, capacity factor, and thermal efficiency) while trimming non‑informative variables.

The resulting linear model is easy to interpret because it keeps only a few statistically significant predictors. If the test‑set R² is high and the RMSE is low, it gives bioenergy project planners and operators a transparent forecasting tool to optimize costs and guide strategic investment.

ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.
