Biomass Energy Cost Prediction using Stepwise Regression in ML
Biomass‐based power plants and bioenergy facilities incur varying generation costs depending on feedstock price, plant capacity, conversion efficiency, and regional socio‑economic factors. Accurate cost forecasts—measured in USD per MWh—help operators negotiate feedstock contracts, optimize plant dispatch, and assess project viability.
In the Biomass Energy Cost Prediction ML project, we will predict the generation cost per MWh of biomass energy using features such as feedstock price, plant capacity, thermal efficiency, capacity factor, and local labor cost index.
We’ll also identify the most influential cost drivers and use stepwise regression to build an interpretable linear model that helps stakeholders make informed investment decisions.
Libraries Required
import pandas as pd  # Data loading & manipulation
import numpy as np  # Numerical operations
import statsmodels.api as sm  # OLS regression
from sklearn.model_selection import train_test_split  # Train/test split
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics
import matplotlib.pyplot as plt  # Visualization
Dataset
Global Renewable Energy and Indicators Dataset
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load a comprehensive renewables dataset that includes biomass generation costs and relevant plant-level and region-level indicators.
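# Block 1: Load dataset
# We'll use a global renewables dataset that includes biomass cost and indicators.
# Source: https://www.kaggle.com/datasets/anishvijay/global-renewable-energy-and-indicators-dataset/download
# Kaggle downloads require a login, so pd.read_csv cannot read that URL directly.
# Download the CSV manually and point csv_path at your local copy.
csv_path = "global_renewable_energy.csv"  # assumed local file name; adjust as needed
df = pd.read_csv(csv_path)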
# Inspect
print(df.head())
df.info()  # info() prints its report directly and returns None
print(df.describe())
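The later blocks assume specific column names, so it can help to confirm they exist right after loading. The snippet below is a minimal sketch: `missing_columns` is a hypothetical helper (not part of pandas), and the required names are the ones this tutorial assumes, which may differ from the actual file.

```python
# Column names this tutorial assumes; the real dataset may use different names.
REQUIRED_COLUMNS = [
    "Biomass_Generation_Cost_USD_per_MWh",
    "Feedstock_Price_USD_per_tonne",
    "Plant_Capacity_MW",
    "Thermal_Efficiency_pct",
    "Capacity_Factor_pct",
    "Labor_Cost_Index",
    "Region",
]


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`."""
    available = set(available)
    return [name for name in required if name not in available]


# Illustrative call with a made-up header; in the tutorial you would pass df.columns.
example_header = ["Region", "Year", "Plant_Capacity_MW"]
print(missing_columns(example_header))
```

If the returned list is not empty, rename the columns or adjust the names used in the following blocks before continuing.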
Data Preprocessing
We drop records with missing values in the key columns, one-hot encode the categorical Region variable, and define our features (X) and target (y).
An 80/20 train/test split then prepares the data for modeling.
# Block 2: Clean & select relevant columns
# Assume the dataset contains columns: 'Biomass_Generation_Cost_USD_per_MWh',
# 'Feedstock_Price_USD_per_tonne', 'Plant_Capacity_MW', 'Thermal_Efficiency_pct',
# 'Capacity_Factor_pct', 'Labor_Cost_Index', 'Region', 'Year'
# Drop rows with missing key data
df = df.dropna(subset=[
'Biomass_Generation_Cost_USD_per_MWh',
'Feedstock_Price_USD_per_tonne',
'Plant_Capacity_MW',
'Thermal_Efficiency_pct',
'Capacity_Factor_pct',
'Labor_Cost_Index'
])
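# One-hot encode region; dtype=float keeps the dummy columns numeric for statsmodels
df_enc = pd.get_dummies(df, columns=['Region'], drop_first=True, dtype=float)
# Define predictors and target; keep numeric columns only, since OLS cannot use text columns
X = df_enc.drop('Biomass_Generation_Cost_USD_per_MWh', axis=1).select_dtypes(include='number')
y = df_enc['Biomass_Generation_Cost_USD_per_MWh']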
# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
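To see what the one-hot encoding step produces, here is a small sketch on a toy frame. The values are made up for illustration; only the `pd.get_dummies` call mirrors the preprocessing described above.

```python
import pandas as pd

# Toy data with made-up values, only to show the shape of the encoding.
toy = pd.DataFrame({
    "Region": ["Asia", "Europe", "Africa", "Asia"],
    "Plant_Capacity_MW": [50.0, 20.0, 35.0, 80.0],
})

# drop_first=True drops the first category in sorted order ("Africa"),
# which becomes the baseline absorbed by the intercept.
encoded = pd.get_dummies(toy, columns=["Region"], drop_first=True, dtype=float)
print(encoded.columns.tolist())
```

Dropping one category avoids perfect collinearity between the full set of dummies and the intercept, so each remaining Region coefficient reads as a cost difference relative to the baseline region.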
Stepwise Regression Function
Our custom function iteratively adds the excluded predictor with the lowest p-value (if it is below 0.01) and removes the included predictor with the highest p-value (if it is above 0.05), repeating until no further change occurs. The result is a parsimonious set of cost drivers.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Use None instead of a mutable default argument
    included = list(initial_list) if initial_list else []
    while True:
        changed = False
        # Forward step: test each excluded feature (column order kept for reproducibility)
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add {best_var:30} p-value {best_pval:.4f}")
        # Backward step: test each included feature (skipped while none are included)
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.iloc[1:]  # exclude intercept
            worst_pval = pvals.max()
            if worst_pval > threshold_out:
                worst_var = pvals.idxmax()
                included.remove(worst_var)
                changed = True
                if verbose:
                    print(f"Drop {worst_var:30} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
- Using the selected features, we fit an Ordinary Least Squares regression via statsmodels.
- The .summary() output reports coefficients, p‑values, R², adjusted R², and diagnostic metrics—offering transparent insight into each variable’s impact on cost.
- On the held‑out test set, we compute R² (explained variance) and RMSE (prediction error) to quantify model performance and generalization.
# Block 4: Feature selection
selected = stepwise_selection(X_train, y_train)
# Fit final model
X_train_sel = sm.add_constant(X_train[selected])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
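# Predict on test set; has_constant='add' forces the intercept column
# even if a selected column happens to be constant in the test split
X_test_sel = sm.add_constant(X_test[selected], has_constant='add')
y_pred = model.predict(X_test_sel)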
# Performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
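For readers who want to see what the two test metrics compute, here is a minimal NumPy sketch of the standard definitions. The helper name `r2_and_rmse` and the four sample values are made up for illustration.

```python
import numpy as np


def r2_and_rmse(y_true, y_pred):
    """Compute R^2 and RMSE from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / y_true.size)
    return r2, rmse


# Made-up costs in USD/MWh, purely to exercise the formulas.
r2, rmse = r2_and_rmse([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.5])
print(round(r2, 4), round(rmse, 4))
```

RMSE is expressed in the same unit as the target (USD per MWh here), which makes it easier to relate to contract prices than the unitless R².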
Residual Diagnostics
A residuals vs. predicted plot helps reveal heteroscedasticity or other non-random patterns, either of which would signal violated OLS assumptions and a less reliable model.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost (USD/MWh)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Biomass Cost")
plt.show()
Summary
Applying stepwise regression to biomass energy data can distill the key drivers of generation cost, such as feedstock price, capacity factor, and thermal efficiency, while trimming non-informative variables.
The resulting linear model is easy to interpret because it keeps only a few statistically significant predictors. Whether it also predicts well should be judged from the test-set R² and RMSE computed above: a model with high R² and low RMSE gives bioenergy project planners and operators a transparent forecasting tool to optimize costs and guide strategic investment.