Solar Energy Output Prediction using Stepwise Regression in ML


Solar farm owners and grid operators need to predict photovoltaic (PV) power output to balance supply and demand and to optimise storage dispatch. In this project, we predict the hourly energy output of a solar installation from weather variables (such as temperature, humidity, wind speed, and solar irradiance) and temporal features such as hour of day and day of year. Using stepwise regression, we build an interpretable linear model for short-term forecasts, which supports operational planning and can help reduce curtailment risk.

Libraries Required

import pandas as pd               # Data manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Statistical modeling  
from sklearn.model_selection import train_test_split   # Data splitting  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation  
import matplotlib.pyplot as plt   # Visualization

Dataset

Solar Output Prediction Using Weather Data

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We read the hourly solar output dataset, inspect its structure, and review summary statistics to understand the distributions of variables (temperature, humidity, wind speed, cloud cover, and output).

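# Block 1: Load dataset
# Kaggle downloads require a logged-in session, so pd.read_csv cannot read
# this link directly. Download the CSV from it first, then load the local file.
url = "https://www.kaggle.com/datasets/thedevastator/solar-output-prediction-using-weather-data/download"
csv_path = "solar_output.csv"  # placeholder name: use the path of the file you downloaded
df = pd.read_csv(csv_path)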

# Inspect first rows and structure
print(df.head())
print(df.info())
print(df.describe())

The dataset includes hourly records of weather (temperature, humidity, wind speed, cloud cover) and corresponding solar power output.

Data Preprocessing

We parse the timestamp column to extract ‘hour’ and ‘day_of_year’, drop the original datetime column, and remove rows with missing values. We then define the predictors (weather and temporal features) and the target (power_output), and split the data 80/20 into training and test sets.

# Block 2: Feature engineering and encoding
# Parse datetime and extract hour and day‑of‑year
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour
df['day_of_year'] = df['datetime'].dt.dayofyear

# Drop original datetime and any NA rows
df = df.drop(columns=['datetime']).dropna()

# Separate predictors and target
X = df.drop('power_output', axis=1)
y = df['power_output']

# Train–test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
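
As a quick check of the temporal features, the sketch below applies the same dt.hour and dt.dayofyear extraction to a tiny hand-made frame. The timestamps and output values are invented for illustration; only the column names datetime and power_output follow the code above.

```python
import pandas as pd

# Toy data: made-up timestamps and outputs, not taken from the real dataset
toy = pd.DataFrame({
    "datetime": ["2021-01-01 06:00", "2021-03-01 12:00", "2021-12-31 18:00"],
    "power_output": [0.0, 5.2, 0.4],
})

# Same steps as Block 2: parse, extract, drop the raw timestamp
toy["datetime"] = pd.to_datetime(toy["datetime"])
toy["hour"] = toy["datetime"].dt.hour
toy["day_of_year"] = toy["datetime"].dt.dayofyear
toy = toy.drop(columns=["datetime"])

print(toy)
```

Because 2021 is not a leap year, 1 March falls on day 60 and 31 December on day 365, which is a simple way to confirm that the extraction behaves as expected.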

Stepwise Regression Function

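Our custom function alternates between a forward step and a backward step. The forward step adds the excluded predictor with the smallest p-value, provided it is below 0.01. The backward step removes the included predictor with the largest p-value, provided it is above 0.05. The loop stops when a full pass neither adds nor removes a variable, which yields a concise set of predictors.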

# Block 3: Forward–backward stepwise selection
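def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting list so the caller's list is never modified
    included = list(initial_list) if initial_list is not None else []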
    while True:
        changed = False

        # Forward step: test each excluded variable
        # Keep the column order of X so ties are broken the same way on every run
        excluded = [col for col in X.columns if col not in included]
        pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best_pval = pvals.min()
        if best_pval < threshold_in:
            best_var = pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.6f}")

        # Backward step: remove worst if necessary
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals_included = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals_included.max()
        if worst_pval > threshold_out:
            worst_var = pvals_included.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.6f}")

        if not changed:
            break
    return included

Model Building & Evaluation

Model Fitting: Using the selected features, we fit an Ordinary Least Squares (OLS) regression with statsmodels. The summary reports coefficient estimates, p-values, and fit statistics (R², AIC).
Evaluation: We predict on the held-out test set and compute R² (the share of variance explained) and RMSE (the typical prediction error, in the units of the target) to quantify how well the model generalises.

# Block 4: Select features
selected_features = stepwise_selection(X_train, y_train)

# Fit final model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

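# Predict and evaluate
# has_constant='add' forces the intercept column even if a test column happens to be constant
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')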
y_pred = model.predict(X_test_sel)
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
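
To make the two metrics concrete, the following sketch computes them from their definitions, R² = 1 - SS_res / SS_tot and RMSE = sqrt(mean of squared errors), and compares the results with the scikit-learn functions used above. The four values are made up so the arithmetic can be followed by hand.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Made-up observations and predictions, chosen for easy arithmetic
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([2.5, 3.5, 6.5, 7.5])

# Residual sum of squares: every error is 0.5, so 4 * 0.25 = 1.0
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares around the mean (5.0): 9 + 1 + 1 + 9 = 20.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot                         # 1 - 1/20 = 0.95
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))   # sqrt(0.25) = 0.5

print("R² manual:", r2_manual, "sklearn:", r2_score(y_true, y_hat))
print("RMSE manual:", rmse_manual,
      "sklearn:", np.sqrt(mean_squared_error(y_true, y_hat)))
```

RMSE is expressed in the units of the target, so an RMSE of 0.5 here means predictions are off by about half a unit of output on average.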

Residual Diagnostics

A scatter plot of residuals versus predicted output helps reveal systematic patterns or heteroscedasticity (non-constant error variance). Either would indicate that an OLS assumption is violated.

# Block 5: Residual analysis
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle='--')
plt.xlabel("Predicted Power Output")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Output")
plt.show()

Summary

Stepwise regression applied to solar generation data isolates the main drivers, which are likely to include irradiance, temperature, and time-of-day effects, while dropping variables that add little. The resulting linear model is easy to interpret, and its accuracy should be judged from the test-set R² and RMSE printed in the evaluation step together with the residual plot. If those checks hold up, operators can use the model for short-term forecasting to support grid-integration and storage-dispatch decisions.
