Solar Energy Output Prediction using Stepwise Regression in ML
Solar farm owners and grid operators must predict photovoltaic (PV) power output to balance supply and demand and to optimise storage dispatch. In this project, we predict the hourly energy output of a solar installation from weather variables (such as temperature, humidity, wind speed, and solar irradiance) and temporal features such as hour of day and day of year. Using stepwise regression, we build an interpretable linear model for short‑term forecasts that supports operational planning and helps reduce curtailment risk.
Libraries Required
import pandas as pd  # Data manipulation
import numpy as np  # Numerical operations
import statsmodels.api as sm  # Statistical modeling
from sklearn.model_selection import train_test_split  # Data splitting
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation
import matplotlib.pyplot as plt  # Visualization
Dataset
Solar Output Prediction Using Weather Data
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We read the hourly solar output dataset, inspect its structure, and review summary statistics to understand the distributions of variables (temperature, humidity, wind speed, cloud cover, and output).
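# Block 1: Load dataset
# Kaggle downloads normally require a logged-in session, so pd.read_csv
# cannot read the link below directly. Download the CSV manually first.
# Source: https://www.kaggle.com/datasets/thedevastator/solar-output-prediction-using-weather-data/download
csv_path = "solar_output.csv"  # assumed local filename; adjust to the downloaded file
df = pd.read_csv(csv_path)
# Inspect first rows and structure
print(df.head())
df.info()  # info() prints its report itself and returns None
print(df.describe())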
The dataset includes hourly records of weather (temperature, humidity, wind speed, cloud cover) and corresponding solar power output.
Data Preprocessing
We convert the timestamp to extract ‘hour’ and ‘day_of_year’, drop the original datetime, and remove missing entries. We then define the predictors (weather and temporal features) and the target (power_output), splitting the data into training and test sets.
# Block 2: Feature engineering and train–test split
# Parse datetime and extract hour and day‑of‑year
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour
df['day_of_year'] = df['datetime'].dt.dayofyear
# Drop original datetime and any NA rows
df = df.drop(columns=['datetime']).dropna()
# Separate predictors and target
X = df.drop('power_output', axis=1)
y = df['power_output']
# Train–test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
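To make the temporal features concrete, here is a minimal sketch on three made-up timestamps. The `demo` frame and its values are hypothetical; only the `datetime` column name follows the code above.

```python
import pandas as pd

# Hypothetical timestamps, chosen only to illustrate the extraction
demo = pd.DataFrame(
    {"datetime": ["2023-01-01 00:00", "2023-03-01 13:00", "2023-12-31 23:00"]}
)
demo["datetime"] = pd.to_datetime(demo["datetime"])

# Same accessors as in Block 2
demo["hour"] = demo["datetime"].dt.hour                # 0..23
demo["day_of_year"] = demo["datetime"].dt.dayofyear    # 1..365 (366 in leap years)

print(demo)
```

In this sketch, 1 March 2023 maps to day 60 because 2023 is not a leap year (31 days in January plus 28 in February).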
Stepwise Regression Function
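Our custom function alternates between forward inclusion (adding the excluded predictor with the lowest p‑value, provided it is below 0.01) and backward elimination (removing the included predictor with the highest p‑value, provided it is above 0.05) until no variable is added or removed. Keeping the entry threshold below the removal threshold prevents a variable from being added and then immediately dropped. The result is a concise set of variables.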
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=[],
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    included = list(initial_list)
    while True:
        changed = False
        # Forward step: test each excluded variable
        excluded = list(set(X.columns) - set(included))
        pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best_pval = pvals.min()
        if best_pval < threshold_in:
            best_var = pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add {best_var:30} p-value {best_pval:.6f}")
        # Backward step: remove worst if necessary
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals_included = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals_included.max()
        if worst_pval > threshold_out:
            worst_var = pvals_included.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.6f}")
        if not changed:
            break
    return included
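The add and drop decisions above rest on the p‑values of OLS coefficients. The following is a minimal NumPy sketch of where such a p‑value comes from, not the statsmodels implementation. The helper `ols_pvalues` and the synthetic data are hypothetical. It also assumes a normal approximation to the t‑distribution, which is reasonable only for large samples, whereas statsmodels uses the exact t‑distribution.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Two-sided p-values for OLS coefficients, intercept first.

    Assumption: normal approximation to the t-distribution (large n).
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])            # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares coefficients
    resid = y - A @ beta
    dof = n - A.shape[1]                            # residual degrees of freedom
    sigma2 = resid @ resid / dof                    # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)           # coefficient covariance matrix
    t_stats = beta / np.sqrt(np.diag(cov))          # coefficient / standard error
    # P(|Z| > |t|) for a standard normal Z equals erfc(|t| / sqrt(2))
    return np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

# Synthetic data (hypothetical): the target depends on "irradiance" only
rng = np.random.default_rng(0)
n = 500
irradiance = rng.uniform(0, 1000, n)
noise_feature = rng.normal(size=n)                  # unrelated to the target
power = 0.05 * irradiance + rng.normal(scale=2.0, size=n)

pvals = ols_pvalues(np.column_stack([irradiance, noise_feature]), power)
print("p-values (intercept, irradiance, noise_feature):", pvals)
```

With a signal this strong, the irradiance p‑value falls far below the 0.01 entry threshold, so the forward step would admit it. An unrelated column would usually, though not always, fail that threshold.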
Model Building & Evaluation
- Model Fitting: Using the selected features, we fit an Ordinary Least Squares regression via statsmodels. The summary provides coefficient estimates, p‑values, and fit statistics (R², AIC).
- Evaluation: We predict on the test set and compute R² (explained variance) and RMSE (prediction error), quantifying how well the model generalises.
# Block 4: Select features
selected_features = stepwise_selection(X_train, y_train)
# Fit final model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict and evaluate
X_test_sel = sm.add_constant(X_test[selected_features])
y_pred = model.predict(X_test_sel)
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
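To show what the two metrics measure, here is a small sketch that computes them by hand on made-up numbers. The helper `r2_and_rmse` and the four data points are hypothetical. The formulas are the standard definitions, R² = 1 − SS_res / SS_tot and RMSE = sqrt(SS_res / n).

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / y_true.size)              # same units as the target
    return r2, rmse

# Made-up values: SS_res = 0.10, SS_tot = 5.0
r2, rmse = r2_and_rmse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")  # R2 = 0.980, RMSE = 0.158
```

Because RMSE is expressed in the units of power_output, it is usually easier to relate to operational tolerances than R².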
Residual Diagnostics
A scatter plot of residuals versus predicted output helps reveal systematic patterns or heteroscedasticity (non-constant error variance), either of which would indicate a violation of OLS assumptions.
# Block 5: Residual analysis
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle='--')
plt.xlabel("Predicted Power Output")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Output")
plt.show()
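The visual check can be complemented by a rough numeric one. The sketch below compares the residual spread for above-median predictions with the spread for the rest. A ratio far from 1 hints at heteroscedasticity. The helper `residual_spread_ratio` and the synthetic residuals are hypothetical, and this is an informal heuristic, not a formal statistical test.

```python
import numpy as np

def residual_spread_ratio(y_pred, residuals):
    """Std of residuals where predictions exceed their median, divided by the std elsewhere."""
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    high = y_pred > np.median(y_pred)                 # boolean mask for the upper half
    return residuals[high].std() / residuals[~high].std()

# Synthetic illustration (hypothetical data)
rng = np.random.default_rng(1)
pred = rng.uniform(0.0, 10.0, 2000)
resid_hetero = rng.normal(size=2000) * pred           # error spread grows with prediction
resid_homo = rng.normal(size=2000)                    # constant error spread

ratio_hetero = residual_spread_ratio(pred, resid_hetero)
ratio_homo = residual_spread_ratio(pred, resid_homo)
print(f"heteroscedastic ratio: {ratio_hetero:.2f}")
print(f"homoscedastic ratio:   {ratio_homo:.2f}")
```

With the variables from Block 5, the call would be `residual_spread_ratio(y_pred, residuals)`.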
Summary
Stepwise regression applied to solar generation data isolates the key drivers (likely irradiance, temperature, and time‑of‑day effects) while dropping predictors that do not meet the significance thresholds. The resulting linear model favours interpretability; its accuracy should be judged from the test‑set R² and RMSE printed in Block 4. If those metrics are strong, operators can use the model for short‑term forecasting to support grid-integration and storage-dispatch decisions.