Solar Output Prediction using Quantile Regression in ML


Grid operators and plant managers need not only an average forecast of photovoltaic (PV) output but also an understanding of variability—for example, what power levels to expect in unusually cloudy (10th percentile) or exceptionally sunny (90th percentile) conditions.

In this solar output prediction project, we predict three quantiles (the 10th, 50th, and 90th percentiles) of hourly solar plant output (kW) from weather and system variables: ambient temperature, module temperature, solar irradiation, and wind speed.

Libraries Required

import pandas as pd  
import numpy as np  
import statsmodels.formula.api as smf    # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss

Dataset

Solar energy power generation dataset

Step-by-Step Code Implementation

Load & Inspect Data

We read an hourly-resolution PV output dataset that includes ambient and module temperatures, irradiation, wind speed, and measured power.
We then inspect it with .head(), .info(), and .describe() to check column types and spot any unexpected missing values.

# Load the solar power generation dataset (hourly records)
# Source: "Solar energy power generation dataset" on Kaggle
df = pd.read_csv("solar-energy-power-generation-dataset.csv")

# Peek at the first rows and check types
print(df.head())
df.info()  # info() prints its report itself and returns None, so no print() wrapper
print(df.describe())
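
As a quick supplement to .info(), the following sketch tabulates missing values per column. The helper name missing_report and the toy frame are illustrative assumptions rather than part of the original project; on the real data you would pass df instead.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(frame)).round(2),
    })
    return report.sort_values("missing", ascending=False)

# Toy frame standing in for the real dataset; column names are placeholders.
toy = pd.DataFrame({
    "Irradiation": [0.0, 0.4, np.nan, 0.9],
    "Power_kW": [0.0, 120.5, 310.2, np.nan],
    "Wind_Speed": [2.1, 3.0, 1.8, 2.4],
})
report = missing_report(toy)
print(report)
```

Columns with a high missing percentage are worth a closer look before they are passed to dropna, since dropping them can remove a large share of rows.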

Preprocessing & Feature Selection

If a timestamp column exists, we parse it and set it as the DataFrame index for potential time‐based extensions.
We drop any rows missing core fields to ensure clean model fitting.
We select four meteorological and system variables (features) as predictors and rename the measured output column to Power for brevity.

# Parse the timestamp column, if present, and use it as the index
if 'DATE_TIME' in df.columns:
    df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'])
    df.set_index('DATE_TIME', inplace=True)

# Select relevant predictors and the target
# Typical columns: 'Ambient_Temperature', 'Module_Temperature', 'Irradiation', 'Wind_Speed', 'Power_kW'
df = df.dropna(subset=[
    'Ambient_Temperature',
    'Module_Temperature',
    'Irradiation',
    'Wind_Speed',
    'Power_kW'
])

features = [
    'Ambient_Temperature',
    'Module_Temperature',
    'Irradiation',
    'Wind_Speed'
]
data = df[features + ['Power_kW']].copy()
data.rename(columns={'Power_kW': 'Power'}, inplace=True)

Train/Test Split

Using an 80/20 random split, we reserve 20% of observations for out-of-sample evaluation, so we can check how well the quantile models generalize to observations they were not fitted on.

# Reserve 20% of the data for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)
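
Hourly PV records are strongly autocorrelated, so a random split can place neighboring hours in both sets and make test scores look better than they would on genuinely new days. One possible alternative, sketched here under the assumption that the frame has a datetime index (as it does when DATE_TIME is present), is to hold out the most recent 20% of rows. The helper chronological_split is a hypothetical name, and the toy frame only stands in for data.

```python
import numpy as np
import pandas as pd

def chronological_split(frame: pd.DataFrame, test_fraction: float = 0.2):
    """Split a time-indexed frame so the test set is the most recent slice."""
    ordered = frame.sort_index()
    cutoff = int(len(ordered) * (1 - test_fraction))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy hourly frame; in the tutorial this would be `data` with a DATE_TIME index.
idx = pd.date_range("2020-06-01", periods=10, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"Power": np.arange(10.0)}, index=idx)

train_ts, test_ts = chronological_split(toy, test_fraction=0.2)
print(len(train_ts), "training rows,", len(test_ts), "test rows")
```

Every test timestamp is later than every training timestamp, which is closer to how a forecast would be used in operation.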

Fit Quantile Regression Models

We build the formula string once (“Power ~ Ambient_Temperature + Module_Temperature + Irradiation + Wind_Speed”). Then, for each target quantile (10th, 50th, 90th):

We fit a QuantReg model at that quantile on the training set.
We extract and print the coefficient table (.tables[1]), which shows each predictor’s estimated marginal effect at that part of the output distribution (e.g., how a 1 kW/m² increase in irradiation affects low-output versus high-output conditions differently).

quantiles = [0.1, 0.5, 0.9]
results   = {}
formula   = "Power ~ " + " + ".join(features)

for q in quantiles:
    model = smf.quantreg(formula, train)
    res   = model.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])
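
To see what fit(q=q) is optimizing, it can help to look at the pinball (check) loss directly: quantile regression chooses coefficients that minimize it on the training data. The sketch below drops the predictors and searches for the best constant prediction on synthetic data, which should land close to the empirical quantile. The pinball_loss helper and the gamma-distributed sample are illustrative assumptions, not the project's data.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Average pinball (check) loss of predictions y_pred at quantile level q."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-predictions (diff > 0) cost q per unit; over-predictions cost (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=50.0, size=2000)  # synthetic, skewed "power" values

# Brute-force search over constant predictions for each quantile level.
candidates = np.linspace(y.min(), y.max(), 2001)
minimisers = {}
for q in (0.1, 0.5, 0.9):
    losses = [pinball_loss(y, c, q) for c in candidates]
    minimisers[q] = float(candidates[int(np.argmin(losses))])
    print(f"q={q}: loss minimiser {minimisers[q]:.1f}, "
          f"empirical quantile {np.quantile(y, q):.1f}")
```

The asymmetric weights are what pull the fitted line toward the lower or upper part of the distribution: at q = 0.9, under-predicting is nine times as costly as over-predicting.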

Evaluation with Pinball Loss

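We predict on the held-out test set and compute the pinball loss, the standard scoring rule for quantile forecasts, at each quantile level. For a given quantile, a lower pinball loss means the predicted quantile tracks the realized values more closely. Because the loss weights errors differently at each level, it is best used to compare competing models at the same quantile rather than to compare one quantile against another.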

for q, res in results.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['Power'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
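
Pinball loss is a single number, so two complementary checks may be worth running alongside it: empirical coverage (roughly a fraction q of the actual values should fall at or below the predicted q-th quantile) and quantile crossing (because each quantile is fitted separately, the 10th-percentile prediction can occasionally exceed the 50th). This is a minimal sketch on synthetic numbers; empirical_coverage, crossing_rate, and the toy arrays are hypothetical stand-ins for test['Power'] and the model predictions.

```python
import numpy as np

def empirical_coverage(y_true, y_pred):
    """Fraction of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))

def crossing_rate(lower, upper):
    """Fraction of rows where a lower-quantile prediction exceeds a higher one."""
    return float(np.mean(np.asarray(lower) > np.asarray(upper)))

# Synthetic stand-ins: normally distributed "power" and its true quantiles.
rng = np.random.default_rng(1)
y_test = rng.normal(loc=100.0, scale=20.0, size=5000)
z90 = 1.2816  # standard normal 90th-percentile z-score
toy_preds = {
    0.1: np.full_like(y_test, 100.0 - z90 * 20.0),
    0.5: np.full_like(y_test, 100.0),
    0.9: np.full_like(y_test, 100.0 + z90 * 20.0),
}

for q, p in toy_preds.items():
    print(f"target {q:.2f}, observed coverage {empirical_coverage(y_test, p):.3f}")
print("10/50 crossing rate:", crossing_rate(toy_preds[0.1], toy_preds[0.5]))
print("50/90 crossing rate:", crossing_rate(toy_preds[0.5], toy_preds[0.9]))
```

With the tutorial's objects, the same functions could be called on test['Power'] and results[q].predict(test[features]); coverage far from q, or a noticeable crossing rate, would suggest the linear specification is struggling in that part of the distribution.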

Summary

By applying quantile regression to solar PV data, we capture the differential impacts of environmental conditions on low, median, and high power outputs.

For instance, module temperature might depress the 90th‐percentile output more than the median.

These tail‐specific insights equip grid operators and plant engineers with distribution‐aware forecasts: provisioning buffer capacity for cloudy periods (10th percentile), planning routine operations around typical output (50th percentile), and optimizing storage or curtailment strategies during peak generation (90th percentile).
