Solar Output Prediction using Quantile Regression in ML
Grid operators and plant managers need not only an average forecast of photovoltaic (PV) output but also an understanding of variability—for example, what power levels to expect in unusually cloudy (10th percentile) or exceptionally sunny (90th percentile) conditions.
In the solar output prediction ML project, we’ll predict multiple quantiles (10th, 50th, 90th percentiles) of hourly solar plant output (kW) based on weather and system variables (ambient temperature, module temperature, solar irradiation, wind speed).
Libraries Required
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf  # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss
Dataset
Solar energy power generation dataset
Step-by-Step Code Implementation
Load & Inspect Data
- We read an hourly‐resolution PV output dataset—including ambient and module temperatures, irradiation, wind speed, and measured power.
- Then inspect with .info() and .describe() to verify no unexpected missingness.
# Load the solar power generation dataset (hourly records)
# Source: "Solar energy power generation dataset" on Kaggle
df = pd.read_csv("solar-energy-power-generation-dataset.csv")
# Peek at the first rows and check types
print(df.head())
print(df.info())
print(df.describe())
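The missingness check can also be made explicit by counting nulls per column. A minimal sketch on a small stand-in frame follows; the values are made up for illustration, and the column names are the ones this walkthrough assumes for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; values are illustrative only
toy = pd.DataFrame({
    "Irradiation": [0.0, 0.45, np.nan, 0.80],
    "Power_kW": [0.0, 310.0, 295.0, np.nan],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Share of rows with at least one missing field
share_incomplete = toy.isna().any(axis=1).mean()
print(f"Rows with a missing value: {share_incomplete:.0%}")
```

On the real data, the same two calls applied to df would show which columns need attention before fitting.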
Preprocessing & Feature Selection
- If a timestamp column exists, we parse it and set it as the DataFrame index for potential time‐based extensions.
- We drop any rows missing core fields to ensure clean model fitting.
- We select four meteorological and system variables (features) as predictors and rename the measured output column to Power for brevity.
# Parse the timestamp column, if present, and use it as the index
if 'DATE_TIME' in df.columns:
    df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'])
    df.set_index('DATE_TIME', inplace=True)
# Select relevant predictors and the target
# Typical columns: 'Ambient_Temperature', 'Module_Temperature', 'Irradiation', 'Wind_Speed', 'Power_kW'
# Drop rows that are missing any predictor or the target
df = df.dropna(subset=[
    'Ambient_Temperature',
    'Module_Temperature',
    'Irradiation',
    'Wind_Speed',
    'Power_kW'
])
features = [
    'Ambient_Temperature',
    'Module_Temperature',
    'Irradiation',
    'Wind_Speed'
]
data = df[features + ['Power_kW']].copy()
data.rename(columns={'Power_kW': 'Power'}, inplace=True)
Train/Test Split
- Using an 80/20 random split, we reserve 20% of observations for out-of-sample evaluation.
- This lets us check whether our quantile models generalize to weather conditions they were not fitted on.
# Reserve 20% of the data for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)
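Because the records are hourly and consecutive hours tend to be strongly correlated, a random split can place near-identical hours on both sides of the split. One alternative is a chronological split that holds out the most recent 20% of rows. The sketch below uses a synthetic stand-in frame, and chronological_split is a hypothetical helper written for this example, not a scikit-learn function.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, test_size=0.2):
    """Hold out the last `test_size` share of rows, ordered by index."""
    ordered = frame.sort_index()
    n_test = int(round(len(ordered) * test_size))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Synthetic stand-in: 100 hourly rows with a datetime index
idx = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(100), unit="h")
toy = pd.DataFrame({"Power": np.arange(100, dtype=float)}, index=idx)

train_chrono, test_chrono = chronological_split(toy, test_size=0.2)
print(len(train_chrono), len(test_chrono))

# Every training timestamp precedes every test timestamp
print(train_chrono.index.max() < test_chrono.index.min())
```

This only makes sense when the timestamp was set as the index in the preprocessing step; without one, the random split is the available option.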
Fit Quantile Regression Models
For each target quantile (10th, 50th, 90th):
- We construct a formula string (“Power ~ Ambient_Temperature + Module_Temperature + Irradiation + Wind_Speed”).
- We fit a QuantReg model at that percentile on the training set.
- We extract and print the coefficient table (.tables[1]), revealing each predictor’s marginal effect at that part of the output distribution (e.g., how a 1 kW/m² increase in irradiation affects low‐versus‐high‐output conditions differently).
quantiles = [0.1, 0.5, 0.9]
results = {}
formula = "Power ~ " + " + ".join(features)
for q in quantiles:
    # Fit one model per quantile level on the training set
    model = smf.quantreg(formula, train)
    res = model.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])
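To compare how a predictor's effect changes across the distribution, the fitted coefficients can be collected into one table. The sketch below is self-contained and uses synthetic data, not the Kaggle dataset: the noise is made to grow with irradiation, so the irradiation slope should rise from the 10th to the 90th percentile. The data-generating numbers are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: output scales with irradiation, and so does the noise
irr = rng.uniform(0.0, 1.0, size=n)
power = 100.0 * irr + 20.0 * irr * rng.standard_normal(n)
toy = pd.DataFrame({"Irradiation": irr, "Power": power})

# Fit one quantile regression per level and keep the coefficient vectors
coef_by_quantile = {}
for q in [0.1, 0.5, 0.9]:
    fit = smf.quantreg("Power ~ Irradiation", toy).fit(q=q)
    coef_by_quantile[q] = fit.params

# Rows are terms (Intercept, Irradiation); columns are quantile levels
coef_table = pd.DataFrame(coef_by_quantile)
print(coef_table.round(2))
```

With the results dictionary from the loop above, the same table could be built as pd.DataFrame({q: res.params for q, res in results.items()}).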
Evaluation with Pinball Loss
We predict on the held‐out test set and compute pinball loss—the appropriate scoring rule for quantile forecasts—for each quantile. Lower pinball loss indicates better alignment of predicted and realized quantiles, allowing comparison of model performance across the distribution.
for q, res in results.items():
    # Predict the q-th quantile on the held-out set and score it
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Power'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
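To make the scoring rule concrete, here is a minimal hand-written version of the pinball loss in plain Python. It is a sketch meant to show the arithmetic, not a replacement for mean_pinball_loss; the function name pinball_loss and the sample numbers are made up for illustration.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for a quantile level q in (0, 1)."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        # Under-prediction (diff > 0) costs q per unit;
        # over-prediction (diff < 0) costs (1 - q) per unit.
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

# Illustrative values only (kW): two forecasts overshoot by 2, one is exact
actual = [10.0, 20.0, 30.0]
forecast = [12.0, 22.0, 30.0]

loss_p10 = pinball_loss(actual, forecast, q=0.1)
loss_p90 = pinball_loss(actual, forecast, q=0.9)
print(f"q=0.1: {loss_p10:.3f}, q=0.9: {loss_p90:.3f}")
```

The same overshoot is penalized far more at q=0.1 than at q=0.9, which is why a low-quantile model is pushed toward conservative predictions and a high-quantile model toward optimistic ones.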
Summary
By applying quantile regression to solar PV data, we capture the differential impacts of environmental conditions on low, median, and high power outputs.
For instance, module temperature might depress the 90th‐percentile output more than the median.
These tail‐specific insights equip grid operators and plant engineers with distribution‐aware forecasts: provisioning buffer capacity for cloudy periods (10th percentile), planning routine operations around typical output (50th percentile), and optimizing storage or curtailment strategies during peak generation (90th percentile).
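As a quick sanity check on the band formed by the 10th and 90th percentile forecasts, one can measure how often the realized output falls inside it; for well-calibrated quantiles that share should be near 80% on held-out data. The sketch below uses made-up numbers, and interval_coverage is a hypothetical helper written for this example.

```python
def interval_coverage(y_true, lower, upper):
    """Share of observations that fall inside [lower, upper]."""
    inside = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return inside / len(y_true)

# Made-up numbers for illustration only (kW)
actual = [120.0, 340.0, 510.0, 80.0, 450.0]
p10 = [100.0, 300.0, 520.0, 60.0, 400.0]
p90 = [200.0, 420.0, 640.0, 150.0, 430.0]

coverage = interval_coverage(actual, p10, p90)
print(f"Empirical coverage of the 10-90 band: {coverage:.0%}")
```

With the fitted models, lower and upper would be results[0.1].predict(test[features]) and results[0.9].predict(test[features]), compared against test['Power'].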