Wind Energy Quantile Prediction using Quantile Regression in ML
Grid operators and renewable energy planners require not only an average forecast of wind farm output but also an understanding of the range of possible generation levels—anticipating both low‐output periods (10th percentile) and peak production events (90th percentile).
In this project, we predict the 10th, 50th, and 90th percentiles of hourly wind power output (MW) for four German transmission system operators (TSOs) using historical SCADA data (wind speed, wind direction, temperature, and prior output). By fitting a separate quantile regression model for each percentile, we can see how each meteorological and operational factor drives low-, median-, and high-output scenarios, which supports robust grid balancing and more resilient integration of wind energy.
Libraries Required
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf  # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts
import matplotlib.pyplot as plt  # Visualization of residuals
Dataset
Step-by-Step Code Implementation
Load & Inspect Data
We ingest SCADA‐style records—hourly power output (Power in MW), wind speed (WindSpeed m/s), wind direction (WindDirection degrees), and air temperature (AirTemp °C)—from four major German TSOs (Kaggle). Initial .info() and .describe() calls verify data types, ranges, and any missingness.
# Load the “Wind Power Generation” SCADA dataset for four German TSOs
df = pd.read_csv("wind-power-generation.csv")
# Quick inspection
print(df.head())
print(df.info())
print(df[['Power']].describe())
Preprocessing & Feature Engineering
- We drop incomplete records for model integrity.
- We add a lagged output feature (Power_lag1) to capture inertia in generation dynamics.
- We define features as our predictor set and rename the target column to Output for clarity in formulas.
# Drop rows with missing core variables
df = df.dropna(subset=['Power','WindSpeed','WindDirection','AirTemp'])
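# (Optional) Create a lag feature: prior hour’s power output (assumes rows are in time order).
# The first row has no predecessor, so it is backfilled; fillna(method='bfill') is deprecated in recent pandas.
df['Power_lag1'] = df['Power'].shift(1).bfill()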
# Define predictors and target
features = ['WindSpeed','WindDirection','AirTemp','Power_lag1']
df_model = df[features + ['Power']].copy()
df_model.rename(columns={'Power':'Output'}, inplace=True)
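Wind direction is a circular quantity (0° and 360° describe the same direction), so a single linear WindDirection term may not capture it well. One common option, shown here as a minimal sketch and not as part of the pipeline above, is to replace the angle with its sine and cosine. The helper add_direction_components and the column names WindDir_sin and WindDir_cos are hypothetical, and the demo frame holds made-up values.

```python
import numpy as np
import pandas as pd

def add_direction_components(frame, column="WindDirection"):
    """Return a copy of `frame` with sine/cosine encodings of a direction in degrees.

    Hypothetical helper: the output column names are illustrative choices.
    """
    out = frame.copy()
    radians = np.deg2rad(out[column])
    out["WindDir_sin"] = np.sin(radians)
    out["WindDir_cos"] = np.cos(radians)
    return out

# Made-up directions, only to show that 0 and 360 degrees map to the same point
demo = pd.DataFrame({"WindDirection": [0.0, 90.0, 180.0, 360.0]})
encoded = add_direction_components(demo)
print(encoded.round(3))
```

If this encoding were adopted, the two new columns would replace WindDirection in the features list.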
Train/Test Split
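We randomly hold out 20% of the data so we can check how well the quantile models generalize to unseen weather and operational conditions. Note that with hourly data and a lagged feature, a random split places test hours right next to training hours, so the resulting scores may be optimistic.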
# Reserve 20% of the data for out-of-sample evaluation
train, test = train_test_split(df_model, test_size=0.2, random_state=42)
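Because the records are hourly and the model uses the previous hour's output, a time-ordered holdout is a useful alternative to a random split. The sketch below is one way to do it, assuming the rows are already sorted by time. The helper chronological_split is hypothetical, and the demo frame is synthetic rather than real SCADA data.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, test_fraction=0.2):
    """Hold out the last `test_fraction` of rows; assumes rows are in time order."""
    n_test = int(round(len(frame) * test_fraction))
    cut = len(frame) - n_test
    return frame.iloc[:cut], frame.iloc[cut:]

# Synthetic stand-in for df_model (illustrative values only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "WindSpeed": rng.uniform(0, 25, 100),
    "Output": rng.uniform(0, 500, 100),
})

train_chrono, test_chrono = chronological_split(demo, test_fraction=0.2)
print(len(train_chrono), len(test_chrono))
```

With this split, every test row comes after every training row, which is closer to how a forecast would be used in operation.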
Fit Quantile Regression Models
For each quantile (10th, 50th, 90th percentiles):
- We build a statsmodels formula (“Output ~ WindSpeed + WindDirection + AirTemp + Power_lag1”).
- We fit a QuantReg model on the training set at that percentile.
- We print only the coefficient table—showing how each predictor’s effect on generation shifts across low, median, and high output levels (e.g., wind speed may have a stronger marginal effect on the 90th percentile).
quantiles = [0.10, 0.50, 0.90]
results = {}
formula = "Output ~ " + " + ".join(features)
for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])  # coefficients only
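Since each quantile is fitted as an independent model, nothing forces the predicted 10th percentile to stay below the median, or the median below the 90th. The sketch below shows one simple way to detect such crossings and to repair them by sorting the three predictions at each time step. The function names and the prediction arrays are hypothetical and for illustration only.

```python
import numpy as np

def count_crossings(p10, p50, p90):
    """Count rows where the predicted quantiles are not in non-decreasing order."""
    stacked = np.vstack([p10, p50, p90])          # shape (3, n_rows)
    steps = np.diff(stacked, axis=0)              # p50 - p10, then p90 - p50
    return int(np.sum(np.any(steps < 0, axis=0)))

def fix_crossings(p10, p50, p90):
    """Sort the three predictions per row so they are monotone (a simple post-hoc fix)."""
    return np.sort(np.vstack([p10, p50, p90]), axis=0)

# Made-up predictions: the second and third rows contain crossings
p10 = np.array([10.0, 50.0, 30.0])
p50 = np.array([20.0, 40.0, 35.0])
p90 = np.array([30.0, 60.0, 33.0])

n_crossed = count_crossings(p10, p50, p90)
fixed = fix_crossings(p10, p50, p90)
print(n_crossed)
print(fixed)
```

Sorting is only one possible repair. A large number of crossings may instead suggest that the model specification needs another look.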
Evaluation with Pinball Loss
- We produce quantile‐specific output forecasts on the test set.
- We compute pinball loss for each quantile. It is a proper scoring rule for quantile estimates that penalizes under- and over-prediction asymmetrically according to the percentile. Lower pinball loss indicates a more accurate quantile forecast; compare losses between models at the same quantile, not across different quantiles.
for q, res in results.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Output'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
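To make the asymmetric penalty concrete, the sketch below writes the pinball loss out by hand as the mean of max(q·(y − ŷ), (q − 1)·(y − ŷ)), and adds a simple empirical coverage check: the share of observations at or below the predicted quantile, which should be close to q for a well-calibrated model. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss at quantile level q (0 < q < 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def empirical_coverage(y_true, y_pred):
    """Share of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))

# Toy values: a constant prediction of 20 MW against four observations
y_obs = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.full(4, 20.0)

loss_q90 = pinball_loss(y_obs, y_hat, q=0.9)   # under-prediction is penalized heavily
loss_q10 = pinball_loss(y_obs, y_hat, q=0.1)   # over-prediction is penalized heavily
coverage = empirical_coverage(y_obs, y_hat)
print(loss_q90, loss_q10, coverage)
```

In this toy case the same prediction scores about 7.0 at q = 0.9 and about 3.0 at q = 0.1, because two of the four observations lie above it. The coverage of 0.5 shows that 20 MW behaves more like a median than like a 10th or 90th percentile.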
Residual Diagnostics (Example for Median)
# Plot residuals for the 50th‑percentile model
median_res = results[0.50]
preds_med = median_res.predict(test[features])
resid_med = test['Output'] - preds_med
plt.scatter(preds_med, resid_med, alpha=0.3)
plt.axhline(0, linestyle='--')
plt.xlabel("Predicted Median Output (MW)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted (50th Percentile)")
plt.show()
Summary
By employing quantile regression on wind SCADA data, we obtain distribution‑aware forecasts of wind farm output:
- The 10th‑percentile model prepares operators for low‐wind scenarios, guiding reserve capacity planning.
- The median (50th‑percentile) model provides typical output expectations for routine balancing.
- The 90th‑percentile model anticipates peak generation events, informing grid‐injection strategies and curtailment decisions.
These tailored quantile forecasts empower grid operators and renewable planners with robust tools to manage variability, optimize storage dispatch, and improve the reliability of integrating wind energy into the power system.