Solar Energy Output Prediction using Linear Regression in ML
A photovoltaic (PV) farm’s revenue depends on how much electricity its panels actually produce hour by hour. Grid operators schedule reserve capacity based on forecasts of that output, and plant managers use them to decide when to clean modules or tilt trackers.
In this mini-project, we build a linear-regression baseline that predicts the instantaneous DC power output (kW) of each inverter in a utility-scale solar plant from three weather-station readings: ambient temperature, module temperature, and plane-of-array irradiation. The model is purposely simple and fully transparent, giving engineers a first-order estimate and a clear view of how each environmental factor pushes production up or down.
Libraries Required
- pandas # data loading & joins
- numpy # numeric helpers
- matplotlib.pyplot # quick sanity plots
- scikit‑learn # preprocessing, model, metrics
- joblib # save the trained pipeline
Dataset Link
Step-by-Step Code Implementation
Why linear regression first? Within a normal operating range, the relationship between plane-of-array irradiation and panel output is close to linear; temperature adds second-order derating. A straight-line fit captures those effects cleanly and exposes their estimated strength in the coefficient table.
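Written out, the baseline takes roughly the form below. The symbols are notation chosen here for illustration (they are not column names from the dataset): T_amb is ambient temperature, T_mod is module temperature, G is plane-of-array irradiation, and the β terms are the fitted intercept and coefficients.

```latex
\hat{P}_{\mathrm{DC}} = \beta_0 + \beta_1\, T_{\mathrm{amb}} + \beta_2\, T_{\mathrm{mod}} + \beta_3\, G
```

Under this reading, a positive β_3 corresponds to more sun giving more power, and a negative β_2 would correspond to thermal derating. When the predictors are standardized first, each β_i is expressed in kW per standard deviation of its input.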
1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load generation & weather data
The Kaggle dataset ships as four CSVs (PLANT[1|2]_Generation_Data.csv and PLANT[1|2]_Weather_Sensor_Data.csv).
We’ll work with Plant 1 for clarity; repeat the same steps for Plant 2 if desired.
gen = pd.read_csv("PLANT1_Generation_Data.csv")
weath = pd.read_csv("PLANT1_Weather_Sensor_Data.csv")
3. Join logic
Generation rows arrive at 15-minute intervals per inverter; the weather station logs at the same cadence. An inner join on PLANT_ID and DATE_TIME keeps only timestamps where both readings exist, so no weather values have to be forward-filled or imputed.
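# parse timestamps; if a file stores dates day-first (e.g. 15-05-2020),
# pass dayfirst=True so both tables end up on the same calendar
gen['DATE_TIME'] = pd.to_datetime(gen['DATE_TIME'])
weath['DATE_TIME'] = pd.to_datetime(weath['DATE_TIME'])
# merge the two tables on plant and timestamp; every inverter row picks up
# the plant-level weather reading taken at the same instant
df = gen.merge(weath, on=['PLANT_ID', 'DATE_TIME'], how='inner')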
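To see what the inner join does, here is a minimal sketch on toy data. The frames `gen_toy` and `weath_toy`, the inverter names, and all values are made up for illustration. The `validate="many_to_one"` argument asks pandas to raise an error if the weather table ever holds more than one row per plant and timestamp.

```python
import pandas as pd

# Toy stand-ins for the generation and weather tables (made-up values).
gen_toy = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 12:00", "2020-05-15 12:00",
                                 "2020-05-15 12:15", "2020-05-15 12:30"]),
    "PLANT_ID": [1, 1, 1, 1],
    "SOURCE_KEY": ["inv_A", "inv_B", "inv_A", "inv_A"],
    "DC_POWER": [800.0, 790.0, 820.0, 810.0],
})
weath_toy = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 12:00", "2020-05-15 12:15"]),
    "PLANT_ID": [1, 1],
    "IRRADIATION": [0.80, 0.85],
})

# Many inverter rows map onto one weather row per timestamp.
merged_toy = gen_toy.merge(
    weath_toy,
    on=["PLANT_ID", "DATE_TIME"],
    how="inner",
    validate="many_to_one",  # fails loudly if weather keys are duplicated
)

# Rows with no matching weather reading are dropped by the inner join.
dropped = len(gen_toy) - len(merged_toy)
print(merged_toy)
print(f"Generation rows without a weather match: {dropped}")
```

Comparing row counts before and after the merge in this way is a cheap check that the join did not silently discard a large share of the generation data.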
4. Select predictors & target
A positive IRRADIATION coefficient confirms that more sun means more power. A negative MODULE_TEMPERATURE coefficient (if present) quantifies thermal losses—each extra degree knocks off a predictable slice of output.
features = ['AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION']
target = 'DC_POWER'  # power fed by inverter before AC conversion
X = df[features]
y = df[target]
5. Pipeline: scaling + linear model
numeric_transformer = Pipeline([('scale', StandardScaler())])
preprocess = ColumnTransformer(
    transformers=[('num', numeric_transformer, features)],
    remainder='drop')
linreg = LinearRegression()
pipe = Pipeline(steps=[('prep', preprocess),
                       ('model', linreg)])
6. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
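A shuffled split mixes neighbouring timestamps between the training and test sets, which can flatter the score on time-series data. One alternative is a chronological hold-out, sketched here on synthetic data. The frame `demo`, its 15-minute step, and the 80/20 cut are assumptions made for illustration, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged frame: 10 days at an arbitrary 15-minute step.
rng = np.random.default_rng(0)
times = pd.date_range("2020-05-15", periods=10 * 96, freq="15min")
demo = pd.DataFrame({
    "DATE_TIME": times,
    "IRRADIATION": rng.uniform(0.0, 1.0, size=len(times)),
})
demo["DC_POWER"] = 1000.0 * demo["IRRADIATION"] + rng.normal(0.0, 20.0, size=len(times))

# Sort by time, then cut at the timestamp that starts the last 20% of rows.
demo = demo.sort_values("DATE_TIME").reset_index(drop=True)
split_idx = int(len(demo) * 0.8)
cutoff = demo["DATE_TIME"].iloc[split_idx]

train = demo[demo["DATE_TIME"] < cutoff]
test = demo[demo["DATE_TIME"] >= cutoff]

print(f"Train: {len(train)} rows, ending {train['DATE_TIME'].max()}")
print(f"Test : {len(test)} rows, starting {test['DATE_TIME'].min()}")
```

Cutting on a timestamp rather than a row index keeps all inverters that share a timestamp on the same side of the split.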
7. Metrics chosen
R² answers “How much of the output variance do we explain?”, while MAE in kW tells plant operators the average absolute error—easy to translate into missed revenue or curtailment.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.1f} kW")
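For readers who want to see what the two metrics compute, the sketch below reproduces them by hand on four made-up readings and compares the result with scikit-learn. The arrays `y_true` and `y_hat` are illustrative values, not plant data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Four made-up actual and predicted power readings in kW.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 380.0])

# MAE: average size of the error, in the same unit as the target.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MAE manual: {mae_manual:.1f} kW, sklearn: {mean_absolute_error(y_true, y_hat):.1f} kW")
print(f"R2  manual: {r2_manual:.3f}, sklearn: {r2_score(y_true, y_hat):.3f}")
```

Here the absolute errors are 10, 10, 10, and 20 kW, so the MAE is 12.5 kW; the residual sum of squares is 700 against a total of 50,000, giving an R² of 0.986.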
8. Standard scaling
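Standard scaling puts the three predictors on comparable footing (kW per σ), so coefficient magnitudes can be compared directly as a rough guide to relative importance. Because module temperature and irradiation tend to rise together, read the individual values with some caution.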
coeffs = pd.Series(linreg.coef_, index=features)\
.sort_values(ascending=False)
print("\nImpact of each weather variable on DC power (kW per std‑dev):")
print(coeffs)
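Coefficients in kW per standard deviation can be converted back to physical units by dividing by the scaler's fitted `scale_` values. The sketch below shows this on synthetic data with a made-up ground truth (1000 kW per irradiation unit, -2 kW per degree of module temperature); `demo_pipe`, `X_demo`, and `y_demo` are illustrative names and use a flat two-step pipeline rather than the nested one above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors with a known, made-up relationship to power.
rng = np.random.default_rng(1)
n = 500
X_demo = pd.DataFrame({
    "MODULE_TEMPERATURE": rng.uniform(20.0, 60.0, n),
    "IRRADIATION": rng.uniform(0.0, 1.2, n),
})
y_demo = (1000.0 * X_demo["IRRADIATION"]
          - 2.0 * X_demo["MODULE_TEMPERATURE"]
          + rng.normal(0.0, 5.0, n))

demo_pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

scaler = demo_pipe.named_steps["scale"]
model = demo_pipe.named_steps["model"]

# kW per standard deviation of each input (what the fitted model stores).
per_sigma = pd.Series(model.coef_, index=X_demo.columns)
# kW per original unit: divide by each input's standard deviation.
per_unit = per_sigma / scaler.scale_

print("kW per std-dev:")
print(per_sigma)
print("\nkW per original unit:")
print(per_unit)
```

With the nested pipeline from this walkthrough, the fitted scaler should be reachable as `pipe.named_steps['prep'].named_transformers_['num'].named_steps['scale']`.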
9. Pipeline persistence
Saving the scaler and the regression weights in one .pkl file ensures the model treats tomorrow’s SCADA feed exactly like today’s training data.
joblib.dump(pipe, "solar_dc_power_linreg.pkl")
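A saved pipeline can later be reloaded and applied to fresh readings. The round trip below is a minimal sketch on a tiny made-up training set; `demo_pipe`, the file name `demo_linreg.pkl`, and every number are assumptions for illustration. The new readings must carry the same column names the pipeline was trained on.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

features = ["AMBIENT_TEMPERATURE", "MODULE_TEMPERATURE", "IRRADIATION"]

# Tiny made-up training set, only to have a fitted pipeline to save.
train_demo = pd.DataFrame({
    "AMBIENT_TEMPERATURE": [24.0, 26.0, 28.0, 30.0, 31.0, 29.0],
    "MODULE_TEMPERATURE": [27.0, 35.0, 44.0, 52.0, 55.0, 47.0],
    "IRRADIATION": [0.10, 0.35, 0.60, 0.85, 0.95, 0.70],
})
power_demo = [95.0, 340.0, 580.0, 810.0, 900.0, 670.0]

demo_pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
demo_pipe.fit(train_demo[features], power_demo)

# Save to a temporary directory, then load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_linreg.pkl")
    joblib.dump(demo_pipe, model_path)
    loaded_pipe = joblib.load(model_path)

# New readings with the same column names as the training features.
new_readings = pd.DataFrame({
    "AMBIENT_TEMPERATURE": [27.0, 30.5],
    "MODULE_TEMPERATURE": [40.0, 53.0],
    "IRRADIATION": [0.50, 0.90],
})
predicted_kw = loaded_pipe.predict(new_readings[features])
print(predicted_kw)
```

Because the scaler travels inside the pipeline, the new readings are standardized with the training-set means and standard deviations automatically.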
Summary
In fewer than fifty lines of code, we turned raw SCADA and weather logs into a baseline solar-output predictor. The linear model’s clarity helps operations staff grasp how sunlight and temperature tug daily production, while its predictions serve as a quick baseline for scheduling maintenance or, when fed forecast weather, bidding into the day-ahead market. Once this interpretable benchmark is in place, you can layer in time-lags, cloud-cover imagery, or gradient-boosting to shave additional kilowatts off the error, confident you’re really improving on a solid starting point.