Solar Energy Output Prediction using Linear Regression in ML


A photovoltaic (PV) farm’s revenue depends on how much electricity its panels actually produce hour by hour. Grid operators schedule reserve capacity based on forecasts of that output, and plant managers use them to decide when to clean modules or tilt trackers.

In this mini-project, we build a linear-regression baseline that predicts the instantaneous DC power (kW) reported by each inverter of a utility-scale solar plant from three weather-station readings: ambient temperature, module temperature, and plane-of-array irradiation. The model is purposely simple and fully transparent, giving engineers a first‑order estimate and a clear view of how each environmental factor pushes production up or down.

Libraries Required

pandas # data loading & joins
numpy # numeric helpers
matplotlib.pyplot # quick sanity plots
scikit‑learn # preprocessing, model, metrics
joblib # save the trained pipeline

Dataset Link

Solar Power Generation Data

Step-by-Step Code Implementation

Why linear regression first? Within a normal operating range, the relationship between plane‑of‑array irradiation and panel output is close to linear; temperature adds second‑order derating. A straight‑line fit captures those effects cleanly and exposes their exact strength in the coefficient table.
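To make the idea concrete, here is a minimal sketch that fits a straight-line model by ordinary least squares. The data is synthetic: the coefficient values, noise levels, and variable names below are assumptions chosen for illustration, not figures from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumed values, not taken from the dataset)
irradiation = rng.uniform(0.0, 1.0, size=500)                      # plane-of-array irradiation
module_temp = 25 + 30 * irradiation + rng.normal(0, 2, size=500)   # degrees C, rises with sun

# Assumed "true" relationship: power rises with irradiation, derates with heat
true_intercept, true_irr, true_temp = 50.0, 12000.0, -20.0
power = (true_intercept
         + true_irr * irradiation
         + true_temp * module_temp
         + rng.normal(0, 50, size=500))                            # measurement noise

# Ordinary least squares on the design matrix [1, irradiation, module_temp]
A = np.column_stack([np.ones_like(irradiation), irradiation, module_temp])
coef, *_ = np.linalg.lstsq(A, power, rcond=None)
intercept_hat, irr_hat, temp_hat = coef

print(f"irradiation coefficient : {irr_hat:.1f}")
print(f"module-temp coefficient : {temp_hat:.2f}")
```

The fitted coefficients should land close to the assumed ones, with a positive sign for irradiation and a negative sign for module temperature. That is the kind of reading the coefficient table gives on the real data.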

1. Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load generation & weather data

The Kaggle dataset ships as four CSVs: Plant_1_Generation_Data.csv and Plant_1_Weather_Sensor_Data.csv, plus the matching pair for Plant 2.
We’ll work with Plant 1 for clarity; repeat the same steps for Plant 2 if desired.

# adjust the paths if your download uses different file names
gen   = pd.read_csv("Plant_1_Generation_Data.csv")
weath = pd.read_csv("Plant_1_Weather_Sensor_Data.csv")

3. Join logic

Generation rows arrive at 15‑minute intervals, one row per inverter; the weather station logs one row per timestamp at the same cadence. An inner join on PLANT_ID and DATE_TIME keeps only timestamps where both readings exist, so no weather value has to be forward‑filled or otherwise imputed.

# Plant 1 generation timestamps are day-first (e.g. 15-05-2020 00:00);
# check the first rows of your copy before relying on this
gen['DATE_TIME']   = pd.to_datetime(gen['DATE_TIME'], dayfirst=True)
weath['DATE_TIME'] = pd.to_datetime(weath['DATE_TIME'])

# merge on plant and timestamp: every inverter row gets that timestamp's weather reading
df = gen.merge(weath, on=['PLANT_ID', 'DATE_TIME'], how='inner')
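The effect of the inner join is easy to check on a toy example. The sketch below uses two small hand-made frames (all values and inverter names are hypothetical) and adds `validate="many_to_one"`, which makes pandas raise an error if the weather table ever holds more than one row for a plant and timestamp.

```python
import pandas as pd

# Tiny hand-made frames (hypothetical values) mimicking the two tables
gen_demo = pd.DataFrame({
    "PLANT_ID": [1, 1, 1, 1],
    "DATE_TIME": pd.to_datetime(["2020-05-15 10:00", "2020-05-15 10:00",
                                 "2020-05-15 10:15", "2020-05-15 10:30"]),
    "SOURCE_KEY": ["inv_A", "inv_B", "inv_A", "inv_A"],
    "DC_POWER": [5000.0, 5100.0, 5600.0, 6000.0],
})
weath_demo = pd.DataFrame({
    "PLANT_ID": [1, 1],
    "DATE_TIME": pd.to_datetime(["2020-05-15 10:00", "2020-05-15 10:15"]),
    "IRRADIATION": [0.55, 0.61],
})

# Inner join: the 10:30 generation row has no weather match and is dropped.
# validate="many_to_one" asserts one weather row per (plant, timestamp).
merged = gen_demo.merge(weath_demo, on=["PLANT_ID", "DATE_TIME"],
                        how="inner", validate="many_to_one")
dropped = len(gen_demo) - len(merged)

print(merged)
print(f"Generation rows without a weather match: {dropped}")
```

Comparing row counts before and after the merge on the real tables is a cheap way to see how many generation rows were discarded for lack of a weather reading.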

4. Select predictors & target

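Once the model is fitted, a positive IRRADIATION coefficient confirms that more sun means more power. A negative MODULE_TEMPERATURE coefficient (if present) quantifies thermal losses: each extra degree knocks off a predictable slice of output. Module temperature rises with irradiation, so the two predictors are strongly correlated and individual coefficient signs should be read with some caution.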

features = ['AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION']
target   = 'DC_POWER'              # DC power at the inverter, before AC conversion

X = df[features]
y = df[target]

5. Pipeline: scaling + linear model

numeric_transformer = Pipeline([('scale', StandardScaler())])

preprocess = ColumnTransformer(
        transformers=[('num', numeric_transformer, features)],
        remainder='drop')

linreg = LinearRegression()

pipe = Pipeline(steps=[('prep', preprocess),
                      ('model', linreg)])

6. Train‑test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
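A shuffled split places neighbouring timestamps in both the training and the test set, which can make the score look better than it would on genuinely unseen days. As a cross-check, one option is a chronological hold-out. The sketch below is an alternative to the split above, not the tutorial's method; it runs on a synthetic frame whose values and the `demo` name are made up.

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for the merged table (values are made up)
times = pd.date_range("2020-05-15", periods=100, freq="15min")
demo = pd.DataFrame({
    "DATE_TIME": times,
    "IRRADIATION": np.linspace(0.0, 1.0, 100),
    "DC_POWER": np.linspace(0.0, 12000.0, 100),
})

# Sort by time, then hold out the final 20 % of rows as the test period
demo = demo.sort_values("DATE_TIME").reset_index(drop=True)
cutoff = int(len(demo) * 0.8)
train_demo, test_demo = demo.iloc[:cutoff], demo.iloc[cutoff:]

print(f"train: {len(train_demo)} rows, ends   {train_demo['DATE_TIME'].max()}")
print(f"test : {len(test_demo)} rows, starts {test_demo['DATE_TIME'].min()}")
```

If the metrics on a time-ordered split are close to those from the shuffled split, the baseline is probably not benefiting much from temporal overlap.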

7. Metrics chosen

R² answers “How much of the output variance do we explain?”, while MAE in kW tells plant operators the average absolute error—easy to translate into missed revenue or curtailment.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.1f} kW")
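Both metrics are simple enough to compute by hand, which helps when explaining them to operations staff. The sketch below reproduces them with NumPy on four made-up readings (the numbers are illustrative only).

```python
import numpy as np

# Four made-up readings in kW: actual vs. predicted
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat  = np.array([12.0, 18.0, 33.0, 39.0])

# MAE: the mean of the absolute errors, in the same unit as the target
mae = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE = {mae:.1f} kW")   # 2.0 kW
print(f"R²  = {r2:.3f}")       # 0.964
```

Here the absolute errors are 2, 2, 3 and 1 kW, giving an MAE of 2.0 kW; the residual sum of squares is 18 against a total of 500, giving R² = 0.964.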

8. Standard scaling

Standard scaling puts the three predictors on a comparable footing: each coefficient is expressed in kW per standard deviation (σ) of its predictor, so coefficient magnitudes can be compared directly as a rough guide to relative importance.

# read the coefficients from the fitted model step inside the pipeline
coeffs = pd.Series(pipe.named_steps['model'].coef_, index=features)\
            .sort_values(ascending=False)
print("\nImpact of each weather variable on DC power (kW per std‑dev):")
print(coeffs)
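Coefficients per standard deviation are convenient for ranking, but engineers often want them in physical units (kW per degree, kW per unit of irradiation). Dividing each coefficient by the scaler's stored standard deviation converts it back. The sketch below shows this on synthetic data; the predictors, true slopes, and `demo_pipe` name are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Two synthetic predictors on very different scales (assumed values)
irr = rng.uniform(0.0, 1.0, 300)        # irradiation-like, range 0..1
temp = rng.uniform(20.0, 60.0, 300)     # temperature-like, range 20..60
X_demo = np.column_stack([irr, temp])
y_demo = 10000.0 * irr - 15.0 * temp + rng.normal(0, 10, 300)

demo_pipe = Pipeline([("scale", StandardScaler()),
                      ("model", LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

scaler = demo_pipe.named_steps["scale"]
model = demo_pipe.named_steps["model"]

per_sigma = model.coef_                  # change in y per one std-dev of each predictor
per_unit = model.coef_ / scaler.scale_   # change in y per one original unit

print("per std-dev :", np.round(per_sigma, 1))
print("per unit    :", np.round(per_unit, 2))
```

In the tutorial's pipeline the scaler sits inside the ColumnTransformer, so it would be reached through `pipe.named_steps['prep'].named_transformers_['num'].named_steps['scale']` rather than directly.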

9. Pipeline persistence

Saving the scaler and the regression weights in one .pkl file ensures the model treats tomorrow’s SCADA feed exactly like today’s training data.

joblib.dump(pipe, "solar_dc_power_linreg.pkl")
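Loading the file back is the other half of persistence. The sketch below round-trips a small stand-in pipeline through a temporary file and checks that the restored copy gives the same prediction; the training values, file name, and `demo_pipe` name are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in training data: [ambient temp, module temp, irradiation] -> DC power
X_demo = np.array([[25.0, 30.0, 0.2],
                   [28.0, 40.0, 0.5],
                   [30.0, 50.0, 0.8],
                   [32.0, 55.0, 0.9]])
y_demo = np.array([2500.0, 6000.0, 9500.0, 10500.0])

demo_pipe = Pipeline([("scale", StandardScaler()),
                      ("model", LinearRegression())]).fit(X_demo, y_demo)

# Save to a temporary file, then load it back as a fresh object
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_linreg.pkl")
    joblib.dump(demo_pipe, path)
    restored = joblib.load(path)

# The restored pipeline applies the same scaling and weights
new_reading = np.array([[29.0, 45.0, 0.6]])
same = np.allclose(demo_pipe.predict(new_reading), restored.predict(new_reading))
print("Predictions match after reload:", same)
```

The tutorial's pipeline selects columns by name, so new readings passed to the reloaded model should be a DataFrame with the same three column names used in training. It is also safest to load the file with the same scikit-learn version that saved it.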

Summary

In around sixty lines of code, we turned raw SCADA and weather logs into a baseline solar‑output estimator. The linear model’s clarity helps operations staff grasp how sunlight and temperature drive daily production, while its predictions serve as a quick baseline for scheduling maintenance or bidding into the day‑ahead market. Once this interpretable benchmark is in place, you can layer in time‑lags, cloud‑cover imagery, or gradient‑boosting to shave additional kilowatts off the error, and measure every improvement against a solid starting point.
