Solar Energy Output Prediction using Linear Regression in ML

A photovoltaic (PV) farm’s revenue depends on how much electricity its panels actually produce hour by hour. Grid operators schedule reserve capacity based on forecasts of that output, and plant managers use the same forecasts to decide when to clean modules or tilt trackers.

In this mini-project, we build a linear-regression baseline that predicts the instantaneous DC power output (kW) of each inverter in a utility-scale solar plant from three weather-station readings: ambient temperature, module temperature, and plane-of-array irradiation. The model is purposely simple and fully transparent, giving engineers a first-order estimate and a clear view of how each environmental factor pushes production up or down.

  Libraries Required

  • pandas # data loading & joins
  • numpy # numeric helpers
  • matplotlib.pyplot # quick sanity plots
  • scikit‑learn # preprocessing, model, metrics
  • joblib # save the trained pipeline

Dataset Link

Solar Power Generation Data

Step-by-Step Code Implementation

Why linear regression first? Within a normal operating range, the relationship between plane‑of‑array irradiation and panel output is close to linear; temperature adds second‑order derating. A straight‑line fit captures those effects cleanly and exposes their exact strength in the coefficient table.
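
As a rough illustration of that claim, the sketch below generates synthetic readings from a commonly used PV derating formula and checks how well a straight line fits them. The rated power, the temperature coefficient of -0.4 % per °C, and the irradiance-to-temperature relationship are all assumed values chosen for the demonstration, not figures from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Assumed, illustrative constants (not taken from the dataset)
RATED_KW = 100.0   # hypothetical array rating at 1000 W/m² and 25 °C
GAMMA = -0.004     # assumed power temperature coefficient per °C

irradiance = rng.uniform(0.0, 1000.0, size=2000)                          # W/m²
module_temp = 25.0 + 0.03 * irradiance + rng.normal(0.0, 2.0, size=2000)  # °C

# Irradiance sets the main output; module temperature trims it by a few percent
power_kw = RATED_KW * (irradiance / 1000.0) * (1.0 + GAMMA * (module_temp - 25.0))

X_demo = np.column_stack([irradiance, module_temp])
line_fit = LinearRegression().fit(X_demo, power_kw)
r2_demo = r2_score(power_kw, line_fit.predict(X_demo))
print(f"R² of a straight-line fit on the synthetic data: {r2_demo:.3f}")
```

On data like this the straight line leaves only a small curved residual, which is the part a non-linear model could later pick up.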

1.  Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2.  Load generation & weather data

The Kaggle dataset ships as four CSVs: Plant_1_Generation_Data.csv and Plant_1_Weather_Sensor_Data.csv, plus the same pair for Plant 2. Adjust the paths below if your copies are named differently.
We’ll work with Plant 1 for clarity; repeat the same steps for Plant 2 if desired.

gen   = pd.read_csv("Plant_1_Generation_Data.csv")
weath = pd.read_csv("Plant_1_Weather_Sensor_Data.csv")

3. Join logic

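Generation rows arrive at 15-minute intervals per inverter; the weather station logs one plant-level reading at the same cadence. An inner join on PLANT_ID and DATE_TIME keeps only timestamps where both readings exist, so every inverter row is paired with a measured weather reading rather than a forward-filled or imputed one.

# In the Kaggle copy, Plant 1 generation timestamps are day-first (e.g. 15-05-2020 00:00);
# the weather file uses ISO format. Check your own files before relying on this.
gen['DATE_TIME']   = pd.to_datetime(gen['DATE_TIME'], dayfirst=True)
weath['DATE_TIME'] = pd.to_datetime(weath['DATE_TIME'])

# attach the plant-level weather reading to every inverter row with the same timestamp
df = gen.merge(weath, on=['PLANT_ID', 'DATE_TIME'], how='inner',
               suffixes=('_INVERTER', '_SENSOR'))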
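
Before trusting the inner join, it can help to see how many generation rows it silently drops. The sketch below runs a left join on two toy frames and uses pandas' `indicator` and `validate` options to audit the match. The toy values, the `_demo` names, and the assumption that generation timestamps are day-first while weather timestamps are ISO are illustrative only.

```python
import pandas as pd

# Toy generation rows: two inverters, day-first timestamps (assumed format)
gen_demo = pd.DataFrame({
    "DATE_TIME": ["15-05-2020 12:00", "15-05-2020 12:00", "15-05-2020 12:15"],
    "PLANT_ID": [1, 1, 1],
    "SOURCE_KEY": ["inv_a", "inv_b", "inv_a"],
    "DC_POWER": [500.0, 480.0, 510.0],
})
# Toy weather rows: one sensor, ISO timestamps, the 12:15 reading is missing
weath_demo = pd.DataFrame({
    "DATE_TIME": ["2020-05-15 12:00:00"],
    "PLANT_ID": [1],
    "SOURCE_KEY": ["sensor_1"],
    "IRRADIATION": [0.8],
})

gen_demo["DATE_TIME"] = pd.to_datetime(gen_demo["DATE_TIME"], format="%d-%m-%Y %H:%M")
weath_demo["DATE_TIME"] = pd.to_datetime(weath_demo["DATE_TIME"])

# validate= raises if the weather table has duplicate keys;
# indicator= adds a _merge column saying where each row came from
audit = gen_demo.merge(
    weath_demo, on=["PLANT_ID", "DATE_TIME"], how="left",
    validate="many_to_one", indicator=True,
    suffixes=("_INVERTER", "_SENSOR"),
)
n_unmatched = int((audit["_merge"] == "left_only").sum())
joined_demo = audit[audit["_merge"] == "both"].drop(columns="_merge")
print(f"{n_unmatched} generation row(s) had no weather reading")
```

Keeping only the `both` rows reproduces the inner join, with the count of dropped rows available for a sanity check.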

4.  Select predictors & target

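Once the model is fitted, a positive IRRADIATION coefficient confirms that more sun means more power. A negative MODULE_TEMPERATURE coefficient, if one appears, quantifies thermal losses: each extra degree knocks off a predictable slice of output. Because module temperature rises with irradiation, these predictors are strongly correlated, so read the sign and size of the temperature coefficients with some caution.

features = ['AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION']
target   = 'DC_POWER'              # DC power at the inverter, before AC conversion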

X = df[features]
y = df[target]
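
Since the three predictors move together over the day, a quick look at missing values and pairwise correlations is a reasonable check before reading much into individual coefficients. The sketch below does this on synthetic readings; the numbers and the way the temperatures depend on irradiation are assumptions made up for the demo, and the same two calls can be run on the real `X`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-ins for the three features (arbitrary, assumed relationships)
irr = rng.uniform(0.0, 1.0, n)
amb = 22.0 + 8.0 * irr + rng.normal(0.0, 1.5, n)
mod = amb + 25.0 * irr + rng.normal(0.0, 2.0, n)

X_demo = pd.DataFrame({
    "AMBIENT_TEMPERATURE": amb,
    "MODULE_TEMPERATURE": mod,
    "IRRADIATION": irr,
})

missing_per_column = X_demo.isna().sum()   # LinearRegression cannot handle NaN
corr = X_demo.corr()                       # Pearson correlation between predictors

print(missing_per_column)
print(corr.round(2))
```

Correlations close to 1 between two predictors mean their individual coefficients can trade off against each other even when the overall fit is stable.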

5. Pipeline: scaling + linear model

numeric_transformer = Pipeline([('scale', StandardScaler())])

preprocess = ColumnTransformer(
        transformers=[('num', numeric_transformer, features)],
        remainder='drop')

linreg = LinearRegression()

pipe = Pipeline(steps=[('prep', preprocess),
                      ('model', linreg)])
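
Because all three features are numeric and receive the same treatment, the nested structure above should behave like a plain scaler followed by the regressor. The sketch below checks that on synthetic data; the frame, the coefficients used to generate the target, and the `_pipe` names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

cols = ["AMBIENT_TEMPERATURE", "MODULE_TEMPERATURE", "IRRADIATION"]
rng = np.random.default_rng(2)

# Synthetic features with different scales, plus a noisy linear target
X_demo = pd.DataFrame(
    rng.normal(size=(200, 3)) * [5.0, 10.0, 0.3] + [25.0, 40.0, 0.5],
    columns=cols,
)
y_demo = (900.0 * X_demo["IRRADIATION"]
          - 3.0 * X_demo["MODULE_TEMPERATURE"]
          + rng.normal(0.0, 5.0, 200))

# Same layout as the tutorial: ColumnTransformer wrapping a scaling sub-pipeline
verbose_pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("num", Pipeline([("scale", StandardScaler())]), cols)],
        remainder="drop")),
    ("model", LinearRegression()),
])
# Compact alternative when every column is scaled the same way
compact_pipe = make_pipeline(StandardScaler(), LinearRegression())

verbose_pred = verbose_pipe.fit(X_demo, y_demo).predict(X_demo)
compact_pred = compact_pipe.fit(X_demo, y_demo).predict(X_demo)
same_predictions = np.allclose(verbose_pred, compact_pred)
print("Predictions match:", same_predictions)
```

The ColumnTransformer form earns its keep once the frame carries extra columns that should be dropped or handled differently, such as categorical inverter IDs.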

6. Train‑test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
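
A shuffled split mixes readings from the same day into both sets. As an alternative worth comparing, the sketch below holds out the most recent timestamps instead. The helper `chronological_split` is a hypothetical function written for this illustration, not part of scikit-learn, and the toy frame stands in for the merged `df`.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, time_col="DATE_TIME", test_fraction=0.2):
    """Hold out the most recent share of distinct timestamps as the test set."""
    stamps = frame[time_col].drop_duplicates().sort_values().reset_index(drop=True)
    cutoff = stamps.iloc[int(len(stamps) * (1.0 - test_fraction))]
    train = frame[frame[time_col] < cutoff]
    test = frame[frame[time_col] >= cutoff]
    return train, test

# Toy frame: 10 timestamps, two inverter rows per timestamp
times = pd.date_range("2020-05-15", periods=10, freq="15min")
demo = pd.DataFrame({
    "DATE_TIME": times.repeat(2),
    "IRRADIATION": np.linspace(0.0, 1.0, 20),
    "DC_POWER": np.linspace(0.0, 900.0, 20),
})

train_demo, test_demo = chronological_split(demo)
print(len(train_demo), "training rows,", len(test_demo), "test rows")
```

If the scores from this split are noticeably worse than those from the shuffled split, the shuffled figures were probably flattered by near-duplicate neighbouring readings.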

7. Metrics chosen

R² answers “How much of the output variance do we explain?”, while MAE in kW tells plant operators the average absolute error—easy to translate into missed revenue or curtailment.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.1f} kW")
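
To make the two metrics concrete, the sketch below computes them by hand on four made-up readings and compares the results with scikit-learn's functions. The numbers are invented for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([0.0, 100.0, 200.0, 300.0])   # made-up DC power readings, kW
y_hat = np.array([10.0, 90.0, 210.0, 320.0])    # made-up predictions, kW

# MAE: average size of the miss, in the same unit as the target
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (unexplained variation / total variation around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MAE : {mae_manual:.1f} kW (sklearn: {mean_absolute_error(y_true, y_hat):.1f})")
print(f"R²  : {r2_manual:.3f} (sklearn: {r2_score(y_true, y_hat):.3f})")
```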

8.  Standard scaling

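Standard scaling puts the three predictors on a comparable footing: each coefficient is the change in DC power (kW) for a one-standard-deviation increase in that predictor, so coefficient magnitudes can be compared directly. Treat the ranking as indicative rather than exact, because the predictors are correlated with one another.

# read the fitted model back out of the pipeline so this also works on a reloaded pipeline
coeffs = pd.Series(pipe.named_steps['model'].coef_, index=features)\
            .sort_values(key=abs, ascending=False)   # rank by magnitude, keep the sign
print("\nImpact of each weather variable on DC power (kW per std‑dev):")
print(coeffs)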
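
Operators often want the effect per physical unit (kW per °C, say) rather than per standard deviation. Dividing each coefficient by the scaler's `scale_` converts it back. The sketch below checks this on a noise-free synthetic target with known slopes; the data, slopes, and the simplified two-step pipeline are assumptions for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cols = ["MODULE_TEMPERATURE", "IRRADIATION"]
rng = np.random.default_rng(3)
X_demo = pd.DataFrame({
    "MODULE_TEMPERATURE": rng.uniform(20.0, 60.0, 300),
    "IRRADIATION": rng.uniform(0.0, 1.0, 300),
})
# Noise-free target with known raw-unit slopes: -2 per °C, +900 per irradiation unit
y_demo = 50.0 - 2.0 * X_demo["MODULE_TEMPERATURE"] + 900.0 * X_demo["IRRADIATION"]

demo_pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

scaler = demo_pipe.named_steps["scale"]
model = demo_pipe.named_steps["model"]

# Undo the standardisation: slope per raw unit = slope per std-dev / std-dev
raw_coefs = model.coef_ / scaler.scale_
raw_intercept = model.intercept_ - np.sum(model.coef_ * scaler.mean_ / scaler.scale_)

print(pd.Series(raw_coefs, index=cols))
print(f"intercept: {raw_intercept:.1f}")
```

In the tutorial's nested pipeline the fitted scaler sits one level deeper, at `pipe.named_steps['prep'].named_transformers_['num'].named_steps['scale']`.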

9. Pipeline persistence

Saving the scaler and the regression weights in one .pkl file ensures that tomorrow’s SCADA feed is scaled and scored exactly like today’s training data.

joblib.dump(pipe, "solar_dc_power_linreg.pkl")
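
The other half of persistence is loading the file and scoring a fresh reading. The sketch below round-trips a small stand-in pipeline through a temporary file and confirms the reloaded copy predicts the same value. The training data, the file name `demo_linreg.pkl`, and the sample reading are made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cols = ["AMBIENT_TEMPERATURE", "MODULE_TEMPERATURE", "IRRADIATION"]
rng = np.random.default_rng(4)

# Small synthetic training set standing in for the real merged data
X_demo = pd.DataFrame(
    rng.uniform([15.0, 20.0, 0.0], [35.0, 60.0, 1.0], size=(100, 3)),
    columns=cols,
)
y_demo = 900.0 * X_demo["IRRADIATION"] - 2.0 * X_demo["MODULE_TEMPERATURE"]

demo_pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

# Save and reload inside a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_linreg.pkl")
    joblib.dump(demo_pipe, path)
    reloaded = joblib.load(path)

# A new reading must use the same column names the pipeline was trained on
new_reading = pd.DataFrame([{
    "AMBIENT_TEMPERATURE": 28.0,
    "MODULE_TEMPERATURE": 45.0,
    "IRRADIATION": 0.7,
}])
pred_original = demo_pipe.predict(new_reading)
pred_reloaded = reloaded.predict(new_reading)
print(f"Predicted DC power: {pred_reloaded[0]:.1f} kW")
```

Pickled pipelines are generally only safe to load with the same scikit-learn version that wrote them, so it is worth recording that version alongside the file.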

Summary

In a few dozen lines of code, we turned raw SCADA and weather logs into a baseline solar-output predictor. The linear model’s clarity helps operations staff grasp how sunlight and temperature drive daily production, while its predictions serve as a quick baseline for scheduling maintenance or bidding into the day-ahead market. Once this interpretable benchmark is in place, you can layer in time lags, cloud-cover imagery, or gradient boosting to shave additional kilowatts off the error, confident that any gain is measured against a solid starting point.


ProjectGurukul Team
