Solar Farm Cost Prediction with Ridge & Lasso Mixed Regression in ML


Utility‑scale photovoltaic (PV) farms can exceed US$1 billion, and lenders demand a bankable cost estimate well before EPC contracts are signed. Early estimates hinge on many intertwined factors—module technology, tracking system, site irradiance, country labour index, year of notice‑to‑proceed, and expected capacity factor. These predictors are often highly collinear (e.g. “tracking = yes” strongly correlates with higher capacity factors), making ordinary least‑squares unstable; a pure Lasso model may over‑shrink and discard genuinely helpful variables.

A mixed (Elastic Net) regression, which blends Ridge’s ℓ2 penalty with Lasso’s ℓ1 penalty, suits this setting: the ℓ2 term tends to keep groups of correlated but important variables together, while the ℓ1 term can shrink trivial ones to exactly zero. We will train an Elastic Net model to forecast a solar project’s construction cost (USD per MWh, used here as a capital‑cost proxy) from public, pre‑financial‑close descriptors.
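
The collinearity problem can be illustrated on synthetic data. The sketch below is not part of the solar dataset: it assumes two made-up, nearly identical predictors whose true coefficients are both 1, and compares ordinary least squares with an Elastic Net whose alpha and l1_ratio values were picked arbitrarily for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
n = 200

# Two nearly identical predictors (hypothetical stand-ins for, say, a
# tracking flag and capacity factor), each with a true coefficient of 1.
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X_demo = np.column_stack([x1, x2])
y_demo = x1 + x2 + 0.5 * rng.normal(size=n)

# OLS can only pin down the SUM of the two coefficients reliably;
# the split between them is driven largely by noise.
ols = LinearRegression().fit(X_demo, y_demo)

# The l2 part of the Elastic Net penalty pulls correlated coefficients
# towards each other, so the split becomes stable.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X_demo, y_demo)

print("OLS coefficients:        ", ols.coef_)
print("Elastic Net coefficients:", enet.coef_)
```

In runs like this, the OLS coefficients typically land far from (1, 1) even though their sum stays close to 2, while the Elastic Net assigns the two predictors similar, slightly shrunken weights.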

Libraries Required

Purpose	Library
Data wrangling	pandas, numpy
Visualisation	matplotlib, seaborn
ML pipeline	scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Metrics	mean_squared_error, r2_score

Dataset Link

LCOE & Capital Cost of Electricity Generation

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

2. Load dataset

The LCOE file collects construction, refurbishment, and O&M costs for dozens of solar plants worldwide, plus year, plant sub‑type (fixed‑tilt PV, tracking PV, CSP), and country.

df = pd.read_csv("LCOE_Capital_Costs.csv")  # filename inside Kaggle notebook repo
# Keep only solar rows. Note: matching 'Solar' may also keep CSP entries;
# filter further on the sub-type if a PV-only model is required.
df = df[df['Plant type'].str.contains('Solar', case=False)]

3. Feature engineering

We use Construction costs (USD/MWh) as a capital‑expenditure proxy. (To express it in USD/kW, multiply by the lifetime energy generated per kW of capacity, in MWh/kW; the modelling mechanics stay identical.)

# Example available columns (check file for exact names)
keep = df[['Year', 'Country', 'Plant category', 'Plant type',
           'Construction costs (USD/MWh)',
           'Capacity factor (%)',
           'Refurbishment costs (USD/MWh)']].dropna()

# Target: construction CAPEX proxy (USD/MWh)  → converts to USD/kW with CF + hours if desired
y = keep['Construction costs (USD/MWh)']

X = keep.drop(columns=['Construction costs (USD/MWh)'])
cat_cols = ['Country', 'Plant category', 'Plant type']
num_cols = ['Year', 'Capacity factor (%)', 'Refurbishment costs (USD/MWh)']
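
The USD/MWh to USD/kW conversion mentioned above can be sketched as simple arithmetic. This is a rough, undiscounted approximation: the helper name, the 25-year lifetime, and the example figures are assumptions made for illustration, and the dataset's capacity factor (given in %) would need dividing by 100 first.

```python
HOURS_PER_YEAR = 8760


def usd_per_mwh_to_usd_per_kw(cost_usd_per_mwh, capacity_factor, lifetime_years):
    """Rough, undiscounted conversion from USD/MWh to USD/kW.

    capacity_factor is a fraction (0.25, not 25), and lifetime_years is an
    assumed plant lifetime; neither value comes from the dataset itself.
    """
    # Energy produced over the plant's life by 1 kW of capacity, in MWh.
    lifetime_mwh_per_kw = capacity_factor * HOURS_PER_YEAR * lifetime_years / 1000
    return cost_usd_per_mwh * lifetime_mwh_per_kw


# Hypothetical plant: 20 USD/MWh construction cost, 25% capacity factor,
# 25-year lifetime.
example_usd_per_kw = usd_per_mwh_to_usd_per_kw(20.0, 0.25, 25)
print(f"{example_usd_per_kw:,.0f} USD/kW")
```

Because published LCOE components are usually discounted, a conversion like this is only an order-of-magnitude check.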

4. Build an Elastic Net pipeline

Categorical variables are expanded into one-hot vectors; numeric variables are z-scaled so the penalty treats them on a common scale. Wrapping everything inside the Pipeline means the encoder and scaler are re-fitted on each training fold only, which avoids data leakage during CV.

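preprocess = ColumnTransformer([
    # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2);
    # handle_unknown='ignore' keeps CV folds from failing on unseen countries
    ('cat', OneHotEncoder(drop='first', sparse_output=False,
                          handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])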

pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=15000, random_state=42))
])

5. Train/test split + hyper‑parameter grid search

α controls overall penalty strength: higher α ⇒ more shrinkage.
l1_ratio (0 → Ridge, 1 → Lasso) controls sparsity vs. stability.
A grid of 18 α values × 9 mix ratios (162 models) is searched via 5‑fold CV; the lowest RMSE combination wins.
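
For reference, the objective that scikit-learn's ElasticNet minimises can be written as follows, where n is the number of training rows, w the coefficient vector, and ρ stands for l1_ratio (the symbol ρ is introduced here only for brevity):

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the ℓ1 term (Lasso), and ρ = 0 leaves only the ℓ2 term (Ridge), which matches the two ends of the l1_ratio range described above.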

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

param_grid = {'enet__alpha': np.logspace(-3, 1, 18),
              'enet__l1_ratio': np.linspace(0.1, 0.9, 9)}   # Ridge‑heavy→Lasso‑heavy

gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)

print("Best α:",       gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])

6. Evaluate model

RMSE in USD/MWh provides an intuitive measure of accuracy; R² shows the variance explained on unseen records.

y_pred = gs.predict(X_test)
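# np.sqrt of the MSE works on every scikit-learn version; the older
# squared=False argument is deprecated in recent releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))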
r2   = r2_score(y_test, y_pred)

print(f"Hold‑out RMSE: ${rmse:,.0f} per MWh | R²: {r2:.3f}")
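
As a check on what the two metrics measure, they can be computed by hand on a tiny example. The four actual and predicted values below are invented for illustration.

```python
import numpy as np

# Made-up actual and predicted construction costs (USD/MWh).
y_true = np.array([100.0, 120.0, 90.0, 110.0])
y_hat = np.array([110.0, 115.0, 95.0, 100.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: one minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.2f} USD/MWh | R²: {r2_manual:.2f}")
```

Here the squared residuals sum to 250 against a total sum of squares of 500, so R² is 0.5 and the RMSE is the square root of 62.5, about 7.9 USD/MWh.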

7. Interpret coefficients

The coefficients can be read as dollar effects. For example, a coefficient of 35 on a tracking dummy would mean that tracking systems add about $35/MWh relative to the baseline, while a negative capacity-factor coefficient would mean that every 1-point increase in capacity factor (i.e., a sunnier site or trackers) reduces capital cost per MWh because lifetime generation rises. Zeroed dummy variables indicate countries in which costs match the baseline after controlling for year and tech choices.

# Recover complete feature names
ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])

# Reverse scaling for numeric columns
scale = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
# .copy() so the rescaling below does not overwrite the fitted model's coef_
coef  = gs.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scale

imp = pd.Series(coef, index=feature_names).sort_values(key=abs, ascending=False)

plt.figure(figsize=(9,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net Coefficients – Solar CAPEX Drivers')
plt.xlabel('Δ USD per MWh'); plt.tight_layout(); plt.show()
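
The "reverse scaling" step divides each numeric coefficient by the scaler's standard deviation, which turns "effect per standard deviation" into "effect per original unit". The sketch below checks that identity on synthetic data. Plain least squares is swapped in for Elastic Net so the two fits can be compared exactly; with a penalty the rescaled coefficient is still the fitted model's per-unit effect, but it would not match a model trained on unscaled data. All values are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Two synthetic numeric features on very different scales
# (loosely imitating a year and a capacity factor in %).
X_raw = rng.normal(loc=[2015, 25], scale=[3, 5], size=(100, 2))
y_toy = 4.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + rng.normal(size=100)

scaler = StandardScaler().fit(X_raw)

# Coefficients learned on z-scaled inputs: effect per standard deviation.
coef_scaled = LinearRegression().fit(scaler.transform(X_raw), y_toy).coef_

# Coefficients learned on raw inputs: effect per original unit.
coef_raw = LinearRegression().fit(X_raw, y_toy).coef_

# Dividing by the scaler's standard deviations recovers the per-unit effect.
coef_per_unit = coef_scaled / scaler.scale_

print("per-unit from scaled fit:", coef_per_unit)
print("per-unit from raw fit:   ", coef_raw)
```

Only the numeric columns need this treatment; one-hot dummies are not scaled in the pipeline, so their coefficients are already in USD/MWh.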

Summary

With fewer than 100 lines of Python, we produced an Elastic Net “mixed regression” model that:

Forecasts solar‑farm construction CAPEX from public pre‑bid inputs, with accuracy reported as hold‑out RMSE and R².
Balances Ridge robustness and Lasso sparsity, handling multicollinearity while trimming noise.
Explains itself via readable dollar‑impact coefficients useful to developers, lenders, and policy analysts.

Updating the model is trivial: drop a fresh CSV of global bid data into the notebook and call gs.fit—keeping your capital‑cost curves sharp as module prices and labour markets evolve.
