Solar Farm Cost Prediction with Ridge & Lasso Mixed Regression in ML
Utility‑scale photovoltaic (PV) farms can exceed US$1 billion, and lenders demand a bankable cost estimate well before EPC contracts are signed. Early estimates hinge on many intertwined factors—module technology, tracking system, site irradiance, country labour index, year of notice‑to‑proceed, and expected capacity factor. These predictors are often highly collinear (e.g. “tracking = yes” strongly correlates with higher capacity factors), making ordinary least‑squares unstable; a pure Lasso model may over‑shrink and discard genuinely helpful variables.
A mixed (Elastic Net) regression, which blends Ridge's squared ℓ₂ penalty with Lasso's sparsity-inducing ℓ₁ penalty, offers a practical compromise: it tends to keep correlated but important variables while zeroing out trivial ones. We will train an Elastic Net model to forecast a solar project's construction cost (USD per MWh, used here as a proxy for capital cost) from public, pre-financial-close descriptors.
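For reference, the objective that scikit-learn's ElasticNet minimises can be written as below. The symbols are notational choices made here rather than taken from the article: n is the number of training rows, w the coefficient vector, α the overall penalty strength, and ρ the l1_ratio mix.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the Lasso term, while ρ = 0 leaves only the Ridge-style term; values in between trade sparsity against stability.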
Libraries Required
| Purpose | Library |
| Data wrangling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| ML pipeline | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Metrics | mean_squared_error, r2_score |
Dataset Link
LCOE & Capital Cost of Electricity Generation
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
2. Load dataset
The LCOE file collects construction, refurbishment, and O&M costs for dozens of solar plants worldwide, plus year, plant sub‑type (fixed‑tilt PV, tracking PV, CSP), and country.
df = pd.read_csv("LCOE_Capital_Costs.csv") # filename inside Kaggle notebook repo
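# Keep only solar rows for a "solar farm"-specific model; na=False treats missing plant types as non-matches.
# This pattern may also match non-PV solar rows such as CSP; tighten it if a PV-only model is needed.
df = df[df['Plant type'].str.contains('Solar', case=False, na=False)]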
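Because the exact column names in the CSV may differ from those used in this walkthrough, a quick check before feature engineering can save a confusing KeyError later. The sketch below is a minimal, hypothetical helper (`missing_columns` and `REQUIRED_COLUMNS` are names introduced here for illustration), run against a made-up one-row frame so it works without the file.

```python
import pandas as pd

# Column names assumed by this walkthrough; adjust them to match the actual CSV header.
REQUIRED_COLUMNS = [
    'Year', 'Country', 'Plant category', 'Plant type',
    'Construction costs (USD/MWh)', 'Capacity factor (%)',
    'Refurbishment costs (USD/MWh)',
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]

# Tiny stand-in frame (made-up values) so the sketch runs without the CSV.
demo = pd.DataFrame({'Year': [2015], 'Country': ['Exampleland'],
                     'Plant type': ['Solar PV']})
print(missing_columns(demo))
```

Calling `missing_columns(df)` on the real frame should return an empty list before moving on to the next step.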
3. Feature engineering
We use Construction costs (USD/MWh) as a capital-expenditure proxy. (To express it in USD/kW, multiply by the lifetime energy generated per kW of capacity, in MWh/kW; the modelling mechanics stay identical.)
# Example available columns (check file for exact names)
keep = df[['Year', 'Country', 'Plant category', 'Plant type',
           'Construction costs (USD/MWh)',
           'Capacity factor (%)',
           'Refurbishment costs (USD/MWh)']].dropna()
# Target: construction CAPEX proxy (USD/MWh) → converts to USD/kW with CF + hours if desired
y = keep['Construction costs (USD/MWh)']
X = keep.drop(columns=['Construction costs (USD/MWh)'])
cat_cols = ['Country', 'Plant category', 'Plant type']
num_cols = ['Year', 'Capacity factor (%)', 'Refurbishment costs (USD/MWh)']
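The USD/MWh to USD/kW conversion mentioned above can be sketched as follows. This is a rough, undiscounted approximation: the function name, the 25-year lifetime, and the example inputs are assumptions made for illustration, and levelised figures in the dataset may embed discounting that this simple product ignores.

```python
HOURS_PER_YEAR = 8760

def usd_per_mwh_to_usd_per_kw(cost_usd_per_mwh, capacity_factor_pct, lifetime_years):
    """Approximate, undiscounted conversion from a per-MWh cost to a per-kW cost."""
    # Lifetime energy produced by 1 kW of capacity, expressed in MWh.
    lifetime_mwh_per_kw = (capacity_factor_pct / 100) * HOURS_PER_YEAR * lifetime_years / 1000
    return cost_usd_per_mwh * lifetime_mwh_per_kw

# Hypothetical plant: 50 USD/MWh construction cost, 20% capacity factor, 25-year life.
example = usd_per_mwh_to_usd_per_kw(50.0, 20.0, 25)
print(f"{example:,.0f} USD/kW")
```

With these made-up inputs, 1 kW produces 43.8 MWh over its life, so the result is about 2,190 USD/kW.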
4. Build an Elastic Net pipeline
Categorical variables are expanded into one-hot vectors; numeric variables are standardized (z-scored) so the penalty treats them on a comparable scale. Wrapping everything inside the Pipeline means the encoder and scaler are refit on each training fold only, which avoids data leakage during CV.
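preprocess = ColumnTransformer([
    # sparse_output replaces the older sparse argument (scikit-learn >= 1.2);
    # handle_unknown='ignore' keeps CV folds from failing on categories unseen in training
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False), cat_cols),
    ('num', StandardScaler(), num_cols)
])
pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=15000, random_state=42))
])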
5. Train/test split + hyper‑parameter grid search
- α controls overall penalty strength: higher α ⇒ more shrinkage.
- l1_ratio (0 → Ridge, 1 → Lasso) controls sparsity vs. stability.
A grid of 18 α values × 9 mix ratios (162 models) is searched via 5‑fold CV; the lowest RMSE combination wins.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
param_grid = {'enet__alpha': np.logspace(-3, 1, 18),
              'enet__l1_ratio': np.linspace(0.1, 0.9, 9)}  # Ridge-heavy -> Lasso-heavy
gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)
print("Best α:", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
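The size of the search described above can be checked without fitting anything. This small sketch rebuilds the same grid and counts the candidates and the total number of cross-validation fits (the final refit on the full training set is one more fit on top of that).

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the walkthrough.
param_grid = {'enet__alpha': np.logspace(-3, 1, 18),
              'enet__l1_ratio': np.linspace(0.1, 0.9, 9)}

n_candidates = len(ParameterGrid(param_grid))  # 18 alphas x 9 mix ratios
n_folds = 5
n_fits = n_candidates * n_folds                # fits performed during cross-validation

print(n_candidates, n_fits)
print(param_grid['enet__alpha'][[0, -1]])      # smallest and largest alpha in the grid
```

The grid spans α from 0.001 to 10 on a log scale, giving 162 candidates and 810 cross-validation fits.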
6. Evaluate model
RMSE in USD/MWh provides an intuitive measure of accuracy; R² shows the share of variance explained on unseen records.
y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # avoids squared=False, which newer scikit-learn releases dropped
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.0f} per MWh | R²: {r2:.3f}")
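To make the two metrics concrete, here is a minimal NumPy sketch that computes them by hand. The helper name `rmse_and_r2` and the four example values are made up for illustration and are not results from the dataset.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """Compute RMSE and R² directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))          # same units as the target
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return rmse, 1.0 - ss_res / ss_tot

# Made-up costs in USD/MWh and made-up predictions.
demo_rmse, demo_r2 = rmse_and_r2([40, 50, 60, 70], [42, 48, 63, 67])
print(f"RMSE: {demo_rmse:.2f} USD/MWh | R²: {demo_r2:.3f}")
```

For these toy numbers the squared residuals sum to 26 against a total variation of 500, giving an RMSE of about 2.55 USD/MWh and an R² of 0.948.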
7. Interpret coefficients
Coefficients can reveal, for example, that tracking systems raise costs by about $35/MWh, while every one-percentage-point increase in capacity factor (i.e., a sunnier site or trackers) reduces construction cost per MWh because lifetime generation rises. Dummy variables shrunk to zero indicate countries whose costs are indistinguishable from the baseline category after controlling for year and technology choices.
# Recover complete feature names
ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
# Reverse scaling for numeric columns (copy first so the fitted model's coef_ is not modified in place)
scale = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
coef = gs.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scale
imp = pd.Series(coef, index=feature_names).sort_values(key=abs, ascending=False)
plt.figure(figsize=(9,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net Coefficients – Solar CAPEX Drivers')
plt.xlabel('Δ USD per MWh')
plt.tight_layout()
plt.show()
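The rescaling step above relies on the fact that a coefficient fitted on a standardized column, divided by that column's standard deviation, gives the effect per original unit. The sketch below checks this on synthetic data. It deliberately swaps in unpenalised LinearRegression instead of ElasticNet so that the rescaled and directly fitted coefficients agree exactly; all data and variable names are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic numeric predictors on different scales (made-up data).
X_raw = rng.normal(loc=[10.0, 20.0], scale=[3.0, 5.0], size=(200, 2))
y_demo = 5.0 + 1.5 * X_raw[:, 0] - 0.8 * X_raw[:, 1] + rng.normal(scale=0.1, size=200)

# Fit on standardized inputs, as the pipeline does for numeric columns.
scaler = StandardScaler().fit(X_raw)
coef_scaled = LinearRegression().fit(scaler.transform(X_raw), y_demo).coef_

# Divide by each column's standard deviation to return to original units.
coef_raw_units = coef_scaled / scaler.scale_  # new array; coef_scaled is unchanged

# Fit directly on the raw inputs for comparison.
coef_direct = LinearRegression().fit(X_raw, y_demo).coef_
print(np.allclose(coef_raw_units, coef_direct))
```

With a penalised model the unit conversion is the same, but the fitted values themselves would differ from an unscaled fit because the penalty depends on feature scale.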
Summary
With a few dozen lines of Python, we produced an Elastic Net "mixed regression" model that:
- Forecasts solar-farm construction CAPEX from public pre-bid inputs, with accuracy reported as hold-out RMSE and R².
- Balances Ridge robustness and Lasso sparsity, handling multicollinearity while trimming noise.
- Explains itself via transparent dollar-impact coefficients useful to developers, lenders, and policy analysts.
Updating the model is trivial: drop a fresh CSV of global bid data into the notebook and call gs.fit—keeping your capital‑cost curves sharp as module prices and labour markets evolve.