Gas Consumption Trend Prediction with Polynomial Regression in ML
FREE Online Courses: Your Passport to Excellence - Start Now
Utility planners and energy‐market analysts need to forecast monthly natural gas consumption (MMcf) across sectors—residential, commercial, industrial, and electric power—using only exogenous drivers known one to two months in advance: past consumption, heating‐degree days (HDD), cooling‐degree days (CDD), average retail price, and calendar effects. Historical trends reveal nonlinear demand responses: for example, residential usage rises sharply with HDD up to a point before plateauing, while price elasticity interacts with weather extremes. A simple linear model underfits these curvatures, and a high‐degree polynomial without regularisation overfits noise. By applying Polynomial Regression with Ridge regularisation on engineered features, we capture smooth consumption‐weather‐price dynamics and deliver reliable, interpretable forecasts for capacity planning and rate‑case support.
Libraries Required
import pandas as pd # data handling import numpy as np # numerical operations import matplotlib.pyplot as plt # plotting import seaborn as sns # visualization from sklearn.model_selection import train_test_split, GridSearchCV from sklearn.preprocessing import StandardScaler, PolynomialFeatures from sklearn.linear_model import Ridge from sklearn.pipeline import Pipeline from sklearn.metrics import mean_squared_error, r2_score
Dataset
Step-by-Step Code Implementation
Load Data & Libraries
import pandas as pd
# Adjust path as needed
df = pd.read_csv("natural_gas_usage.csv", parse_dates=["Date"])
# Preview key columns
df.head()[["Date","Residential","Commercial","Industrial","Electric Power","Price"]]
Feature Engineering & EDA
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Rename for convenience
df = df.rename(columns={"Electric Power":"Power"})
# Compute calendar features
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
# Compute HDD and CDD proxies:
# assuming we have avg temperature column; if not, synthetic:
# Here, we simulate HDD and CDD from monthly avg temp if available.
# For illustration, assume df["Temp"] exists:
# df["HDD"] = np.maximum(65 - df["Temp"], 0) * 30
# df["CDD"] = np.maximum(df["Temp"] - 65, 0) * 30
# Plot residential usage vs Month to see seasonality
sns.lineplot(x="Month", y="Residential", data=df)
plt.title("Average Monthly Residential Gas Usage")
plt.xlabel("Month")
plt.ylabel("Consumption (MMcf)")
plt.show()
Define Features & Target
- Lag feature (Lag1) captures persistence: gas demand tends to follow last month’s level with adjustments.
- Seasonality is modelled using month dummies to capture recurring peaks and troughs.
- Price enters to capture elasticity: higher prices dampen consumption nonlinearly.
# Target: total consumption across end uses df["Total"] = df[["Residential","Commercial","Industrial","Power"]].sum(axis=1) # Features: lagged consumption, month dummies, price df["Lag1"] = df["Total"].shift(1) df = df.dropna(subset=["Lag1"]) # One‑hot encode Month month_dummies = pd.get_dummies(df["Month"], prefix="M", drop_first=True) X = pd.concat([df[["Lag1","Price"]], month_dummies], axis=1) y = df["Total"]
Build Polynomial Regression Pipeline
- StandardScaler z‑scores all inputs so the Ridge penalty weights features uniformly.
- PolynomialFeatures generates squares and cross‑terms (e.g., Lag1², Lag1×Price, Price², Lag1×M_12), capturing curvature and interactions.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipe = Pipeline([
("scale", StandardScaler()),
("poly", PolynomialFeatures(include_bias=False)),
("ridge", Ridge(random_state=42))
])
Train/Test Split & Hyperparameter Search
GridSearchCV tunes:
- polynomial degree (1–3) to balance bias and flexibility,
- regularisation strength α (10⁻³ to 10³) to control complexity,
- using 5‑fold CV optimising RMSE.
from sklearn.model_selection import train_test_split, GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, shuffle=False # time‑series split
)
param_grid = {
"poly__degree": [1, 2, 3],
"ridge__alpha": np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring="neg_root_mean_squared_error",
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
Evaluate Model
y_pred = gs.predict(X_test)
rmse = mean_squared_error(y_test, y_pred, squared=False)
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.1f} MMcf")
print(f"Test R² : {r2:.3f}")
Inspect Key Polynomial Coefficients
Coefficient inspection ranks the most influential polynomial features—e.g., Lag1², Lag1×Price, or seasonal cross‑terms—enabling interpretable insights into demand drivers.
# Retrieve feature names after expansion
poly = gs.best_estimator_.named_steps["poly"]
feat_names = poly.get_feature_names_out(input_features=X.columns)
coefs = gs.best_estimator_.named_steps["ridge"].coef_
import pandas as pd
imp = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False).head(10)
plt.figure(figsize=(8,5))
imp.plot(kind="barh")
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Gas Demand")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
Summary
This Polynomial Regression approach with Ridge regularisation:
- Accurately forecasts monthly natural gas demand (low RMSE, high R²) by capturing nonlinear weather, price, and seasonality effects.
- Balances model complexity via α‑tuning, safeguarding against overfitting in the presence of interaction terms.
- Yields interpretable drivers such as lagged consumption squared or lag×price interactions, offering tangible levers for operational planning and rate‑case justification.