Water Treatment Plant Cost Prediction with Ridge Regression in ML


Operating a municipal water‑treatment plant is expensive: pumps and aerators draw megawatt‑hours of power, coagulants and disinfectants fluctuate in price, and stricter discharge permits push costs even higher. Plant managers need an early estimate of next month’s treatment cost so they can lock in chemical contracts, schedule maintenance on high‑energy equipment, and defend budgets to city councils.

We’ll develop a Ridge‑regression model that predicts a plant’s monthly operating cost (USD) from routinely logged SCADA data:

influent flow (million gallons)
blower/pump energy use (kWh)
average air temperature (°C)
chemical‑oxygen demand (COD mg L⁻¹)
mixed‑liquor suspended solids (MLSS mg L⁻¹)
calendar month (seasonality)

Ridge regression keeps the relationship linear—finance can read every coefficient in dollars—while its L2 penalty prevents wild swings when flow, energy and COD move together.
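To make the "wild swings" concrete, here is a minimal sketch on synthetic data (not the plant dataset). The two features, flow and energy, are invented so that one is nearly a copy of the other; the true coefficients and the penalty strength alpha=1.0 are also assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 60

# Synthetic, strongly correlated features (assumed for illustration only)
flow = rng.normal(size=n)
energy = flow + rng.normal(scale=0.01, size=n)   # almost identical to flow
X = np.column_stack([flow, energy])

# Both features truly contribute +3 each, plus noise
y = 3.0 * flow + 3.0 * energy + rng.normal(scale=1.0, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients  :", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("OLS coefficient norm  :", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```

With features this collinear, OLS is free to split the shared effect between the two columns almost arbitrarily, whereas the L2 penalty pulls the Ridge coefficients toward a smaller, more balanced pair.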

Libraries Required

pandas # data loading and manipulation
numpy # numeric utilities
matplotlib.pyplot # optional quick plots
scikit‑learn # preprocessing, RidgeCV, metrics
joblib # save/reload the fitted pipeline

Dataset Link

Wastewater Treatment Plant Electricity Consumption

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the dataset

df = pd.read_csv("Wastewater_Treatment_Plant_Electricity_Consumption.csv")
print(df.head())

3. Quick clean‑up

# keep only rows with a complete target
df = df.dropna(subset=['Cost_USD']).copy()

# calendar features
df['Date']   = pd.to_datetime(df['Date'])
df['month']  = df['Date'].dt.month      # captures season‑based chemistry costs

4. Separate feature lists

Influent flow (0–100 MG) and energy (around 100,000 kWh) sit in very different ranges. Standardising the numeric columns puts them on a common scale, so Ridge's penalty does not treat one feature differently purely because of its units.

num_cols = ['Influent_MG', 'Energy_kWh',
            'Avg_Temp_C', 'COD_mgL', 'MLSS_mgL']
cat_cols = ['month']
target   = 'Cost_USD'

X = df[num_cols + cat_cols]
y = df[target]
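As a quick illustration of what standardising does to columns like these, the sketch below applies StandardScaler to a tiny made-up table. The four rows are invented values, not readings from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical readings, chosen only to show the difference in magnitude
demo = pd.DataFrame({
    "Influent_MG": [20.0, 35.0, 50.0, 65.0],
    "Energy_kWh":  [80_000.0, 95_000.0, 110_000.0, 125_000.0],
})

scaled = StandardScaler().fit_transform(demo)
scaled_df = pd.DataFrame(scaled, columns=demo.columns)

print(scaled_df.round(3))
print("column means:", np.round(scaled.mean(axis=0), 6))
print("column stds :", np.round(scaled.std(axis=0), 6))
```

After scaling, each column has mean 0 and standard deviation 1, so a coefficient on either feature is measured per standard deviation rather than per gallon or per kWh.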

5. Pre‑processing + Ridge pipeline

RidgeCV searches a small grid of α (L2 penalty) values via 5-fold cross-validation and keeps the value with the best cross-validated score, which balances under- and over-fitting automatically.
Seasonality affects chemical dosage and power (e.g., summer ventilation). One-hot encoding gives each month its own intercept shift without assuming a linear month-to-month trend.

preprocess = ColumnTransformer([
        ('cat', OneHotEncoder(drop='first'), cat_cols),
        ('num', StandardScaler(),           num_cols)
])

ridge = RidgeCV(alphas=[0.1, 1, 10, 50, 100], cv=5)

model = Pipeline([
        ('prep',  preprocess),
        ('ridge', ridge)
])

6. Train‑test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=42)

model.fit(X_train, y_train)

7. Evaluation

pred = model.predict(X_test)

print(f"α chosen by CV  : {model.named_steps['ridge'].alpha_}")
print(f"R² (test set)   : {r2_score(y_test, pred):.3f}")
print(f"MAE (test set)  : ${mean_absolute_error(y_test, pred):,.0f}")
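For readers who want to see what the two metrics measure, this sketch recomputes them by hand on four made-up cost values and checks the result against scikit-learn. The numbers are invented for the example and say nothing about the real model's accuracy.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Hypothetical actual and predicted monthly costs in USD
y_true = np.array([100_000.0, 120_000.0, 140_000.0, 160_000.0])
y_hat  = np.array([105_000.0, 115_000.0, 150_000.0, 160_000.0])

# MAE: average absolute miss, in dollars
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MAE manual: {mae_manual:,.0f}  sklearn: {mean_absolute_error(y_true, y_hat):,.0f}")
print(f"R²  manual: {r2_manual:.3f}  sklearn: {r2_score(y_true, y_hat):.3f}")
```

MAE is the easier number to communicate to a budget owner because it is expressed directly in dollars per month.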

8. Inspect cost drivers

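Coefficients remain in dollars, with one caveat: the numeric features were standardised, so their coefficients are in dollars per standard deviation (σ), while each month dummy is in dollars relative to the dropped baseline month (January, if it appears in the training data). For example, a coefficient of +$8,700 on Energy_kWh would mean that a one-σ rise in energy use adds about $8,700 to the predicted monthly cost, and a −$3,500 weight on month_11 would mean November typically runs about $3,500 cheaper than January.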

# get one‑hot names then combine with numeric
ohe      = model.named_steps['prep'].named_transformers_['cat']
ohe_cols = ohe.get_feature_names_out(cat_cols)
features = np.concatenate([ohe_cols, num_cols])

coefs = pd.Series(model.named_steps['ridge'].coef_, index=features)\
          .sort_values()

print("\nCost‑reducing factors:")
print(coefs.head(6))

print("\nCost‑increasing factors:")
print(coefs.tail(6))
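If finance prefers dollars per kWh rather than dollars per σ, a standardised coefficient can be divided by the scaler's fitted standard deviation. The sketch below shows the conversion on synthetic data; the tariff of $0.12 per kWh, the noise level and the step names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data: cost is roughly 0.12 USD per kWh plus noise (assumed)
energy_kwh = rng.uniform(80_000, 120_000, size=200)
cost_usd = 0.12 * energy_kwh + rng.normal(scale=200, size=200)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])
pipe.fit(energy_kwh.reshape(-1, 1), cost_usd)

coef_per_sigma = pipe.named_steps["ridge"].coef_[0]   # USD per 1 σ of energy
sigma = pipe.named_steps["scale"].scale_[0]           # σ of energy in kWh
coef_per_kwh = coef_per_sigma / sigma                 # USD per kWh

print(f"USD per σ  : {coef_per_sigma:,.0f}")
print(f"σ (kWh)    : {sigma:,.0f}")
print(f"USD per kWh: {coef_per_kwh:.4f}")
```

In the tutorial's pipeline the equivalent standard deviations would come from the 'num' transformer inside the 'prep' step, in the same order as num_cols.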

9. Persist for production use

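Flow, COD and energy often rise together, and OLS can respond by inflating their coefficients in opposite directions. Ridge gently shrinks them, improving stability and, typically, out-of-sample accuracy. Saving the whole pipeline with joblib keeps the fitted scaler, encoder and Ridge weights together, so new months can be scored without repeating the pre-processing by hand.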

joblib.dump(model, "ridge_wtp_cost_model.pkl")
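Reloading works the same way in reverse. The sketch below is a self-contained round trip on a toy pipeline, assuming made-up training rows and a hypothetical file name (demo_ridge_model.pkl) written to a temporary directory; it is not the tutorial's fitted model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data: [influent MG, energy kWh] -> cost USD (invented values)
X_demo = np.array([
    [20.0,  80_000.0],
    [35.0,  97_000.0],
    [50.0, 108_000.0],
    [65.0, 126_000.0],
])
y_demo = np.array([110_000.0, 125_000.0, 138_000.0, 155_000.0])

demo_model = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_ridge_model.pkl")  # hypothetical file name
    joblib.dump(demo_model, path)      # save scaler + model together
    reloaded = joblib.load(path)       # restore the full pipeline

# The reloaded pipeline should reproduce the original predictions
predictions_match = np.allclose(demo_model.predict(X_demo),
                                reloaded.predict(X_demo))
print("Predictions identical after reload:", predictions_match)
```

Because pickled pipelines depend on the library versions used to create them, it is generally safest to load the file with the same scikit-learn version that saved it.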

Summary

With a short, fully transparent workflow, we turned open SCADA and finance data into an actionable cost‑prediction tool:

Immediate value: finance teams see next‑month operating costs days before invoices arrive.
Clear levers: every driver—flow, COD, energy, season—has a quantified dollar impact.
Robust baseline: any future gradient‑boosted or neural approach must beat this Ridge model’s mean‑absolute error and still explain costs in language the city treasurer understands.
