Water Treatment Plant Cost Prediction with Ridge Regression in ML
Operating a municipal water‑treatment plant is expensive: pumps and aerators draw megawatt‑hours of power, coagulants and disinfectants fluctuate in price, and stricter discharge permits push costs even higher. Plant managers need an early estimate of next month’s treatment cost so they can lock in chemical contracts, schedule maintenance on high‑energy equipment, and defend budgets to city councils.
We’ll develop a Ridge‑regression model that predicts a plant’s monthly operating cost (USD) from routinely logged SCADA data:
- influent flow (million gallons)
- blower/pump energy use (kWh)
- average air temperature (°C)
- chemical‑oxygen demand (COD mg L⁻¹)
- mixed‑liquor suspended solids (MLSS mg L⁻¹)
- calendar month (seasonality)
Ridge regression keeps the relationship linear, so finance can read every coefficient in dollars, while its L2 penalty keeps the coefficients from swinging wildly when flow, energy and COD move together.
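The stabilising effect of the L2 penalty can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the two drivers are invented, already standardised, and almost perfectly correlated, and the "true" contribution of each is set to 5 units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 60

# Hypothetical standardised drivers: energy tracks flow almost perfectly.
flow = rng.normal(size=n)
energy = flow + rng.normal(scale=0.01, size=n)
X = np.column_stack([flow, energy])

# Synthetic cost signal: each driver truly contributes 5 units.
y = 5 * flow + 5 * energy + rng.normal(scale=1.0, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients  :", ols.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))
```

With drivers this collinear, OLS has little information to decide how to split the joint effect between the two columns, so its individual coefficients tend to be erratic even though their sum stays near 10. Ridge pulls the pair toward similar values close to the true contribution.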
Libraries Required
- pandas # data loading and manipulation
- numpy # numeric utilities
- matplotlib.pyplot # optional quick plots
- scikit‑learn # preprocessing, RidgeCV, metrics
- joblib # save/reload the fitted pipeline
Dataset Link
Wastewater Treatment Plant Electricity Consumption
Step by Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the dataset
df = pd.read_csv("Wastewater_Treatment_Plant_Electricity_Consumption.csv")
print(df.head())
3. Quick clean‑up
# keep only rows with a complete target
df = df.dropna(subset=['Cost_USD']).copy()
# calendar features
df['Date'] = pd.to_datetime(df['Date'])
df['month'] = df['Date'].dt.month  # captures season‑based chemistry costs
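As a quick check of what the clean‑up does, here is a minimal sketch on three invented rows. The values are hypothetical; only the column names follow the ones used in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow the tutorial's assumed schema.
raw = pd.DataFrame({
    "Date": ["2023-01-31", "2023-02-28", "2023-03-31"],
    "Influent_MG": [42.0, 39.5, 47.1],
    "Cost_USD": [118000.0, np.nan, 121500.0],
})

# Drop rows without a target, then derive the calendar month.
clean = raw.dropna(subset=["Cost_USD"]).copy()
clean["Date"] = pd.to_datetime(clean["Date"])
clean["month"] = clean["Date"].dt.month

print(clean[["Date", "month", "Cost_USD"]])
```

The February row disappears because its cost is missing, and the remaining rows gain an integer month column (1 and 3).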
4. Separate feature lists
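Influent flow (0–100 MG) and energy (on the order of 100 000 kWh) sit in very different ranges. Ridge penalises the size of each coefficient, so without scaling the penalty would fall unevenly on features purely because of their units. The numeric columns are therefore standardised, while month is kept separate as a categorical feature.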
num_cols = ['Influent_MG', 'Energy_kWh',
'Avg_Temp_C', 'COD_mgL', 'MLSS_mgL']
cat_cols = ['month']
target = 'Cost_USD'
X = df[num_cols + cat_cols]
y = df[target]
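To see what standardising does to features on such different scales, consider this small sketch. The four readings are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical monthly readings: column 0 is flow (MG), column 1 is energy (kWh).
X_raw = np.array([
    [40.0,  95_000.0],
    [55.0, 102_000.0],
    [62.0, 110_000.0],
    [48.0,  99_000.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

print("raw std   :", X_raw.std(axis=0))
print("scaled std:", X_scaled.std(axis=0))
print("scale_    :", scaler.scale_)  # per-feature σ learned from the data
```

After scaling, both columns have zero mean and unit standard deviation, so the L2 penalty treats a one‑σ move in flow and a one‑σ move in energy on equal terms.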
5. Pre‑processing + Ridge pipeline
- RidgeCV searches a small grid of α (L2 penalty) values via 5‑fold cross‑validation and keeps the value with the lowest validation error. This balances under‑ and over‑fitting automatically.
- Seasonality affects chemical dosage and power (e.g., summer ventilation). One‑hot encoding gives each month its own intercept shift without assuming a linear month‑to‑month trend. With drop='first', January becomes the baseline against which the other months are measured.
preprocess = ColumnTransformer([
('cat', OneHotEncoder(drop='first'), cat_cols),
('num', StandardScaler(), num_cols)
])
ridge = RidgeCV(alphas=[0.1, 1, 10, 50, 100], cv=5)
model = Pipeline([
('prep', preprocess),
('ridge', ridge)
])
6. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, shuffle=True, random_state=42)
model.fit(X_train, y_train)
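If the CSV is not at hand, the same pipeline can be exercised end to end on synthetic data. The sketch below is only a smoke test under stated assumptions: the data frame, the cost formula and every number in it are invented, and the names demo and demo_model are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 120  # ten years of hypothetical monthly records

demo = pd.DataFrame({
    "Influent_MG": rng.uniform(30, 70, n),
    "Avg_Temp_C": rng.uniform(0, 30, n),
    "COD_mgL": rng.uniform(300, 600, n),
    "MLSS_mgL": rng.uniform(2000, 4000, n),
    "month": np.tile(np.arange(1, 13), n // 12),
})
# Energy deliberately tracks flow to mimic correlated drivers.
demo["Energy_kWh"] = 2000 * demo["Influent_MG"] + rng.normal(0, 3000, n)
# Invented cost formula, for illustration only.
demo["Cost_USD"] = (
    20_000
    + 0.9 * demo["Energy_kWh"]
    + 500 * demo["Influent_MG"]
    + 40 * demo["COD_mgL"]
    + rng.normal(0, 5000, n)
)

num_cols = ["Influent_MG", "Energy_kWh", "Avg_Temp_C", "COD_mgL", "MLSS_mgL"]
cat_cols = ["month"]
alphas = [0.1, 1, 10, 50, 100]

demo_model = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])),
    ("ridge", RidgeCV(alphas=alphas, cv=5)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[num_cols + cat_cols], demo["Cost_USD"],
    test_size=0.2, shuffle=True, random_state=42)

demo_model.fit(X_tr, y_tr)
preds = demo_model.predict(X_te)
chosen = demo_model.named_steps["ridge"].alpha_

print("alpha chosen by CV:", chosen)
print("test rows         :", len(X_te))
```

Because the scaler and encoder live inside the pipeline, they are fitted on the training rows only, and the held‑out rows are transformed with those training statistics.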
7. Evaluation
pred = model.predict(X_test)
print(f"α chosen by CV : {model.named_steps['ridge'].alpha_}")
print(f"R² (test set) : {r2_score(y_test, pred):.3f}")
print(f"MAE (test set) : ${mean_absolute_error(y_test, pred):,.0f}")
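Both metrics are simple to compute by hand, which helps when explaining them to non‑technical stakeholders. The sketch below uses four invented cost figures and checks the manual results against scikit‑learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted monthly costs in USD.
y_true = np.array([100_000.0, 120_000.0, 110_000.0, 130_000.0])
y_hat = np.array([98_000.0, 125_000.0, 108_000.0, 127_000.0])

# MAE: average size of the miss, in dollars.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: share of the cost variance the predictions explain.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual: ${mae_manual:,.0f}   sklearn: ${mean_absolute_error(y_true, y_hat):,.0f}")
print(f"R² manual : {r2_manual:.3f}   sklearn: {r2_score(y_true, y_hat):.3f}")
```

Here the misses are $2,000, $5,000, $2,000 and $3,000, giving an MAE of $3,000 and an R² of 0.916.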
8. Inspect cost drivers
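Coefficients remain in dollars because the target is left unscaled. Numeric coefficients are expressed per standard deviation (σ) of the feature: for example, a coefficient of +$8,700 on Energy_kWh would mean that a one‑σ rise in energy use adds about $8,700 to the predicted monthly cost, which quantifies how sensitive cost is to power spikes. Month coefficients are offsets from January, the dropped baseline, so a weight of −$3,500 on month_11 would indicate that November typically costs $3,500 less than January.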
# get one‑hot names then combine with numeric
ohe = model.named_steps['prep'].named_transformers_['cat']
ohe_cols = ohe.get_feature_names_out(cat_cols)
features = np.concatenate([ohe_cols, num_cols])
coefs = pd.Series(model.named_steps['ridge'].coef_, index=features)\
.sort_values()
print("\nCost‑reducing factors:")
print(coefs.head(6))
print("\nCost‑increasing factors:")
print(coefs.tail(6))
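A per‑σ coefficient can be converted to dollars per original unit by dividing by the scaler's learned standard deviation. The following is a minimal sketch with a single invented feature; the readings and the 0.9 USD per kWh relationship are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical energy readings and a synthetic cost of about 0.9 USD per kWh.
energy_kwh = rng.normal(100_000, 8_000, size=200)
cost_usd = 20_000 + 0.9 * energy_kwh + rng.normal(0, 500, size=200)

pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])
pipe.fit(energy_kwh.reshape(-1, 1), cost_usd)

per_sigma = pipe.named_steps["ridge"].coef_[0]  # dollars per one-σ change
sigma = pipe.named_steps["scale"].scale_[0]     # σ of energy, in kWh
per_kwh = per_sigma / sigma                     # dollars per kWh

print(f"per σ  : ${per_sigma:,.0f}")
print(f"per kWh: ${per_kwh:.3f}")
```

In the tutorial's pipeline the same idea would apply through model.named_steps['prep'].named_transformers_['num'].scale_, whose entries line up with num_cols.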
9. Persist for production use
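Flow, COD and energy often rise together; under OLS this collinearity can inflate their coefficients in opposite directions. Ridge gently shrinks them, improving stability and out‑of‑sample accuracy. Saving the whole pipeline keeps the fitted scaler and encoder together with the model, so new data is transformed the same way at prediction time.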
joblib.dump(model, "ridge_wtp_cost_model.pkl")
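Reloading works the same way in reverse. This sketch round‑trips a small stand‑in pipeline through a temporary file and confirms the predictions are unchanged. The data and the file name demo_model.pkl are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Small synthetic regression problem standing in for the plant data.
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=50)

fitted = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
]).fit(X_demo, y_demo)

# Save to a temporary directory, then load the pipeline back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")
    joblib.dump(fitted, path)
    reloaded = joblib.load(path)

same = np.allclose(fitted.predict(X_demo), reloaded.predict(X_demo))
print("Predictions identical after reload:", same)
```

Pickled pipelines are generally best loaded with the same scikit‑learn version that created them, and only from sources you trust.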
Summary
With a short, fully transparent workflow, we turned open SCADA and finance data into an actionable cost‑prediction tool:
- Immediate value: finance teams see next‑month operating costs days before invoices arrive.
- Clear levers: every driver—flow, COD, energy, season—has a quantified dollar impact.
- Robust baseline: any future gradient‑boosted or neural approach must beat this Ridge model’s mean‑absolute error and still explain costs in language the city treasurer understands.