Water Purification Cost Prediction using Linear Regression in ML
Operating a water‑purification (waste‑water‑to‑potable) plant is, at its core, a money‑management exercise: every extra kilowatt‑hour of aeration, every kilo of alum, and every cubic metre of influent that turns up unexpectedly changes the daily operating bill. Managers, therefore, need a fast way to forecast tomorrow’s purification cost so they can lock in chemical deliveries, negotiate off‑peak power blocks, and explain budget variances to the city council.
In this guide, we build a simple linear‑regression model that predicts a plant’s daily operating cost in USD from signals that most SCADA historians already capture:
- influent flow (million gallons per day)
- average mixed‑liquor suspended solids (MLSS)
- influent chemical‑oxygen‑demand (COD)
- blower/equipment energy use (kWh)
- ambient temperature (°C)
- day‑of‑week and calendar month
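The open “Wastewater Treatment Plant Electricity Consumption” dataset contains monthly power use, flow, and actual cost figures, providing a real monetary target to learn from. Because those records are monthly, the worked example below predicts the monthly bill; the same pipeline should carry over to daily records where a plant logs them.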
Libraries Required
- pandas # tabular wrangling
- numpy # numerical helpers
- matplotlib.pyplot # quick diagnostic plots
- scikit‑learn # preprocessing, model, metrics
- joblib # persist trained pipeline
Dataset Link
Wastewater Treatment Plant Electricity Consumption
Step-by-Step Code Implementation
Why linear regression first? At most plants, the monthly bill is a roughly linear sum of power, chemicals, labour and a few fixed surcharges. Capturing the electricity share linearly already explains a significant portion of the variance; any nonlinear model must surpass this baseline.
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
df = pd.read_csv("wwtp_power_cost.csv")
print(df.head())
Expected columns
| Column | Meaning |
| --- | --- |
| Date | yyyy‑mm‑dd (first of month) |
| Flow_MG | influent volume (million gallons) |
| Energy_kWh | total electricity consumed (kWh) |
| Cost_USD | power bill for the month in USD (target) |
| COD_mgL | average influent COD (mg L⁻¹) |
| MLSS_mgL | average MLSS in the aeration basin (mg L⁻¹) |
| Temp_C | mean ambient air temperature (°C) |
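Before modelling, it can help to confirm that the file really has the columns listed above and to see where values are missing. The following is a minimal sketch, not part of the original walkthrough: the helper `check_columns` and the tiny `demo` frame are hypothetical stand-ins for the real CSV.

```python
import pandas as pd

# Column names taken from the table above
EXPECTED = ["Date", "Flow_MG", "Energy_kWh", "Cost_USD",
            "COD_mgL", "MLSS_mgL", "Temp_C"]

def check_columns(df: pd.DataFrame) -> list:
    """Return human-readable problems; an empty list means the frame looks usable."""
    problems = []
    missing = [c for c in EXPECTED if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    present = [c for c in EXPECTED if c in df.columns]
    n_null = df[present].isna().sum()
    for col, n in n_null[n_null > 0].items():
        problems.append(f"{col}: {n} missing value(s)")
    return problems

# Made-up frame standing in for wwtp_power_cost.csv (Temp_C deliberately absent)
demo = pd.DataFrame({
    "Date": ["2023-01-01", "2023-02-01"],
    "Flow_MG": [410.0, None],
    "Energy_kWh": [520000, 498000],
    "Cost_USD": [52000, 49500],
    "COD_mgL": [430, 415],
    "MLSS_mgL": [3100, 3000],
})
print(check_columns(demo))
```

On the real file you would call `check_columns(df)` right after `pd.read_csv` and decide how to handle any gaps before fitting.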
3. Feature engineering
- Calendar dummies (month, dow) fold in seasonal aeration loads and weekend staffing rules without extra data feeds.
- Interpretability wins: if Energy_kWh carries the largest positive coefficient, managers instantly see why a blower‑upgrade ROI study makes sense; a negative coefficient on Temp_C would suggest that warmer months cost less, for example because less energy goes to heating.
df['Date'] = pd.to_datetime(df['Date'])
df['month'] = df['Date'].dt.month
df['dow'] = df['Date'].dt.dayofweek  # treat as categorical later
num_cols = ['Flow_MG', 'Energy_kWh', 'COD_mgL', 'MLSS_mgL', 'Temp_C']
cat_cols = ['month', 'dow']
target = 'Cost_USD'
X = df[num_cols + cat_cols]
y = df[target]
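It is worth looking at what the calendar features contain for first‑of‑month dates. The short sketch below uses three made‑up dates (not rows from the dataset) and shows that `dow` then only records the weekday on which each month began, so the day‑of‑week dummies are likely to be most informative once daily records are available.

```python
import pandas as pd

# Three illustrative first-of-month dates
dates = pd.to_datetime(pd.Series(["2023-01-01", "2023-02-01", "2023-03-01"]))

cal = pd.DataFrame({
    "Date": dates,
    "month": dates.dt.month,        # 1 = January ... 12 = December
    "dow": dates.dt.dayofweek,      # 0 = Monday ... 6 = Sunday
})
print(cal)
```

Here January 2023 starts on a Sunday (`dow` 6) while February and March start on a Wednesday (`dow` 2), which says nothing about weekend staffing within those months.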
4. Pre‑processing & regression pipeline
preproc = ColumnTransformer([
    ('cats', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('nums', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline([
    ('prep', preproc),
    ('model', linreg)
])
5. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
pipe.fit(X_train, y_train)
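The shuffled split above mixes earlier and later months between train and test. If the goal is to forecast future bills, a chronological hold‑out is a stricter check. The sketch below is an alternative, not the split used in this guide, and it runs on synthetic data: the tariff of $0.09 per kWh and all other numbers are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 48  # four years of made-up monthly records
dates = pd.date_range("2020-01-01", periods=n, freq="MS")
energy = rng.normal(500_000, 40_000, n)
cost = 5_000 + 0.09 * energy + rng.normal(0, 1_500, n)  # synthetic bill
demo = pd.DataFrame({"Date": dates, "Energy_kWh": energy, "Cost_USD": cost})

# Train on the earliest 80 % of months, test on the most recent 20 %
demo = demo.sort_values("Date")
cut = int(len(demo) * 0.8)
train, test = demo.iloc[:cut], demo.iloc[cut:]

model = LinearRegression().fit(train[["Energy_kWh"]], train["Cost_USD"])
mae = mean_absolute_error(test["Cost_USD"], model.predict(test[["Energy_kWh"]]))
print(f"hold-out months: {len(test)}, MAE: ${mae:,.0f}")
```

On real data the chronological MAE is often worse than the shuffled one; the gap gives a rough sense of how much the shuffled score flatters the model.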
6. Evaluation
Mean‑absolute‑error in dollars gives finance a concrete buffer (say ± $4 800 on a $52 k bill) when drafting next quarter’s budget.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f} per month")
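To make the two metrics concrete, here is a small worked example that computes them by hand and compares against scikit‑learn. The four bills and predictions are invented purely for the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Four made-up monthly bills and the model's predictions for them
y_true = np.array([50_000, 52_000, 48_000, 55_000], dtype=float)
y_hat = np.array([51_000, 50_000, 49_000, 53_000], dtype=float)

# MAE: average absolute miss in dollars
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE: ${mae_manual:,.0f}  R²: {r2_manual:.3f}")
print(mean_absolute_error(y_true, y_hat), r2_score(y_true, y_hat))
```

The absolute misses are $1,000, $2,000, $1,000 and $2,000, so the MAE is $1,500: the typical size of the error finance should plan around.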
7. Understanding cost drivers
Standard scaling on numeric inputs lets the coefficients read intelligibly (e.g., “an extra one σ of influent flow—about 45 MG—adds ≈ $7 200 to the bill”).
# recover full feature names after one‑hot encoding
ohe_names = (pipe.named_steps['prep']
             .named_transformers_['cats']
             .get_feature_names_out(cat_cols))
all_feats = list(ohe_names) + num_cols
coef_ser = pd.Series(pipe.named_steps['model'].coef_, index=all_feats)\
    .sort_values()
print("\nCost‑reducing factors:")
print(coef_ser.head(5))
print("\nCost‑increasing factors:")
print(coef_ser.tail(5))
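Because numeric variables are z‑scored, each numeric coefficient is the “USD change for a one‑standard‑deviation increase” in that metric. The one‑hot coefficients need more care: OneHotEncoder as configured above keeps every level rather than dropping a reference one, so the dummy columns are collinear with the intercept and only the differences between levels (say, July versus January) are meaningful. Passing drop='first' to the encoder (supported together with handle_unknown='ignore' in recent scikit‑learn versions) would make each coefficient an offset relative to the dropped level.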
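Managers often prefer "dollars per kWh" to "dollars per standard deviation". A per‑σ coefficient can be converted back to original units by dividing it by the standard deviation that the scaler learned. The sketch below demonstrates this on synthetic data; the assumed rates of $20 per MG and $0.09 per kWh are invented so the recovered values can be checked.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "Flow_MG": rng.normal(450, 45, 200),
    "Energy_kWh": rng.normal(500_000, 40_000, 200),
})
# Synthetic bill: $20 per MG of flow and $0.09 per kWh, plus noise
y_demo = (3_000 + 20 * X_demo["Flow_MG"] + 0.09 * X_demo["Energy_kWh"]
          + rng.normal(0, 500, 200))

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X_demo, y_demo)

# Coefficients as fitted: USD per one standard deviation of each input
per_sigma = pd.Series(demo_pipe.named_steps["model"].coef_, index=X_demo.columns)
# Divide by the scaler's learned standard deviations to get USD per original unit
per_unit = per_sigma / demo_pipe.named_steps["scale"].scale_

print(pd.DataFrame({"USD per 1 sigma": per_sigma, "USD per unit": per_unit}).round(3))
```

In the full pipeline from this guide the same division applies, using `pipe.named_steps['prep'].named_transformers_['nums'].scale_` for the numeric columns.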
8. Persist the trained pipeline
joblib.dump(pipe, "wwtp_purification_cost_linreg.pkl")
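Once saved, the pipeline can be reloaded in a separate script and asked for a forecast. The following sketch shows the round trip on a reduced, made‑up pipeline (one numeric column plus `month`) so that it runs without the CSV; the file name `demo_cost_linreg.pkl` and all figures are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data standing in for the real dataset
rng = np.random.default_rng(2)
dates = pd.date_range("2021-01-01", periods=36, freq="MS")
train_X = pd.DataFrame({
    "Energy_kWh": rng.normal(500_000, 40_000, 36),
    "month": dates.month.to_numpy(),
})
train_y = 5_000 + 0.09 * train_X["Energy_kWh"] + rng.normal(0, 1_000, 36)

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cats", OneHotEncoder(handle_unknown="ignore"), ["month"]),
        ("nums", StandardScaler(), ["Energy_kWh"]),
    ])),
    ("model", LinearRegression()),
]).fit(train_X, train_y)

# Persist to a temporary directory, then reload as a later job would
path = os.path.join(tempfile.mkdtemp(), "demo_cost_linreg.pkl")
joblib.dump(demo_pipe, path)
loaded = joblib.load(path)

# New inputs must use the same column names the pipeline was trained on
next_month = pd.DataFrame({"Energy_kWh": [510_000.0], "month": [4]})
forecast = loaded.predict(next_month)[0]
print(f"Forecast bill: ${forecast:,.0f}")
```

Pickled pipelines are best reloaded with the same scikit‑learn version that produced them, and only from sources you trust, since unpickling can execute arbitrary code.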
Summary
In just a few dozen lines, we turned raw SCADA exports into an explainable purification‑cost forecaster. The linear model:
- Delivers instant dollar estimates for tomorrow’s or next month’s operating bill.
- Highlights high-impact levers (influent flow, COD spikes, and energy usage) that plant managers can monitor and, in the case of energy, actually control.
Keep this transparent baseline on the shelf; when you later bolt on chemical‑dosage data or gradient‑boosted trees, you’ll have a clear MAE target to beat and a set of plain‑English coefficients against which to judge “black‑box” improvements.