Water Purification Cost Prediction using Linear Regression in ML


Operating a water‑purification (waste‑water‑to‑potable) plant is, at its core, a money‑management exercise: every extra kilowatt‑hour of aeration, every kilo of alum, and every cubic metre of influent that turns up unexpectedly changes the daily operating bill. Managers, therefore, need a fast way to forecast tomorrow’s purification cost so they can lock in chemical deliveries, negotiate off‑peak power blocks, and explain budget variances to the city council.

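In this guide, we build a simple linear‑regression model that predicts a plant’s operating cost in USD (per month, to match the dataset used below; the same recipe applies to daily records if your historian exports them) from signals that most SCADA historians already capture: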

influent flow (million gallons per day)
average mixed‑liquor suspended solids (MLSS)
influent chemical‑oxygen‑demand (COD)
blower/equipment energy use (kWh)
ambient temperature (°C)
day‑of‑week and calendar month

The open “Wastewater Treatment Plant Electricity Consumption” dataset contains monthly power use, flow, and actual cost figures, providing a real monetary target to learn from.

Libraries Required

pandas # tabular wrangling
numpy # numerical helpers
matplotlib.pyplot # quick diagnostic plots
scikit‑learn # preprocessing, model, metrics
joblib # persist trained pipeline

Dataset Link

Wastewater Treatment Plant Electricity Consumption

Step-by-Step Code Implementation

Why linear regression first? At most plants, the monthly bill is a roughly linear sum of power, chemicals, labour and a few fixed surcharges. Capturing the electricity share linearly already explains a significant portion of the variance; any nonlinear model must surpass this baseline.

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the data

df = pd.read_csv("wwtp_power_cost.csv")
print(df.head())

Expected columns

column	meaning
Date	yyyy‑mm‑dd (first of month)
Flow_MG	influent volume (million gallons)
Energy_kWh	total electricity consumed
Cost_USD	power bill for the month (target)
COD_mgL	average influent COD (mg L⁻¹)
MLSS_mgL	average MLSS in the aeration basin
Temp_C	mean ambient air temperature
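
SCADA exports are often renamed between plants, so it can help to confirm that the columns in the table above are present before engineering features. The following is a minimal sketch of such a check; the helper name missing_columns and the one-row demo frame (with made-up values) are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Column names taken from the "Expected columns" table above
EXPECTED_COLS = ["Date", "Flow_MG", "Energy_kWh", "Cost_USD",
                 "COD_mgL", "MLSS_mgL", "Temp_C"]


def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from `frame`."""
    return [c for c in EXPECTED_COLS if c not in frame.columns]


# Hand-made single-row frame (made-up values) that deliberately lacks Temp_C
demo = pd.DataFrame({
    "Date": ["2021-01-01"],
    "Flow_MG": [310.0],
    "Energy_kWh": [420_000.0],
    "Cost_USD": [52_000.0],
    "COD_mgL": [450.0],
    "MLSS_mgL": [3_000.0],
})

missing = missing_columns(demo)
print("Missing columns:", missing)
```

On the real file you would call missing_columns(df) right after pd.read_csv and stop early if the returned list is not empty.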

3. Feature engineering

Calendar dummies (month, dow) fold in seasonal aeration loads and weekend staffing rules without extra data feeds. With monthly rows dated the first of the month, dow only records the weekday on which the month began, so it carries real signal mainly when you train on daily records.
Interpretability wins: if Energy_kWh carries the largest positive coefficient, managers instantly see why a blower‑upgrade ROI study makes sense; if higher ambient temperature lowers cost (negative coefficient), it suggests less energy is spent on heating.

df['Date']     = pd.to_datetime(df['Date'])
df['month']    = df['Date'].dt.month
df['dow']      = df['Date'].dt.dayofweek          # treat as categorical later

num_cols = ['Flow_MG', 'Energy_kWh', 'COD_mgL', 'MLSS_mgL', 'Temp_C']
cat_cols = ['month', 'dow']
target   = 'Cost_USD'

X = df[num_cols + cat_cols]
y = df[target]

4. Pre‑processing & regression pipeline

preproc = ColumnTransformer([
        ('cats', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('nums', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline([
        ('prep',  preproc),
        ('model', linreg)
])

5. Train‑test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=42)

pipe.fit(X_train, y_train)
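
Because the rows are monthly observations, a shuffled split lets the model train on months that come after some of the test months. If you want the test score to mimic genuine forecasting, one alternative is a chronological hold-out. The sketch below shows the idea on synthetic data (the demo frame, its values, and the 0.1 USD/kWh tariff are invented for illustration), using shuffle=False so that the last 20 % of rows form the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for five years of monthly records (made-up values)
rng = np.random.default_rng(42)
dates = pd.date_range("2019-01-01", periods=60, freq="MS")
demo = pd.DataFrame({
    "Date": dates,
    "Energy_kWh": rng.uniform(300_000, 500_000, len(dates)),
})
demo["Cost_USD"] = 0.1 * demo["Energy_kWh"] + rng.normal(0, 1_000, len(dates))

# Sort by date first; shuffle=False then keeps the earliest rows for training
demo = demo.sort_values("Date")
X_demo = demo[["Date", "Energy_kWh"]]
y_demo = demo["Cost_USD"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False)

print("train ends :", X_tr["Date"].max().date())
print("test starts:", X_te["Date"].min().date())
```

Applied to the article's X and y, the only change would be sorting df by Date before building X and passing shuffle=False to train_test_split.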

6. Evaluation

Mean‑absolute‑error in dollars gives finance a concrete buffer (say ± $4 800 on a $52 k bill) when drafting next quarter’s budget.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f} per month")
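
To turn the MAE into the kind of budget buffer described above, it can be expressed as a share of the typical bill. The numbers below are made up to reproduce the ± $4,800 on $52 k illustration; they are not results from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up monthly bills and predictions, chosen so every error is $4,800
bills_true = np.array([50_000.0, 54_000.0, 52_000.0])
bills_pred = np.array([54_800.0, 49_200.0, 56_800.0])

mae = mean_absolute_error(bills_true, bills_pred)
mae_pct = 100 * mae / bills_true.mean()   # MAE as a percentage of the mean bill

print(f"MAE: ${mae:,.0f}  ({mae_pct:.1f} % of the average bill)")
```

With the article's variables, the same two lines would use y_test and y_pred in place of the made-up arrays.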

7. Understanding cost drivers

Standard scaling on numeric inputs lets the coefficients read intelligibly (e.g., with illustrative figures, “an extra one σ of influent flow, about 45 MG, adds ≈ $7 200 to the bill”).

# recover full feature names after one‑hot encoding
ohe_names = (pipe.named_steps['prep']
                   .named_transformers_['cats']
                   .get_feature_names_out(cat_cols))

all_feats = list(ohe_names) + num_cols
coef_ser  = pd.Series(pipe.named_steps['model'].coef_, index=all_feats)\
               .sort_values()

print("\nCost‑reducing factors:")
print(coef_ser.head(5))
print("\nCost‑increasing factors:")
print(coef_ser.tail(5))

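Because numeric variables are z‑scored, each numeric coefficient is the USD change for a one‑standard‑deviation increase in that metric. The one‑hot coefficients are per‑level offsets, but because OneHotEncoder keeps every level here, they are not measured against a reference level and are meaningful only relative to one another; pass drop='first' to OneHotEncoder if you want offsets relative to a reference level.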
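
Managers often want "dollars per unit" rather than "dollars per σ". Dividing each standardized coefficient by the standard deviation that StandardScaler stored in scale_ converts it back. The sketch below demonstrates this on synthetic, noise-free data with an invented tariff (120 USD per MG and 0.08 USD per kWh), so the recovered slopes can be compared with known values; the step names "scale" and "model" are local to this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic monthly records (made-up ranges)
rng = np.random.default_rng(0)
n = 48
demo = pd.DataFrame({
    "Flow_MG": rng.uniform(200, 400, n),
    "Energy_kWh": rng.uniform(300_000, 500_000, n),
})
# Invented, noise-free tariff so the true per-unit slopes are known
cost = 1_000 + 120 * demo["Flow_MG"] + 0.08 * demo["Energy_kWh"]

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
demo_pipe.fit(demo, cost)

# Coefficients are in USD per standard deviation of each input
per_sigma = pd.Series(demo_pipe.named_steps["model"].coef_, index=demo.columns)
# scale_ holds the standard deviation learned for each column
sigma = pd.Series(demo_pipe.named_steps["scale"].scale_, index=demo.columns)

per_unit = per_sigma / sigma   # USD per MG, USD per kWh
print(per_unit)
```

In the article's pipeline, the equivalent standard deviations would come from pipe.named_steps['prep'].named_transformers_['nums'].scale_, and the numeric coefficients are the last len(num_cols) entries of coef_, since the one-hot columns come first.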

8. Persist the trained pipeline

joblib.dump(pipe, "wwtp_purification_cost_linreg.pkl")
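
Reloading the saved pipeline and scoring a new month might look like the sketch below. To stay self-contained it fits a reduced pipeline on synthetic data and writes to a temporary directory; the frame hist, the file name demo_cost_linreg.pkl, and the forecast inputs are all made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic three-year history (made-up values and tariff)
rng = np.random.default_rng(1)
dates = pd.date_range("2020-01-01", periods=36, freq="MS")
hist = pd.DataFrame({
    "Flow_MG": rng.uniform(200, 400, len(dates)),
    "Energy_kWh": rng.uniform(300_000, 500_000, len(dates)),
    "month": dates.month,
})
hist["Cost_USD"] = (0.1 * hist["Energy_kWh"] + 20 * hist["Flow_MG"]
                    + rng.normal(0, 500, len(dates)))

features = ["Flow_MG", "Energy_kWh", "month"]
prep = ColumnTransformer([
    ("cats", OneHotEncoder(handle_unknown="ignore"), ["month"]),
    ("nums", StandardScaler(), ["Flow_MG", "Energy_kWh"]),
])
demo_pipe = Pipeline([("prep", prep), ("model", LinearRegression())])
demo_pipe.fit(hist[features], hist["Cost_USD"])

# Persist, then reload as a separate scoring job would
path = os.path.join(tempfile.mkdtemp(), "demo_cost_linreg.pkl")
joblib.dump(demo_pipe, path)
loaded = joblib.load(path)

# The new row must carry the same column names the pipeline was trained on
next_month = pd.DataFrame([{"Flow_MG": 310.0, "Energy_kWh": 420_000.0, "month": 7}])
forecast = loaded.predict(next_month)[0]
print(f"Forecast: ${forecast:,.0f}")
```

Because the preprocessing lives inside the pickled pipeline, the scoring side passes raw, unscaled values. Load such files only from sources you trust, and with the same scikit-learn version that wrote them where possible.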

Summary

In just a few dozen lines, we turned raw SCADA exports into an explainable purification‑cost forecaster. The linear model:

Delivers instant dollar estimates for tomorrow’s or next month’s operating bill.
Highlights high-impact levers (influent flow, COD spikes, and energy usage) that plant managers can act on.

Keep this transparent baseline on the shelf; when you later bolt on chemical‑dosage data or gradient‑boosted trees, you’ll have a clear MAE target to beat and a set of plain‑English coefficients against which to judge “black‑box” improvements.
