Water Purification Cost Prediction using Linear Regression in ML


Operating a water‑purification (waste‑water‑to‑potable) plant is, at its core, a money‑management exercise: every extra kilowatt‑hour of aeration, every kilo of alum, and every cubic metre of influent that turns up unexpectedly changes the operating bill. Managers therefore need a fast way to forecast the next bill so they can lock in chemical deliveries, negotiate off‑peak power blocks, and explain budget variances to the city council.

In this guide, we build a simple linear‑regression model that predicts a plant’s monthly operating cost in USD from signals that most SCADA historians already capture:

  • influent flow (million gallons per day)
  • average mixed‑liquor suspended solids (MLSS)
  • influent chemical‑oxygen‑demand (COD)
  • blower/equipment energy use (kWh)
  • ambient temperature (°C)
  • day‑of‑week and calendar month

The open “Wastewater Treatment Plant Electricity Consumption” dataset contains monthly power use, flow, and actual cost figures, providing a real monetary target to learn from.

Libraries Required

  • pandas # tabular wrangling
  • numpy # numerical helpers
  • matplotlib.pyplot # quick diagnostic plots
  • scikit‑learn # preprocessing, model, metrics
  • joblib # persist trained pipeline

Dataset Link

Wastewater Treatment Plant Electricity Consumption

Step-by-Step Code Implementation

Why linear regression first? At most plants, the monthly bill is a roughly linear sum of power, chemicals, labour and a few fixed surcharges. Capturing the electricity share linearly already explains a significant portion of the variance; any nonlinear model must surpass this baseline.
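
To make the baseline idea concrete, here is a minimal sketch on synthetic numbers. The fixed charge, the $0.12 per kWh tariff, and the noise level are assumptions made up for illustration, not values from the dataset. It shows that ordinary least squares recovers the rate when the bill really is a fixed surcharge plus a per‑kWh charge.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monthly data: every number below is an illustrative assumption.
energy_kwh = rng.uniform(300_000, 500_000, size=60)    # 60 months of usage
fixed_usd, rate_usd_per_kwh = 4_000.0, 0.12            # assumed tariff
noise = rng.normal(0, 500, size=60)
cost_usd = fixed_usd + rate_usd_per_kwh * energy_kwh + noise

# Least-squares fit of cost = b0 + b1 * energy
A = np.column_stack([np.ones_like(energy_kwh), energy_kwh])
(b0_hat, b1_hat), *_ = np.linalg.lstsq(A, cost_usd, rcond=None)

print(f"estimated fixed charge: ${b0_hat:,.0f}")
print(f"estimated rate        : ${b1_hat:.4f} per kWh")
```

If a real bill deviates strongly from this shape (tiered tariffs, demand charges), the residuals of the linear fit are where that would show up first.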

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the data

df = pd.read_csv("wwtp_power_cost.csv")
print(df.head())

Expected columns

column        meaning
Date          yyyy‑mm‑dd (first of month)
Flow_MG       influent volume (million gallons)
Energy_kWh    total electricity consumed (kWh)
Cost_USD      power bill for the month (target)
COD_mgL       average influent COD (mg L⁻¹)
MLSS_mgL      average MLSS in the aeration basin (mg L⁻¹)
Temp_C        mean ambient air temperature (°C)
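
Before going further, it can help to confirm that a file actually has these columns. The sketch below is one way to do that; the helper name `missing_columns` and the two sample rows are hypothetical, standing in for the real CSV.

```python
import io
import pandas as pd

EXPECTED = ["Date", "Flow_MG", "Energy_kWh", "Cost_USD",
            "COD_mgL", "MLSS_mgL", "Temp_C"]

def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from the frame."""
    return [c for c in EXPECTED if c not in frame.columns]

# Two made-up rows standing in for the real CSV (values are placeholders).
sample_csv = io.StringIO(
    "Date,Flow_MG,Energy_kWh,Cost_USD,COD_mgL,MLSS_mgL,Temp_C\n"
    "2023-01-01,310.5,420000,51000,480,2900,4.2\n"
    "2023-02-01,295.0,398000,48500,455,3050,6.1\n"
)
sample = pd.read_csv(sample_csv, parse_dates=["Date"])

print(missing_columns(sample))                          # nothing missing
print(missing_columns(sample.drop(columns="Temp_C")))   # reports Temp_C
```

Passing `parse_dates=["Date"]` at load time is optional here, since the feature‑engineering step converts the column anyway.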

3. Feature engineering

  • Calendar dummies (month, dow) fold in seasonal aeration loads and weekend staffing rules without extra data feeds. With monthly records dated the first of the month, dow only records the weekday on which each month starts, so expect little signal from it.
  • Interpretability wins: if Energy_kWh carries the largest positive coefficient, managers instantly see why a blower‑upgrade ROI study makes sense; if high ambient temperature reduces cost (negative coefficient), it suggests less energy spent on heating.

df['Date']     = pd.to_datetime(df['Date'])
df['month']    = df['Date'].dt.month
df['dow']      = df['Date'].dt.dayofweek          # treat as categorical later

num_cols = ['Flow_MG', 'Energy_kWh', 'COD_mgL', 'MLSS_mgL', 'Temp_C']
cat_cols = ['month', 'dow']
target   = 'Cost_USD'

X = df[num_cols + cat_cols]
y = df[target]
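
A quick look at what the two calendar columns contain can be useful. The sketch below uses three hand‑picked dates (not dataset rows) to show the values that `dt.month` and `dt.dayofweek` produce for first‑of‑month records.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2023-01-01", "2023-02-01", "2023-03-01"]))
cal = pd.DataFrame({
    "Date":  dates,
    "month": dates.dt.month,        # 1..12
    "dow":   dates.dt.dayofweek,    # Monday=0 ... Sunday=6
})
print(cal)
```

Here January 2023 starts on a Sunday (dow 6) while February and March both start on a Wednesday (dow 2), which illustrates why dow is a weak feature when there is only one record per month.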

4. Pre‑processing & regression pipeline

preproc = ColumnTransformer([
        ('cats', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('nums', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline([
        ('prep',  preproc),
        ('model', linreg)
])

5. Train‑test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=42)

pipe.fit(X_train, y_train)
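
The split above shuffles rows, which is the simplest option. Because the records are ordered in time, one alternative is to hold out the most recent months so the test period always follows the training period. This is a design choice of the sketch below, not something the pipeline in this guide does, and the `demo` frame is a stand‑in with made‑up values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the real frame: 20 consecutive months, placeholder values.
demo = pd.DataFrame({
    "Date": pd.date_range("2022-01-01", periods=20, freq="MS"),
    "Energy_kWh": np.linspace(380_000, 450_000, 20),
    "Cost_USD": np.linspace(46_000, 54_000, 20),
}).sort_values("Date")

# shuffle=False keeps row order, so the last 20 % of months form the test set.
demo_train, demo_test = train_test_split(demo, test_size=0.2, shuffle=False)

print(demo_train["Date"].max(), "<", demo_test["Date"].min())
```

A chronological hold‑out tends to give a more conservative error estimate when costs drift over time, at the price of a test set that covers fewer seasons.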

6. Evaluation

Mean‑absolute‑error in dollars gives finance a concrete buffer (say ± $4 800 on a $52 k bill) when drafting next quarter’s budget.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f} per month")
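
For readers who want to see what the two metrics measure, the sketch below computes them by hand on four hypothetical monthly bills (made‑up numbers, not model output) and checks the result against scikit‑learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Four hypothetical monthly bills and predictions (USD), for illustration.
actual    = np.array([50_000, 52_000, 48_000, 55_000], dtype=float)
predicted = np.array([51_000, 50_000, 49_000, 54_000], dtype=float)

# MAE: average absolute miss, in dollars
mae_manual = np.mean(np.abs(actual - predicted))

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE: ${mae_manual:,.0f}")     # MAE: $1,250
print(f"R² : {r2_manual:.3f}")        # R² : 0.738
print("matches sklearn:",
      np.isclose(mae_manual, mean_absolute_error(actual, predicted)),
      np.isclose(r2_manual, r2_score(actual, predicted)))
```

MAE stays in dollars, which is why it is the easier number to hand to finance; R² is unitless and says how much of the month‑to‑month variation the model accounts for.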

7. Understanding cost drivers

Standard scaling on numeric inputs lets the coefficients read intelligibly (e.g., “an extra one σ of influent flow—about 45 MG—adds ≈ $7 200 to the bill”).

# recover full feature names after one‑hot encoding
ohe_names = (pipe.named_steps['prep']
                   .named_transformers_['cats']
                   .get_feature_names_out(cat_cols))

all_feats = list(ohe_names) + num_cols
coef_ser  = pd.Series(pipe.named_steps['model'].coef_, index=all_feats)\
               .sort_values()

print("\nCost‑reducing factors:")
print(coef_ser.head(5))
print("\nCost‑increasing factors:")
print(coef_ser.tail(5))

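Because numeric variables are z‑scored, each numeric coefficient is the USD change for a one‑standard‑deviation increase in that metric. The one‑hot coefficients are offsets for individual months or weekdays; because the encoder keeps every level rather than dropping one, they are not measured against a single reference level and are best compared with one another rather than read in isolation.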
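
A per‑σ coefficient can be converted back to original units by dividing by the scaler's fitted standard deviation. The sketch below demonstrates this on synthetic data; the $160 per MG slope and the pipeline name `demo_pipe` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic example: cost rises by an assumed $160 per MG of flow.
flow = rng.uniform(250, 400, size=48)
cost = 5_000 + 160.0 * flow + rng.normal(0, 300, size=48)
frame = pd.DataFrame({"Flow_MG": flow})

demo_pipe = Pipeline([("scale", StandardScaler()),
                      ("model", LinearRegression())])
demo_pipe.fit(frame, cost)

coef_per_sigma = demo_pipe.named_steps["model"].coef_[0]   # USD per 1 σ
sigma = demo_pipe.named_steps["scale"].scale_[0]           # σ in MG
coef_per_unit = coef_per_sigma / sigma                     # USD per MG

print(f"1 σ of flow = {sigma:.1f} MG -> ${coef_per_sigma:,.0f}")
print(f"per MG      = ${coef_per_unit:,.2f}")
```

In the pipeline from this guide, the fitted scaler sits inside the column transformer, so its standard deviations should be reachable as `pipe.named_steps['prep'].named_transformers_['nums'].scale_`, in the same order as `num_cols`.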

8. Persist the trained pipeline

joblib.dump(pipe, "wwtp_purification_cost_linreg.pkl")
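
Reloading works the same way in reverse. The sketch below round‑trips a tiny stand‑in model through a temporary file (the file name `demo_model.pkl` is hypothetical) and confirms that predictions are unchanged.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model; in practice this would be the fitted pipeline.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([10.0, 20.0, 30.0, 40.0])
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")   # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("predictions identical after reload:", same)
```

A pickled pipeline is best loaded with the same scikit‑learn version that saved it, and only from sources you trust, since unpickling can execute arbitrary code.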

Summary

In just a few dozen lines, we turned raw SCADA exports into an explainable purification‑cost forecaster. The linear model:

  • Delivers instant dollar estimates for next month’s operating bill.
  • Highlights high-impact levers (influent flow, COD spikes, and energy usage) that plant managers can actually control.

Keep this transparent baseline on the shelf; when you later bolt on chemical‑dosage data or gradient‑boosted trees, you’ll have a clear MAE target to beat and a set of plain‑English coefficients against which to judge “black‑box” improvements.


ProjectGurukul Team
