Water Treatment Cost Prediction using Stepwise Regression in ML


Water treatment plants face fluctuating operating costs driven by variations in raw‑water quality. Poor source water (high turbidity, organic load, or hardness) demands more chemicals and longer filtration cycles, which raises the cost per litre treated. In this project, we’ll predict the treatment cost per litre from nine raw‑water quality metrics: pH, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic Carbon, Trihalomethanes, and Turbidity.

By applying stepwise regression, we’ll isolate the key water‑quality drivers of cost and build a concise, interpretable linear model—enabling plant managers to forecast O&M budgets more accurately and proactively adjust treatment processes.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Ordinary Least Squares regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

Water Quality Metrics & Filter Performance Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load a public dataset containing nine raw‑water quality parameters and a measured treatment cost per litre. A first look with .head(), .info(), and .describe() confirms the column names, data types, and value ranges.

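# Block 1: Load dataset
# Kaggle download links typically require a logged-in session, so
# pd.read_csv usually cannot fetch this URL directly:
# https://www.kaggle.com/datasets/swekerr/water-quality-metrics-and-filter-performance-dataset/download
# Assumption: the CSV has been downloaded manually and saved locally.
csv_path = "water_quality.csv"  # placeholder name; use the actual downloaded file
df = pd.read_csv(csv_path)

# Inspect data
print(df.head())
df.info()              # prints its own report and returns None
print(df.describe())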
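
Because the later blocks depend on specific column names, a quick check can catch a mismatch before any modelling starts. The sketch below is a hypothetical helper (check_columns is not part of the tutorial's code) run on a tiny made-up frame. On the real data you would pass df instead.

```python
import pandas as pd

# Column names the later blocks rely on (taken from this tutorial's code).
EXPECTED_COLUMNS = [
    "pH", "Hardness", "Solids", "Chloramines", "Sulfate",
    "Conductivity", "Organic_carbon", "Trihalomethanes",
    "Turbidity", "Cost_per_Liter",
]

def check_columns(frame, expected=EXPECTED_COLUMNS):
    """Return (missing column names, NaN counts for the columns present)."""
    missing = [c for c in expected if c not in frame.columns]
    present = [c for c in expected if c in frame.columns]
    nan_counts = frame[present].isna().sum()
    return missing, nan_counts

# Tiny made-up frame standing in for the real CSV (values are illustrative only).
demo = pd.DataFrame({"pH": [7.1, None, 6.8], "Hardness": [180.0, 210.0, 195.0]})
missing, nan_counts = check_columns(demo)
print("Missing columns:", missing)
print(nan_counts)
```

If the list of missing columns is not empty, the dropna and feature-selection code that follows would raise a KeyError, so it is worth resolving naming differences first.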

Data Preprocessing

We drop any rows missing critical features or the cost target. We separate the nine quality metrics (X) and Cost_per_Liter (y), then perform an 80/20 train–test split.

# Block 2: Clean & prepare features
# Drop rows missing any key variable
df = df.dropna(subset=[
    'pH','Hardness','Solids','Chloramines','Sulfate',
    'Conductivity','Organic_carbon','Trihalomethanes',
    'Turbidity','Cost_per_Liter'
])

# Define predictors and target
X = df[[
    'pH','Hardness','Solids','Chloramines','Sulfate',
    'Conductivity','Organic_carbon','Trihalomethanes','Turbidity'
]]
y = df['Cost_per_Liter']

# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Stepwise Regression Function

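Our stepwise_selection function iteratively adds the excluded predictor with the lowest p-value, provided it is below 0.01 (forward inclusion), and removes the included predictor with the highest p-value, provided it is above 0.05 (backward elimination), repeating until no further changes occur. Keeping the entry threshold stricter than the removal threshold helps prevent a variable from being added and then immediately dropped. This yields a parsimonious set of significant water‑quality drivers.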

# Block 3: Forward–backward stepwise selection
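def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    """Return the predictors kept by forward-backward p-value selection."""
    # Copy the starting list; None means start from an empty model
    included = list(initial_list or [])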
    while True:
        changed = False

        # Forward step: evaluate adding each excluded predictor
        # Keep the original column order so runs are reproducible
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            pval = sm.OLS(y, sm.add_constant(X[included + [col]])).fit().pvalues[col]
            new_pvals[col] = pval
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:15} p-value {best_pval:.4f}")

        # Backward step: evaluate removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:15} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

Model Fitting: We fit an Ordinary Least Squares regression on the selected features via statsmodels. The .summary() output reports coefficient estimates (the change in cost per unit change in each metric, holding the others fixed), their p-values, R², adjusted R², and diagnostic statistics.
Evaluation: We predict on the held‑out test data and compute R² (share of variance explained) and RMSE (root‑mean‑square error, in the same units as the cost target) to quantify out-of-sample performance.

# Block 4: Select features
selected_features = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

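# Predict on test set
# has_constant='add' forces the intercept column even if a selected
# feature happens to be constant within the test split
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')
y_pred = model.predict(X_test_sel)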

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
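
For readers who want to see what the two metrics compute, the following sketch reproduces them by hand with NumPy on four made-up values. The arrays y_true and y_hat are illustrative only and are not results from the model above.

```python
import numpy as np

# Made-up observed and predicted values, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

errors = y_true - y_hat

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# RMSE = square root of the mean squared error
rmse = np.sqrt(np.mean(errors ** 2))

print("R²:", r2)
print("RMSE:", rmse)
```

R² is unitless, while RMSE is expressed in the units of the target, which makes it the easier of the two to relate to a budget figure.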

Residual Diagnostics

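We plot residuals versus predicted cost to check for non‑random patterns or heteroscedasticity. A patternless band of roughly constant width around zero is consistent with the linearity and constant-variance assumptions of linear regression.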

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per Liter")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
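
As a rough numeric companion to the plot, one can check whether the size of the residuals trends with the predicted value. The sketch below is a heuristic on synthetic data, not a formal test. The helper name spread_trend and the simulated arrays are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
pred = rng.uniform(1.0, 10.0, size=400)            # stand-in for y_pred
even = rng.normal(scale=1.0, size=400)             # constant spread
fanned = rng.normal(scale=1.0, size=400) * pred    # spread grows with prediction

def spread_trend(predicted, resid):
    """Pearson correlation between predicted values and |residuals|."""
    return np.corrcoef(predicted, np.abs(resid))[0, 1]

print("Constant spread:", spread_trend(pred, even))
print("Fanning spread: ", spread_trend(pred, fanned))
```

A value near zero is consistent with constant variance, while a clearly positive value suggests the residuals fan out as predicted cost rises. On the real model you would pass y_pred and residuals from Block 5.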

Summary

By applying stepwise regression to water treatment cost prediction, we isolate the most impactful cost drivers (likely candidates include turbidity, organic carbon, and chloramine levels) while pruning less informative variables.

The final linear model aims to balance interpretability (few, statistically significant predictors) with predictive strength (judged by test‑set R² and RMSE), giving water‑treatment managers a transparent forecasting tool to optimise chemical dosing, anticipate budget needs, and improve operational efficiency.
