Water Quality Treatment Cost Prediction using Stepwise Regression in ML

Water treatment facilities incur variable costs depending on the quality of incoming water: poorer-quality source water typically requires more intensive treatment (e.g., higher chemical dosing, longer filtration cycles), which raises the unit cost of treatment. In this project, we will predict the treatment cost per liter based on raw water quality metrics (pH, turbidity, conductivity, organic carbon, etc.) and filter-performance indicators.

By applying stepwise regression, we’ll isolate the most significant water‐quality drivers of treatment cost and build a concise, interpretable linear model—enabling plant managers to forecast O&M budgets more accurately and adjust operations proactively.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation  
import matplotlib.pyplot as plt   # Visualization

Dataset

Water Quality Metrics & Filter Performance Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load water‐quality metrics alongside the target Cost_per_Liter and inspect for data types, missing values, and summary statistics.

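# Block 1: Load dataset
# Kaggle downloads require a logged-in session, so pd.read_csv cannot read the
# download link directly. Download the CSV manually from the dataset page:
# https://www.kaggle.com/datasets/swekerr/water-quality-metrics-and-filter-performance-dataset
csv_path = "water_quality.csv"  # placeholder: use the name of the file you downloaded
df = pd.read_csv(csv_path)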

# Inspect data
print(df.head())
df.info()  # prints its report directly and returns None, so no print() wrapper
print(df.describe())

Data Preprocessing

Rows missing any key metric or cost value are dropped. We define X as the nine water-quality predictors and y as the treatment cost per liter, then split the data 80/20 into training and test sets, fixing random_state so the split is reproducible.

# Block 2: Clean & prepare
# Drop rows with missing values in key columns
df = df.dropna(subset=[
    'pH','Hardness','Solids','Chloramines','Sulfate',
    'Conductivity','Organic_carbon','Trihalomethanes',
    'Turbidity','Cost_per_Liter'
])

# Define predictors and target
X = df[[
    'pH','Hardness','Solids','Chloramines','Sulfate',
    'Conductivity','Organic_carbon','Trihalomethanes','Turbidity'
]]
y = df['Cost_per_Liter']

# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
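
To make the row-dropping step concrete, here is a minimal sketch on a made-up four-row table. The values and the Notes column are invented for illustration and do not come from the Kaggle dataset. The point is that dropna(subset=...) only removes rows missing one of the listed key columns; gaps in other columns are ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "pH": [7.1, np.nan, 6.8, 7.4],
    "Turbidity": [3.2, 4.1, 2.9, 3.7],
    "Cost_per_Liter": [0.012, 0.015, np.nan, 0.011],
    "Notes": [None, "ok", "ok", None],  # non-key column: missing values here are allowed
})

key_cols = ["pH", "Turbidity", "Cost_per_Liter"]

# Row 1 lacks pH and row 2 lacks the cost, so both are dropped;
# rows 0 and 3 stay even though their Notes entry is missing.
cleaned = toy.dropna(subset=key_cols)
print(len(toy), "->", len(cleaned))  # 4 -> 2
```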

Stepwise Regression Function

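Our stepwise_selection function alternates between forward inclusion (adding the excluded predictor with the lowest p-value, provided it is below 0.01) and backward elimination (removing the included predictor with the highest p-value, provided it is above 0.05) until no further changes occur, yielding a parsimonious set of cost drivers. Keeping the entry threshold below the removal threshold prevents a variable from being added and then dropped in the same pass.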

# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
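                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting list; a None default avoids a shared mutable default argument
    included = list(initial_list) if initial_list is not None else []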
    while True:
        changed = False

        # Forward step: test adding each excluded predictor
        excluded = [c for c in X.columns if c not in included]  # keeps column order, so ties resolve reproducibly
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:15} p-value {best_pval:.4f}")

        # Backward step: test removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:15} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

Model Fitting: Using the selected features, we fit an Ordinary Least Squares regression via statsmodels. The .summary() output details coefficient estimates (cost impact per unit change), p‑values, R², and model diagnostics (AIC, F‑statistic).
Evaluation: We predict on the test set and compute R² (variance explained) and RMSE (prediction error) to quantify how well the model generalises.

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

# Fit the final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set
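# has_constant="add" forces the intercept column even if a test feature happens to be constant
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")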
y_pred = model.predict(X_test_sel)

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
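
For reference, both metrics can be computed by hand, which makes their meaning explicit: R² is one minus the ratio of residual to total sum of squares, and RMSE is the square root of the mean squared residual. The helper name r2_and_rmse and the four sample values below are hypothetical and serve only as a minimal sketch.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R², RMSE) for two equal-length sequences of numbers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / y_true.size)
    return r2, rmse

# Made-up values: ss_res = 0.10, ss_tot = 5.0
r2, rmse = r2_and_rmse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(f"R² = {r2:.3f}, RMSE = {rmse:.3f}")  # R² = 0.980, RMSE = 0.158
```

RMSE is expressed in the same units as the target (here, cost per liter), so it can be read directly as a typical prediction error.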

Residual Diagnostics

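Plotting residuals against predicted costs helps reveal non-random patterns or heteroscedasticity (a residual spread that widens or narrows as the prediction changes), either of which would indicate that an OLS assumption is violated.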

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per Liter")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
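
A simple numeric companion to the plot is to compare the residual spread for low and high predictions. The sketch below does this on synthetic data whose noise grows with x by construction; the data, the straight-line fit via np.polyfit, and the half-and-half split are all assumptions made for illustration, not part of the project's pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 500)

# Synthetic data (assumption): noise standard deviation is proportional to x
y = 3.0 + 0.5 * x + rng.normal(scale=0.2 * x)

# Fit a straight line and compute residuals
slope, intercept = np.polyfit(x, y, deg=1)
pred = intercept + slope * x
resid = y - pred

# Sort residuals by predicted value and compare the spread in each half
order = np.argsort(pred)
low, high = np.array_split(resid[order], 2)
spread_low, spread_high = low.std(), high.std()
print(f"Residual std, lower half of predictions: {spread_low:.3f}")
print(f"Residual std, upper half of predictions: {spread_high:.3f}")
```

A clearly larger spread in one half corresponds to the funnel shape you would see in the residual plot. Roughly equal spreads are consistent with constant variance, though they do not prove it.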

Summary

Applying stepwise regression to water-quality data isolates the predictors that contribute most to treatment cost per liter. These may include turbidity, organic carbon, and chloramine levels, but the selected set depends on the data.

The final linear model keeps only a few statistically significant predictors, which makes it easy to interpret, while the test-set R² and RMSE show how much predictive accuracy that simplicity retains. Where those metrics are acceptable, the model gives water utilities a transparent forecasting tool to optimize treatment processes and budget more effectively.