Water Treatment Cost Prediction using Stepwise Regression in ML

Water treatment plants face fluctuating operating costs driven by variations in raw‑water quality. Poor source water (high turbidity, organic load, hardness) demands more chemicals and longer filtration cycles, which raises the cost per liter treated. In this project, we’ll predict the treatment cost per liter from nine raw‑water quality metrics (pH, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic Carbon, Trihalomethanes, Turbidity).

By applying stepwise regression, we’ll isolate the key water‑quality drivers of cost and build a concise, interpretable linear model that helps plant managers forecast operations and maintenance (O&M) budgets more accurately and proactively adjust treatment processes.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Ordinary Least Squares regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization 

Dataset

Water Quality Metrics & Filter Performance Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load a public dataset containing nine raw‑water quality parameters and a measured treatment cost per liter. Initial .info() and .describe() confirm data types and ranges.

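# Block 1: Load dataset
# The Kaggle download link requires a logged-in session, so pd.read_csv
# cannot read it directly. Download the CSV from the dataset page first:
# https://www.kaggle.com/datasets/swekerr/water-quality-metrics-and-filter-performance-dataset
# Then point csv_path at your local copy (the filename below is a placeholder).
csv_path = "water_quality.csv"
df = pd.read_csv(csv_path)

# Inspect data
print(df.head())
df.info()              # info() prints its report itself and returns None
print(df.describe())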

Data Preprocessing

We drop any rows missing critical features or the cost target. We separate the nine quality metrics (X) and Cost_per_Liter (y), then perform an 80/20 train–test split.

# Block 2: Clean & prepare features
# Drop rows missing any key variable
df = df.dropna(subset=[
    'pH','Hardness','Solids','Chloramines','Sulfate',
    'Conductivity','Organic_carbon','Trihalomethanes',
    'Turbidity','Cost_per_Liter'
])

# Define predictors and target
X = df[[
    'pH','Hardness','Solids','Chloramines','Sulfate',
    'Conductivity','Organic_carbon','Trihalomethanes','Turbidity'
]]
y = df['Cost_per_Liter']

# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
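
To see what the cleaning and splitting steps do to row counts, here is a minimal sketch on synthetic data. The toy DataFrame and its random values are made up for illustration; only the column names Turbidity and Cost_per_Liter come from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (random values, not real measurements)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Turbidity": rng.normal(4.0, 0.8, size=110),
    "Cost_per_Liter": rng.normal(0.5, 0.1, size=110),
})

# Blank out 5 predictor values and 5 target values in different rows
toy.loc[0:4, "Turbidity"] = np.nan
toy.loc[5:9, "Cost_per_Liter"] = np.nan

# Count missing values per column before dropping anything
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop incomplete rows, then split 80/20 as in Block 2
clean = toy.dropna(subset=["Turbidity", "Cost_per_Liter"])
X_toy = clean[["Turbidity"]]
y_toy = clean["Cost_per_Liter"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(len(clean), len(X_tr), len(X_te))
```

Checking the missing-value counts first shows how many rows the dropna call will cost you. If a large share of the data disappears, imputation may be worth considering instead.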

Stepwise Regression Function

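Our stepwise_selection function alternates two steps. In the forward step, it adds the excluded predictor with the lowest p-value, provided that value is below 0.01. In the backward step, it removes the included predictor with the highest p-value if that value is above 0.05. The loop repeats until neither step changes the model. Because the entry threshold (0.01) is stricter than the exit threshold (0.05), a predictor that has just been added is not dropped again in the same pass. This yields a parsimonious set of significant water‑quality drivers.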

# Block 3: Forward–backward stepwise selection
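def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting list; None means start with no predictors
    included = list(initial_list) if initial_list is not None else []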
    while True:
        changed = False

        # Forward step: evaluate adding each excluded predictor
        # Keep the original column order so tie-breaking is reproducible
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            pval = sm.OLS(y, sm.add_constant(X[included + [col]])).fit().pvalues[col]
            new_pvals[col] = pval
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:15} p-value {best_pval:.4f}")

        # Backward step: evaluate removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:15} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

  • Model Fitting: We fit an Ordinary Least Squares regression on the selected features via statsmodels. The .summary() output reports coefficient estimates (cost impact per unit change in each metric), their p-values, R², adjusted R², and diagnostic statistics.
  • Evaluation: We predict on the held‑out test data and compute R² (variance explained) and RMSE (root‑mean‑square error) to quantify model performance out of sample.

# Block 4: Select features
selected_features = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_features])
y_pred = model.predict(X_test_sel)

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
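
For readers who want to see what the two metrics compute, here is a small worked example in plain NumPy. The four observed and predicted values are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up observed and predicted costs
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# RMSE = square root of the mean squared residual
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(f"R² = {r2:.3f}, RMSE = {rmse:.3f}")  # prints: R² = 0.980, RMSE = 0.158
```

RMSE is expressed in the same units as the target, so on the real data it reads directly as a typical error in cost per liter.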

Residual Diagnostics

We plot residuals versus predicted cost to check for non‑random patterns or heteroscedasticity. A structureless band of points around zero is consistent with the key linear regression assumptions; a curve or funnel shape signals a problem.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per Liter")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()

Summary

By applying stepwise regression to water treatment cost prediction, we isolate the most impactful cost drivers while pruning less informative variables. Which metrics survive selection depends on the data; plausible candidates include turbidity, organic carbon, and chloramine levels.

The final linear model aims to balance interpretability (few, statistically significant predictors) with predictive strength (judged by test‑set R² and RMSE), giving water‑treatment managers a transparent forecasting tool to optimize chemical dosing, anticipate budget needs, and improve operational efficiency.

ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.
