Water Treatment Cost Prediction using Stepwise Regression in ML
Water treatment plants face fluctuating operating costs driven by variations in raw‑water quality: poor source water (high turbidity, organic load, hardness) demands more chemicals and longer filtration cycles, which raises the cost per liter treated. In this project, we’ll predict the treatment cost per liter from nine raw‑water quality metrics (pH, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic Carbon, Trihalomethanes, Turbidity).
By applying stepwise regression, we’ll isolate the key water‑quality drivers of cost and build a concise, interpretable linear model that plant managers can use to forecast O&M budgets more accurately and adjust treatment processes proactively.
Libraries Required
import pandas as pd                                           # Data loading & manipulation
import numpy as np                                            # Numerical operations
import statsmodels.api as sm                                  # Ordinary Least Squares regression
from sklearn.model_selection import train_test_split          # Train/test split
from sklearn.metrics import r2_score, mean_squared_error      # Evaluation metrics
import matplotlib.pyplot as plt                               # Visualization
Dataset
Water Quality Metrics & Filter Performance Dataset
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load a public dataset containing nine raw‑water quality parameters and a measured treatment cost per liter. Initial .info() and .describe() confirm data types and ranges.
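# Block 1: Load dataset
# pd.read_csv cannot read the Kaggle download link directly (Kaggle requires a login),
# so download the CSV from the dataset page first:
# https://www.kaggle.com/datasets/swekerr/water-quality-metrics-and-filter-performance-dataset
# The local filename below is an assumption; use the name of the file you downloaded.
csv_path = "water_quality.csv"
df = pd.read_csv(csv_path)
# Inspect data
print(df.head())
df.info()  # prints its own report and returns None, so no print() is needed
print(df.describe())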
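Before preprocessing, it can help to confirm that the downloaded file really contains the columns the later blocks rely on. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the one-row in-memory sample are hypothetical, and the expected names are simply those used in Block 2.

```python
import csv
import io

# Column names used later in this walkthrough (Block 2).
EXPECTED_COLUMNS = [
    "pH", "Hardness", "Solids", "Chloramines", "Sulfate",
    "Conductivity", "Organic_carbon", "Trihalomethanes",
    "Turbidity", "Cost_per_Liter",
]

def missing_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])   # [] if the file is empty
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]

# Hypothetical in-memory sample standing in for the downloaded file;
# it deliberately omits the cost column.
sample = io.StringIO(
    "pH,Hardness,Solids,Chloramines,Sulfate,Conductivity,"
    "Organic_carbon,Trihalomethanes,Turbidity\n"
    "7.1,204.9,20791.3,7.3,368.5,564.3,10.4,87.0,3.0\n"
)
print(missing_columns(sample))
```

With a real file, the same helper could be called as `missing_columns(open(csv_path, newline=""))`; an empty result means every expected column is present.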
Data Preprocessing
We drop any rows missing critical features or the cost target. We separate the nine quality metrics (X) and Cost_per_Liter (y), then perform an 80/20 train–test split.
# Block 2: Clean & prepare features
# Drop rows missing any key variable
df = df.dropna(subset=[
    'pH', 'Hardness', 'Solids', 'Chloramines', 'Sulfate',
    'Conductivity', 'Organic_carbon', 'Trihalomethanes',
    'Turbidity', 'Cost_per_Liter'
])
# Define predictors and target
X = df[[
    'pH', 'Hardness', 'Solids', 'Chloramines', 'Sulfate',
    'Conductivity', 'Organic_carbon', 'Trihalomethanes', 'Turbidity'
]]
y = df['Cost_per_Liter']
# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Stepwise Regression Function
Our stepwise_selection function iteratively adds the excluded predictor with the lowest p-value, provided it is below 0.01 (forward inclusion), and removes the included predictor with the highest p-value, provided it is above 0.05 (backward elimination), repeating until no further changes occur. Keeping the entry threshold below the removal threshold prevents a variable from being added and dropped in an endless cycle. This yields a parsimonious set of significant water‑quality drivers.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    """Return the predictors kept by p-value-based stepwise selection."""
    # None replaces the mutable [] default; the behavior is unchanged
    included = list(initial_list or [])
    while True:
        changed = False
        # Forward step: evaluate adding each excluded predictor
        # (iterating over X.columns keeps the order reproducible)
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            pval = sm.OLS(y, sm.add_constant(X[included + [col]])).fit().pvalues[col]
            new_pvals[col] = pval
        best_pval = new_pvals.min()  # NaN when nothing is excluded
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:15} p-value {best_pval:.4f}")
        # Backward step: evaluate removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()  # NaN when nothing is included
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:15} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
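The p-values that stepwise_selection compares against its thresholds come from a significance test on each OLS coefficient. As a rough illustration of where such a number comes from, the sketch below computes them with NumPy alone. It is not what statsmodels does internally: it swaps in a normal approximation for the t distribution (reasonable only for large samples), and the function name `ols_p_values` and the synthetic data are hypothetical.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Approximate two-sided p-values for OLS slopes (intercept added here).

    Assumption of this sketch: the t statistic is referred to a standard
    normal distribution instead of a t distribution.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])      # prepend intercept column
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y                  # coefficient estimates
    resid = y - design @ beta
    dof = n - design.shape[1]                      # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # residual variance estimate
    std_err = np.sqrt(sigma2 * np.diag(xtx_inv))   # coefficient standard errors
    t_stats = beta / std_err
    # Two-sided normal tail probability: 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
    p_vals = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return p_vals[1:]                              # drop the intercept

# Synthetic data: cost depends on `driver` but not on `unrelated`.
rng = np.random.default_rng(0)
n = 500
driver = rng.normal(size=n)
unrelated = rng.normal(size=n)
cost = 2.0 + 0.5 * driver + rng.normal(scale=0.1, size=n)

p_values = ols_p_values(np.column_stack([driver, unrelated]), cost)
print(p_values)  # first entry is essentially zero; second is typically large
```

A predictor like `driver` would pass the 0.01 entry threshold easily, while a column unrelated to cost would usually not, although by chance roughly one in a hundred such columns still would.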
Model Building & Evaluation
- Model Fitting: We fit an Ordinary Least Squares regression on the selected features via statsmodels. The .summary() output reports coefficient estimates (cost impact per unit change in each metric), their p-values, R², adjusted R², and diagnostic statistics.
- Evaluation: We predict on the held‑out test data and compute R² (variance explained) and RMSE (root‑mean‑square error) to quantify model performance out of sample.
# Block 4: Select features
selected_features = stepwise_selection(X_train, y_train)
# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_features])
y_pred = model.predict(X_test_sel)
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
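To make the two test metrics concrete, here is a small sketch that computes them by hand using the standard definitions, R² = 1 − SS_res / SS_tot and RMSE = sqrt(SS_res / n). The helper name `r2_and_rmse` and the four toy values are hypothetical and are not taken from the dataset.

```python
import math

def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for paired observed and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Residual sum of squares: error left after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse

# Toy values: three perfect predictions and one that is off by 1.0
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_pred_toy = [1.0, 2.0, 3.0, 5.0]
r2_toy, rmse_toy = r2_and_rmse(y_true_toy, y_pred_toy)
print(r2_toy, rmse_toy)
```

For these toy values SS_res is 1 and SS_tot is 5, so R² works out to 0.8 and RMSE to 0.5. RMSE is in the same units as the target (cost per liter), which makes it the easier of the two to relate to a budget.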
Residual Diagnostics
We plot residuals versus predicted cost to check for non‑random patterns or heteroscedasticity, validating key linear regression assumptions.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per Liter")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
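A residual plot is judged by eye, so a simple number alongside it can be useful. The sketch below is a crude check and not a formal test: it measures whether the size of the residuals tends to grow with the predicted value by correlating the two. The helper name `abs_residual_trend` and the toy numbers are hypothetical.

```python
import math

def abs_residual_trend(y_pred, residuals):
    """Pearson correlation between predictions and absolute residuals.

    Values near 0 suggest a roughly constant spread; values near +1 or -1
    suggest the spread changes with the prediction (heteroscedasticity).
    """
    abs_res = [abs(r) for r in residuals]
    n = len(y_pred)
    mean_p = sum(y_pred) / n
    mean_a = sum(abs_res) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(y_pred, abs_res))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    var_a = sum((a - mean_a) ** 2 for a in abs_res)
    if var_p == 0 or var_a == 0:
        return 0.0  # no variation, so no measurable trend
    return cov / math.sqrt(var_p * var_a)

preds_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
fan_shaped = [0.1, -0.2, 0.3, -0.4, 0.5]   # spread grows with the prediction
even_spread = [0.3, -0.1, 0.2, -0.1, 0.3]  # spread unrelated to the prediction

print(abs_residual_trend(preds_toy, fan_shaped))
print(abs_residual_trend(preds_toy, even_spread))
```

With the arrays from Block 5, the call would be `abs_residual_trend(list(y_pred), list(residuals))`. A clearly non-zero value is a prompt to look more closely at the plot, not a verdict in itself.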
Summary
By applying stepwise regression to water treatment cost prediction, we isolate the most impactful cost drivers (for example turbidity, organic carbon, or chloramine levels, depending on which predictors the selection retains) while pruning less informative variables.
The final linear model aims to balance interpretability (few, statistically significant predictors) with predictive strength (judged by test‑set R² and RMSE), giving water‑treatment managers a transparent forecasting tool to optimise chemical dosing, anticipate budget needs, and improve operational efficiency.