Water Usage Prediction using Linear Regression in ML

Municipal utilities must anticipate how many millions of litres of potable water residents will draw tomorrow so they can schedule pumping stations, balance storage tanks, and avoid both costly overproduction and service interruptions. In this hands-on mini-project, we build a linear regression baseline that predicts the next-day total water usage (m³) for a service region from easily logged variables: the previous day's consumption, rainfall, and average temperature, plus the day of the week and the month.

Although utilities often graduate to ARIMA, gradient-boosted trees, or hybrid demand–weather models, starting with a transparent linear fit exposes first-order drivers and provides a yardstick against which every more sophisticated algorithm must prove its worth.

Libraries Required

pandas # time‑series wrangling
numpy # numeric helpers
matplotlib.pyplot # quick sanity plots
scikit‑learn # preprocessing, model, metrics
joblib # persist the trained pipeline

Dataset Link

Water Consumption Forecasting Dataset

Step-by-Step Code Implementation

Why linear regression? Within normal operating ranges, municipal demand responds roughly linearly to temperature (hot days boost outdoor use), rainfall (wet days reduce lawn watering), and habitual weekday patterns. A linear fit makes these sensitivities explicit as one coefficient per driver.

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the data

df = pd.read_csv("daily_water_consumption.csv")   # rename after download
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')                       # ensure chronological order

3. Shift‑based target

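shift(-1) moves tomorrow's consumption onto today's row as the label, so each row pairs information available today with the value we want to forecast. The last row has no "tomorrow" and is dropped.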

# we want to predict tomorrow’s usage using information available today
df['target_usage_m3'] = df['consumption_m3'].shift(-1)
df = df.dropna(subset=['target_usage_m3']).copy()
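
To make the alignment concrete, here is a minimal sketch on a four-row toy frame. The values are made up for illustration; only the column names follow the tutorial.

```python
import pandas as pd

# Toy data (made-up values) to show what shift(-1) does to the label column.
toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=4, freq="D"),
    "consumption_m3": [100.0, 110.0, 105.0, 120.0],
})

# Each row receives the NEXT day's consumption as its target.
toy["target_usage_m3"] = toy["consumption_m3"].shift(-1)

# The final row has no "tomorrow", so its target is NaN and is dropped.
toy = toy.dropna(subset=["target_usage_m3"]).copy()
print(toy)
```

The row for 1 January now carries 110.0 (the consumption on 2 January) as its target, and the 4 January row disappears because its label is unknown.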

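4. Feature engineering and standard scaling

This step builds the calendar and lag features. Standard scaling, which the pipeline in step 6 applies, places consumption, rainfall, and temperature on comparable variance scales so the fitted coefficients measure impact per standard deviation, which is handy when you brief city councils.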

# calendar effects
df['dayofweek'] = df['date'].dt.dayofweek        # 0 = Mon … 6 = Sun
df['month']     = df['date'].dt.month

# lag features: "previous day" is relative to the target day, so these are
# today's values, already aligned with tomorrow's label by the shift(-1) above
df['consumption_prev_day'] = df['consumption_m3']
df['rain_prev_day_mm']     = df['rainfall_mm']
df['temp_prev_day_c']      = df['avg_temp_c']

# drop rows with a missing value in any column
df = df.dropna()
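
As a rough illustration of what standard scaling does to a numeric column, the sketch below runs StandardScaler on three made-up temperatures. In the tutorial the scaler runs inside the pipeline, so it is fitted on the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up daily average temperatures in °C, one column, three rows.
temps_c = np.array([[10.0], [20.0], [30.0]])

scaler = StandardScaler().fit(temps_c)
scaled = scaler.transform(temps_c)

print("mean used :", scaler.mean_)    # column mean learned during fit
print("scale used:", scaler.scale_)   # column standard deviation learned during fit
print("scaled    :", scaled.ravel())  # zero mean, unit variance
```

After scaling, a coefficient on this column reads as "change in predicted m³ per one standard deviation of temperature" rather than per degree.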

5. Calendar one‑hots

Calendar one‑hots (dayofweek, month) capture recurring seasonality without forcing demand to rise or fall in proportion to the calendar number; month 12 is not "twelve times" month 1, and December sits next to January.

num_cols = ['consumption_prev_day',
            'rain_prev_day_mm',
            'temp_prev_day_c']

cat_cols = ['dayofweek', 'month']      # will be one‑hot encoded
target   = 'target_usage_m3'

X = df[num_cols + cat_cols]
y = df[target]

6. Pre‑processing & model pipeline

preproc = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline([
        ('prep',  preproc),
        ('model', linreg)
])

7. Train‑test split & training

A train–test split without shuffling protects against look-ahead bias; the model only sees past data during fitting.

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False)   # keep time order – no leakage

pipe.fit(X_train, y_train)
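
To see what shuffle=False does, the sketch below splits ten consecutive day indices (a stand-in for real dates) and checks that the test set is simply the tail of the series.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten consecutive days, represented by their index 0..9.
days = np.arange(10)

train_days, test_days = train_test_split(days, test_size=0.2, shuffle=False)

print("train:", train_days)   # the earliest 80 % of the days
print("test :", test_days)    # the most recent 20 % of the days
```

Every training day precedes every test day, which mirrors how the model would be used in production: fit on history, forecast what comes next.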

8. Performance metrics

R² shows how much variance we explain; MAE in cubic metres gives operations a gut‑level feel (“Our typical miss is about 2,500 m³—roughly one hour of pumping”).

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.0f} m³")
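
A useful sanity check, not part of the original workflow, is a naive persistence forecast: predict that tomorrow's usage equals today's. The sketch below uses made-up numbers; with the tutorial's variables the two arrays would correspond to X_test['consumption_prev_day'] and y_test.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up test-period values in m³ (assumption: purely illustrative).
today_usage    = np.array([1000.0, 1100.0, 1050.0, 1200.0])
tomorrow_usage = np.array([1100.0, 1050.0, 1200.0, 1150.0])

# Persistence forecast: tomorrow will look like today.
naive_pred = today_usage
naive_mae = mean_absolute_error(tomorrow_usage, naive_pred)

print(f"Naive MAE : {naive_mae:.1f} m³")
```

If the regression's MAE is not clearly below this figure, the weather and calendar features are adding little beyond what yesterday's meter reading already tells you.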

9. Inspect coefficients

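The coefficient table highlights levers: a large positive weight on dayofweek_4 (Friday, since days are coded 0 = Monday to 6 = Sunday) suggests Friday demand runs high; a negative weight on rain_prev_day_mm shows rain knocking usage down. Because the numeric features are standardized, that weight is in m³ per standard deviation of rainfall, not per millimetre. The one-hot weights are best read relative to one another, since every row has exactly one weekday and one month switched on.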

ohe_feats = pipe.named_steps['prep']\
                .named_transformers_['cat']\
                .get_feature_names_out(cat_cols)

feature_names = list(ohe_feats) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_, index=feature_names)\
           .sort_values()

print("\nDemand‑reducing factors:")
print(coefs.head(5))
print("\nDemand‑increasing factors:")
print(coefs.tail(5))
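
To express a standardized coefficient in original units (for example m³ per millimetre of rain), divide it by the scaler's standard deviation for that column. The sketch below demonstrates this on synthetic data with a known relationship; the toy pipeline and its step names ("scale", "model") are assumptions for illustration. In the tutorial's pipeline the fitted scaler would be pipe.named_steps['prep'].named_transformers_['num'].

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a known, noise-free relationship (assumption):
# usage = 1000 - 5 * rain_mm + 20 * temp_c
rng = np.random.default_rng(0)
rain_mm = rng.uniform(0, 20, size=200)
temp_c = rng.uniform(5, 35, size=200)
X = np.column_stack([rain_mm, temp_c])
y = 1000.0 - 5.0 * rain_mm + 20.0 * temp_c

toy_pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
toy_pipe.fit(X, y)

# Coefficients as fitted: m³ per standard deviation of each feature.
std_coefs = toy_pipe.named_steps["model"].coef_

# Divide by each feature's standard deviation to get m³ per original unit.
per_unit = std_coefs / toy_pipe.named_steps["scale"].scale_

print("per std dev:", std_coefs)
print("per unit   :", per_unit)   # approximately [-5, 20]
```

The per-unit values recover the slopes used to generate the data, which is the form most readers expect when they hear "every millimetre of rain".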

10. Persist the trained pipeline

Model persistence with joblib freezes both preprocessing and coefficients, so tomorrow’s ETL job can call joblib.load and emit a forecast in seconds.

joblib.dump(pipe, "water_usage_linreg.pkl")
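
The round trip can be sketched as follows: fit a small pipeline, save it, load it back, and forecast from a single new row. The training rows and the new row are made up, and the file is written to a temporary directory so the example leaves nothing behind. Column names follow the tutorial.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["consumption_prev_day", "rain_prev_day_mm", "temp_prev_day_c"]
cat_cols = ["dayofweek", "month"]

# Made-up training rows, only to obtain a fitted pipeline to save.
train = pd.DataFrame({
    "consumption_prev_day": [1000.0, 1100.0, 1050.0, 1200.0, 1150.0, 990.0],
    "rain_prev_day_mm":     [0.0, 2.0, 0.0, 5.0, 1.0, 0.0],
    "temp_prev_day_c":      [25.0, 27.0, 24.0, 30.0, 28.0, 22.0],
    "dayofweek":            [0, 1, 2, 3, 4, 5],
    "month":                [7, 7, 7, 7, 7, 7],
})
target = np.array([1100.0, 1050.0, 1200.0, 1150.0, 990.0, 1020.0])

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])),
    ("model", LinearRegression()),
])
pipe.fit(train, target)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "water_usage_linreg.pkl")
    joblib.dump(pipe, path)       # saves preprocessing and model together
    loaded = joblib.load(path)    # what a scheduled job would do each day

# Today's observations (made up); the loaded pipeline handles all encoding.
today = pd.DataFrame([{
    "consumption_prev_day": 1080.0,
    "rain_prev_day_mm": 1.5,
    "temp_prev_day_c": 26.0,
    "dayofweek": 2,
    "month": 7,
}])
forecast = loaded.predict(today)[0]
print(f"Forecast for tomorrow: {forecast:.0f} m³")
```

Because the saved object is the whole pipeline, the caller passes raw columns and never has to re-create the scaler or the encoder. Load pickled files only from sources you trust, and keep the scikit-learn version consistent between training and serving.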

Summary

This concise workflow converts raw meter logs and weather data into an explainable predictor of next-day water demand. The linear model supplies two immediate wins:

Actionable forecasts for pump‑scheduling and reservoir‑balancing.
Crystal‑clear elasticities that reveal how rainfall, temperature, and weekday rhythms tug daily consumption.

Keep this baseline as your benchmark; if you later deploy ARIMA, Prophet, or gradient‑boosted trees, you’ll know exactly how much real‑world accuracy the extra complexity adds.
