Water Usage Prediction using Linear Regression in ML
Municipal utilities must anticipate how much potable water residents will draw tomorrow so they can schedule pumping stations, balance storage tanks, and avoid both costly overproduction and service interruptions. In this hands-on mini-project, we build a linear regression baseline that predicts the next-day total water usage (m³) for a service region from easily logged variables: the most recent day's consumption, rainfall, and average temperature, plus day of the week and month.
Although utilities often graduate to ARIMA, gradient-boosted trees, or hybrid demand–weather models, starting with a transparent linear fit exposes first-order drivers and provides a yardstick against which every more sophisticated algorithm must prove its worth.
Libraries Required
- pandas # time‑series wrangling
- numpy # numeric helpers
- matplotlib.pyplot # quick sanity plots
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained pipeline
Dataset Link
Water Consumption Forecasting Dataset
Step-by-Step Code Implementation
Why linear regression? Within normal operating ranges, municipal demand varies roughly linearly with temperature (hot days boost outdoor use) and rainfall (wet days reduce lawn watering), on top of habitual weekday patterns. A linear fit makes these sensitivities explicit as coefficients.
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
df = pd.read_csv("daily_water_consumption.csv") # rename after download
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date') # ensure chronological order
3. Shift‑based target
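shift(-1) copies tomorrow's consumption into today's row as the label, so each row pairs information known today with the value we want to predict. This assumes one row per calendar day with no gaps. The final row has no tomorrow, so it is dropped.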
# we want to predict tomorrow's usage using information available today
df['target_usage_m3'] = df['consumption_m3'].shift(-1)
df = df.dropna(subset=['target_usage_m3']).copy()
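To see what the shift does, here is a minimal sketch on a four-day toy frame. The consumption figures are made up for illustration and are not taken from the dataset; only the column names follow the tutorial.

```python
import pandas as pd

# Toy data: four consecutive days of invented consumption figures.
toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=4, freq="D"),
    "consumption_m3": [100.0, 110.0, 120.0, 130.0],
})

# shift(-1) pulls the next row's value up into the current row.
toy["target_usage_m3"] = toy["consumption_m3"].shift(-1)

# The last day has no "tomorrow", so its target is NaN and the row is dropped.
toy_labeled = toy.dropna(subset=["target_usage_m3"]).copy()
print(toy_labeled)
```

Each remaining row now holds one day's consumption next to the following day's consumption as its label.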
4. Feature engineering and standard scaling
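Before modelling, we derive calendar and lag features from the raw columns. Standard scaling, which the pipeline applies in step 6, places consumption, rainfall, and temperature on comparable variance scales so the fitted coefficients measure impact per standard deviation, which is handy when you brief city councils.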
# calendar effects
df['dayofweek'] = df['date'].dt.dayofweek  # 0 = Mon ... 6 = Sun
df['month'] = df['date'].dt.month
# lag features: "previous day" is relative to the target date. The target is
# tomorrow's usage, so today's readings already are the one-day lag and need
# no further shift (the original shift(0) was a no-op).
df['consumption_prev_day'] = df['consumption_m3']  # today's actual
df['rain_prev_day_mm'] = df['rainfall_mm']
df['temp_prev_day_c'] = df['avg_temp_c']
# drop any rows that still contain missing values
df = df.dropna()
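As a quick check of the calendar encoding, the sketch below derives both fields for three hand-picked dates. The dates are arbitrary examples chosen for illustration, not rows from the dataset.

```python
import pandas as pd

# Three arbitrary dates: a Monday, a Saturday, and a Sunday.
cal = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-06", "2024-02-04"]),
})

# pandas numbers weekdays from 0 (Monday) to 6 (Sunday).
cal["dayofweek"] = cal["date"].dt.dayofweek
# Months run from 1 (January) to 12 (December).
cal["month"] = cal["date"].dt.month
print(cal)
```

Note that both fields describe the date of the feature row, which is one day before the date being predicted.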
5. Calendar one‑hots
Calendar one-hots (dayofweek, month) capture recurring seasonality without forcing demand to rise or fall in proportion to the weekday or month number, which a single numeric column would imply.
num_cols = ['consumption_prev_day',
            'rain_prev_day_mm',
            'temp_prev_day_c']
cat_cols = ['dayofweek', 'month'] # will be one‑hot encoded
target = 'target_usage_m3'
X = df[num_cols + cat_cols]
y = df[target]
6. Pre‑processing & model pipeline
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline([
    ('prep', preproc),
    ('model', linreg)
])
7. Train‑test split & training
A train–test split without shuffling protects against look-ahead bias; the model only sees past data during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)  # keep time order, no leakage
pipe.fit(X_train, y_train)
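A small sketch of what `shuffle=False` does, using ten numbered days as a stand-in for the real rows: the first 80% become the training set and the final 20% the test set, so every test day comes after every training day.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for ten consecutive days, 0 = oldest (illustrative only).
days = np.arange(10)
X_toy = days.reshape(-1, 1)
y_toy = days * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=False)

print("train days:", X_tr.ravel())
print("test days :", X_te.ravel())
```

This only works as a chronological split if the frame is already sorted by date, which step 2 takes care of.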
8. Performance metrics
R² shows the share of test-set variance the model explains; MAE in cubic metres gives operations a gut-level feel (for example, "Our typical miss is about 2,500 m³, roughly one hour of pumping").
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.0f} m³")
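To make the two metrics concrete, this sketch computes them by hand on three invented observations and compares the result with scikit-learn. The numbers are illustrative and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Three invented actual/predicted pairs (m³).
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 310.0])

# MAE: the average absolute miss.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: one minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MAE manual={mae_manual:.1f}  sklearn={mean_absolute_error(y_true, y_hat):.1f}")
print(f"R²  manual={r2_manual:.3f}  sklearn={r2_score(y_true, y_hat):.3f}")
```

Here every prediction misses by 10 m³, so the MAE is 10, while R² stays high because the misses are small relative to the spread of the actual values.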
9. Inspect coefficients
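The coefficient table highlights the main levers. A large positive weight on dayofweek_4 (Friday, since weekdays are encoded 0 to 6) means demand on the day after a Friday tends to run higher, because dayofweek describes the feature date while the target is the next day. A negative weight on rain_prev_day_mm quantifies how much one standard deviation of rainfall lowers usage, since the numeric inputs were standardised.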
ohe_feats = pipe.named_steps['prep']\
    .named_transformers_['cat']\
    .get_feature_names_out(cat_cols)
feature_names = list(ohe_feats) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_, index=feature_names)\
    .sort_values()
print("\nDemand‑reducing factors:")
print(coefs.head(5))
print("\nDemand‑increasing factors:")
print(coefs.tail(5))
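Because the numeric inputs pass through StandardScaler, their coefficients are expressed per standard deviation. One way to recover a per-unit figure (for example, m³ per millimetre of rain) is to divide by the scaler's `scale_`. The sketch below demonstrates this on synthetic, noise-free data in which usage falls by exactly 12 m³ per mm; the data and the names `toy_pipe`, `rain_mm`, and `usage` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: usage drops by exactly 12 m³ for every mm of rain.
rng = np.random.default_rng(0)
rain_mm = rng.uniform(0, 30, size=200).reshape(-1, 1)
usage = 5000.0 - 12.0 * rain_mm.ravel()

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
toy_pipe.fit(rain_mm, usage)

# Coefficient as fitted: change in usage per standard deviation of rainfall.
coef_per_sd = toy_pipe.named_steps["model"].coef_[0]
# The scaler stores each column's standard deviation in scale_.
sd = toy_pipe.named_steps["scale"].scale_[0]
# Dividing converts back to the original unit (m³ per mm).
coef_per_mm = coef_per_sd / sd

print(f"per SD: {coef_per_sd:.1f} m³, per mm: {coef_per_mm:.1f} m³")
```

In the tutorial's pipeline, the fitted scaler should be reachable as `pipe.named_steps['prep'].named_transformers_['num']`, so the same division can be applied to the three numeric coefficients.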
10. Persist the trained pipeline
Model persistence with joblib freezes both preprocessing and coefficients, so tomorrow’s ETL job can call joblib.load and emit a forecast in seconds.
joblib.dump(pipe, "water_usage_linreg.pkl")
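The round trip can be sketched as follows, with a tiny stand-in pipeline and a temporary file instead of the tutorial's model. The names `demo_pipe` and `demo_linreg.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline fitted on an exact linear relation (y = 10 * x).
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([10.0, 20.0, 30.0, 40.0])
demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X_demo, y_demo)

# Save to a temporary file, then load it back as a separate object.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_linreg.pkl")
    joblib.dump(demo_pipe, path)
    restored = joblib.load(path)

# The restored pipeline carries both the scaler statistics and the coefficients.
pred_before = demo_pipe.predict([[5.0]])
pred_after = restored.predict([[5.0]])
print(pred_before, pred_after)
```

When loading a saved pipeline in production, the input must have the same columns the pipeline was trained on, and the scikit-learn version should match the one used for saving.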
Summary
This concise workflow converts raw meter logs and weather data into an explainable predictor of next-day water demand. The linear model supplies two immediate wins:
- Actionable forecasts for pump‑scheduling and reservoir‑balancing.
- Transparent sensitivities that show how rainfall, temperature, and weekday rhythms shift daily consumption.
Keep this baseline as your benchmark; if you later deploy ARIMA, Prophet, or gradient‑boosted trees, you’ll know exactly how much real‑world accuracy the extra complexity adds.