Energy Usage Prediction using Linear Regression in ML
Commercial buildings account for a large slice of global electricity demand. Facility managers need tomorrow’s energy‑use estimate—not yesterday’s bill—to schedule chillers, negotiate power purchases, and spot waste.
In this hands‑on mini‑project, we build a linear‑regression baseline that predicts a building’s hourly meter reading (kWh) from weather, calendar information, and static building characteristics. While advanced models often improve accuracy, a transparent linear fit surfaces the first‑order drivers of consumption and provides a benchmark for future work.
Libraries Required
- pandas # data wrangling
- numpy # numerical helpers
- matplotlib.pyplot # sanity‑check visuals
- seaborn # quick EDA plots (optional)
- scikit‑learn # split, pipeline, model, metrics
- joblib # save the trained pipeline
Dataset Link
ASHRAE – Great Energy Predictor III (Kaggle competition dataset)
Step by Step Code Implementation
Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # optional
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
Load the data
Merging tables – Meter readings, weather, and building metadata live in separate CSVs. Joining the meter table to the building metadata on building_id, and then to the weather table on site_id and timestamp, brings everything into one tidy frame.
Download train.csv, weather_train.csv, and building_metadata.csv
# base tables
meter = pd.read_csv("train.csv") # meter readings
weather = pd.read_csv("weather_train.csv") # hourly weather
bmeta = pd.read_csv("building_metadata.csv") # static building info
# merge: meter → building metadata → weather
df = (
    meter.merge(bmeta, on="building_id", how="left")
         .merge(weather, on=["site_id", "timestamp"], how="left")
)
# quick look
print(df.head())
print(df.shape)
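A left join can silently duplicate meter rows if a key is not unique on the right-hand table. The sketch below uses tiny made-up frames (not the real CSVs; every value is invented for illustration) and pandas' `validate` argument to assert the expected many-to-one relationship at each step.

```python
import pandas as pd

# Toy stand-ins for the three CSVs (all values are invented)
meter = pd.DataFrame({
    "building_id": [0, 0, 1],
    "meter": [0, 0, 0],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00",
                  "2016-01-01 00:00:00"],
    "meter_reading": [10.0, 12.0, 50.0],
})
bmeta = pd.DataFrame({
    "site_id": [0, 1],
    "building_id": [0, 1],
    "primary_use": ["Office", "Education"],
    "square_feet": [5000, 20000],
})
weather = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00"],
    "air_temperature": [3.0, 2.5],
})

# validate="many_to_one" raises MergeError if the right-hand keys repeat
merged = (
    meter
    .merge(bmeta, on="building_id", how="left", validate="many_to_one")
    .merge(weather, on=["site_id", "timestamp"], how="left",
           validate="many_to_one")
)

# A left join keeps every meter row, so the row count must not change
rows_preserved = len(merged) == len(meter)

# Meter rows without a matching weather record get NaN, not dropped
missing_weather = int(merged["air_temperature"].isna().sum())
print(rows_preserved, missing_weather)
```

On the real data, the count of missing weather rows indicates how much the cleaning step will have to drop or impute.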
Basic cleaning
Why a log target? Energy readings vary over orders of magnitude between a small office and a vast hospital. log1p compresses extremes, helping linear regression fit residuals more evenly.
# convert timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])
# drop rows still missing any weather field used as a predictor
# (LinearRegression cannot fit on NaN values)
weather_cols = ['air_temperature', 'dew_temperature',
                'cloud_coverage', 'precip_depth_1_hr']
df = df.dropna(subset=weather_cols)
# optional: log‑transform meter readings to stabilise variance
df['log_meter_reading'] = np.log1p(df['meter_reading'])
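To make the log target concrete, here is a minimal sketch with made-up readings (the values are assumptions, not taken from the dataset). It shows that `np.log1p` is defined for zero readings and that `np.expm1` inverts it, which is what you need later to turn predictions back into kWh.

```python
import numpy as np

# Invented readings spanning several orders of magnitude (kWh)
readings = np.array([0.0, 5.0, 250.0, 12000.0])

log_readings = np.log1p(readings)    # log(1 + x), safe when x == 0
recovered = np.expm1(log_readings)   # exp(x) - 1, the inverse of log1p

print(log_readings.round(3))
print(np.allclose(recovered, readings))
```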
Feature engineering
Calendar variables capture daily and seasonal rhythms that dominate demand—think weekday occupancy peaks or summer cooling loads.
# calendar cues
df['hour'] = df['timestamp'].dt.hour
df['dayofweek'] = df['timestamp'].dt.dayofweek # 0‑Mon … 6‑Sun
df['month'] = df['timestamp'].dt.month
# predictor lists
num_cols = ['square_feet', 'air_temperature', 'dew_temperature',
'cloud_coverage', 'precip_depth_1_hr',
'hour', 'dayofweek', 'month']
cat_cols = ['primary_use', 'meter', 'site_id']
target = 'log_meter_reading'
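As a quick sanity check of the calendar accessors, the sketch below applies them to two hand-picked timestamps. The `is_weekend` flag is a hypothetical extra shown only for illustration; it is not part of the feature list above.

```python
import pandas as pd

ts = pd.to_datetime(pd.Series([
    "2016-01-04 08:00:00",   # a Monday morning
    "2016-07-09 15:00:00",   # a Saturday afternoon
]))

cal = pd.DataFrame({
    "hour": ts.dt.hour,
    "dayofweek": ts.dt.dayofweek,   # 0 = Monday ... 6 = Sunday
    "month": ts.dt.month,
})

# Hypothetical extra feature: flag Saturdays and Sundays
cal["is_weekend"] = cal["dayofweek"] >= 5

print(cal)
```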
Pre‑processing & model pipeline
- One‑hot encoding turns categorical flags (primary_use, meter type, site) into neutral binary columns so the model can learn a unique offset for each without imposing a false ordering.
- Pipeline design ensures preprocessing and coefficient fitting travel together—vital for repeatable inference and for exporting the model into production jobs.
ohe = OneHotEncoder(handle_unknown='ignore')
preproc = ColumnTransformer([
('cat', ohe, cat_cols)
], remainder='passthrough') # numeric columns pass through
linreg = LinearRegression(n_jobs=-1)
pipe = Pipeline(steps=[
('prep', preproc),
('model', linreg)
])
Train/test split and training
X = df[num_cols + cat_cols]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
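A shuffled split mixes past and future hours of the same building, which can make test scores look better than a real forecast would be. One alternative, shown here as a sketch on an invented toy frame rather than as the tutorial's method, is a chronological split: sort by timestamp and hold out the most recent rows.

```python
import pandas as pd

# Invented daily series, shuffled to mimic rows arriving out of order
toy = pd.DataFrame({
    "timestamp": pd.date_range("2016-01-01", periods=10, freq="D"),
    "value": range(10),
}).sample(frac=1.0, random_state=0)

toy = toy.sort_values("timestamp")
split_at = int(len(toy) * 0.8)       # assumed 80/20 split

train = toy.iloc[:split_at]          # earliest 80 % of rows
test = toy.iloc[split_at:]           # most recent 20 % of rows

print(train["timestamp"].max(), test["timestamp"].min())
```

With this layout, every test timestamp is later than every training timestamp, which is closer to how the model would be used in operation.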
Evaluation
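Performance metrics – R² shows the share of variance in the target that the model explains. MAE is the average absolute error. With the log‑transformed target, MAE is on the log scale, not in kWh, so convert predictions back with np.expm1 when planners need an error band in kWh, for example when sizing electricity purchase contracts.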
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.3f} (log‑scale)")
(If you skipped the log transform, report MAE in kWh instead.)
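If you kept the log transform but still want an error in kWh, one option is to invert both the true values and the predictions with `np.expm1` before scoring. The sketch below uses invented numbers in place of `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented stand-ins for y_test and y_pred on the log1p scale
y_test_log = np.log1p(np.array([10.0, 100.0, 1000.0]))
y_pred_log = y_test_log + np.array([0.1, -0.1, 0.05])

mae_log = mean_absolute_error(y_test_log, y_pred_log)

# Back-transform to kWh, then score on the original scale
mae_kwh = mean_absolute_error(np.expm1(y_test_log), np.expm1(y_pred_log))

print(f"MAE (log scale): {mae_log:.3f}")
print(f"MAE (kWh)      : {mae_kwh:.3f}")
```

The same log-scale error translates into a much larger kWh error for large buildings, so the two numbers answer different questions.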
Inspect top coefficients
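Positive coefficients mark features associated with higher consumption (e.g., larger square_feet, or high air_temperature where cooling dominates), while negative coefficients mark features associated with lower consumption. Two caveats apply. The coefficients are on the log scale of the target, and the numeric features are not standardised, so a coefficient's size depends on the feature's units (square_feet runs into the thousands, cloud_coverage stays in single digits). Compare magnitudes only among features on the same scale, such as the one‑hot columns.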
# recover the feature names produced by OneHotEncoder
ohe_feats = pipe.named_steps['prep']\
.named_transformers_['cat']\
.get_feature_names_out(cat_cols)
feature_names = list(ohe_feats) + num_cols
coef_series = pd.Series(pipe.named_steps['model'].coef_,
index=feature_names).sort_values()
print("Largest positive drivers:")
print(coef_series.tail(10))
print("\nLargest negative drivers:")
print(coef_series.head(10))
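Because the target is log1p(meter_reading), a coefficient b means that a one-unit increase in the feature multiplies (1 + reading) by exp(b), holding the other features fixed. The sketch below converts a few invented coefficients (not results from this model) into approximate percentage changes.

```python
import numpy as np
import pandas as pd

# Invented coefficients for illustration only
coefs = pd.Series({
    "primary_use_Office": 0.30,
    "air_temperature": 0.02,
    "site_id_3": -0.50,
})

# exp(b) - 1 is the relative change in (1 + reading) per unit of the feature
pct_change = (np.exp(coefs) - 1) * 100

print(pct_change.round(1))
```

For small coefficients the percentage is close to 100 × b; for larger ones the exponential matters.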
Persist the trained pipeline
Persisting with joblib freezes both the encoder and the regression weights, guarding against column‑order mix‑ups when scoring tomorrow’s sensor feed.
joblib.dump(pipe, "building_energy_linreg.pkl")
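Reloading is the mirror image of the dump. The sketch below fits a tiny pipeline on invented rows, saves it to a temporary directory (the file name `toy_linreg.pkl` is an assumption), reloads it with `joblib.load`, and checks that both objects give the same prediction.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented training rows and log-scale targets
toy = pd.DataFrame({
    "air_temperature": [3.0, 25.0, 18.0, 30.0],
    "primary_use": ["Office", "Education", "Office", "Education"],
})
y = np.log1p([10.0, 40.0, 20.0, 55.0])

toy_pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["primary_use"])],
        remainder="passthrough")),
    ("model", LinearRegression()),
]).fit(toy, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_linreg.pkl")
    joblib.dump(toy_pipe, path)      # encoder and weights in one file
    restored = joblib.load(path)

# Score a new row with the same raw columns used at fit time
new_rows = pd.DataFrame({"air_temperature": [21.0], "primary_use": ["Office"]})
same = np.allclose(toy_pipe.predict(new_rows), restored.predict(new_rows))
print(same)
```

Only load pickle files you created or trust, and reload with the same scikit-learn version that wrote the file where possible.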
Summary
Starting from raw meter logs, this exercise walks through a data‑to‑insight pipeline for hourly building energy usage prediction. With minimal cleaning, calendar and weather features, and a linear model, we can already flag the main consumption drivers and produce a baseline hourly forecast for operations teams to evaluate against their own data. Keep this interpretable baseline as your yardstick: when you graduate to regularised regressors or gradient‑boosted trees, you can measure how much the added complexity actually improves on it.