Hotel Occupancy Rate Prediction using Linear Regression in ML
Hotel revenue managers live and die by occupancy rate—the share of rooms filled each night. Accurate next‑day forecasts steer overbooking strategy, staffing levels, and dynamic pricing decisions.
Using the open‑access Hotel Booking Demand dataset—which logs every reservation for a 2015‑2017 city hotel and resort hotel—we build a linear‑regression baseline that predicts each property’s daily occupancy rate from booking lead‑time, seasonality, market segment mix, and other readily captured signals. A transparent model surfaces first-order demand drivers and establishes a benchmark before experimenting with more complex algorithms.
Libraries Required
- pandas # data wrangling & aggregation
- numpy # numerical helpers
- matplotlib.pyplot # quick sanity plots
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained pipeline
Dataset Link
Step by Step Code Implementation
Why linear regression first?
Within normal demand bands, occupancy reacts almost linearly to seasonality (month, weekday), lead‑time shifts, and average daily rate (adr). A straight‑line fit offers interpretable weights for each driver—ideal for revenue managers’ daily stand‑ups.
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load & basic cleaning
df = pd.read_csv("hotel_bookings.csv") # rename file after download
# keep only confirmed stays (not cancelled)
df = df[df['is_canceled'] == 0].copy()
3. Construct an arrival‑date column
df['arrival_date'] = pd.to_datetime(
df['arrival_date_year'].astype(str) + '-' +
df['arrival_date_month'] + '-' +
df['arrival_date_day_of_month'].astype(str),
format="%Y-%B-%d")
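As a quick sanity check on the format string, the sketch below applies the same concatenation to a two-row toy frame. The rows are invented for illustration, not taken from the dataset, and `%B` assumes the active locale uses English month names.

```python
import pandas as pd

# Toy rows (invented) shaped like the dataset's three date columns
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1, 29],
})

# Same concatenation and format string as in step 3
toy["arrival_date"] = pd.to_datetime(
    toy["arrival_date_year"].astype(str) + "-"
    + toy["arrival_date_month"] + "-"
    + toy["arrival_date_day_of_month"].astype(str),
    format="%Y-%B-%d",
)
print(toy["arrival_date"].tolist())
```

An impossible date (for example 30 February) raises an error here instead of being silently coerced, which is usually what you want at this stage.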
4. Aggregate to daily hotel‑level demand
Collapsing the booking rows (about 119,000 in the raw file, fewer once cancellations are dropped) into daily summaries smooths out booking‑level noise and aligns the prediction target (day‑level occupancy) with typical operational decisions.
daily = (df
         .groupby(['hotel', 'arrival_date'])
         .agg(bookings=('hotel', 'size'),
              lead_time_mean=('lead_time', 'mean'),
              adr_mean=('adr', 'mean'),
              week_nights=('stays_in_week_nights', 'sum'),
              weekend_nights=('stays_in_weekend_nights', 'sum'),
              special_req_avg=('total_of_special_requests', 'mean'))
         .reset_index())
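Grouping by arrival date counts arrivals, not rooms in use: a guest who arrives on Monday for three nights also occupies a room on Tuesday and Wednesday. The sketch below shows one possible alternative, not the method used in this walkthrough. It assumes each booking holds exactly one room for every night of its stay; the helper name `rooms_in_house` and the toy rows are hypothetical.

```python
import pandas as pd

def rooms_in_house(bookings: pd.DataFrame) -> pd.DataFrame:
    """Count bookings in house per hotel and night (hypothetical helper).

    Assumes one room per booking and a stay of
    stays_in_week_nights + stays_in_weekend_nights consecutive nights
    starting on arrival_date.
    """
    b = bookings.reset_index(drop=True)
    nights = (b["stays_in_week_nights"] + b["stays_in_weekend_nights"]).to_numpy()
    # Repeat each booking once per night stayed (zero-night bookings drop out)
    rep = b.loc[b.index.repeat(nights)].copy()
    # 0, 1, 2, ... within each original booking gives the night offset
    offset = rep.groupby(level=0).cumcount()
    rep["night"] = rep["arrival_date"] + pd.to_timedelta(offset, unit="D")
    return (rep.groupby(["hotel", "night"])
               .size()
               .rename("rooms_occupied")
               .reset_index())

# Invented example: a 3-night stay from 1 March and a 1-night stay on 2 March
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel"],
    "arrival_date": pd.to_datetime(["2016-03-01", "2016-03-02"]),
    "stays_in_week_nights": [3, 1],
    "stays_in_weekend_nights": [0, 0],
})
result = rooms_in_house(toy)
print(result)
```

On the toy rows, 2 March shows two rooms occupied even though only one booking arrived that day, which is the gap between an arrivals count and an in-house count.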
5. Compute a proxy capacity & occupancy rate
# historical peak of daily arrivals ≈ capacity for each property
# (a rough proxy: it counts arrivals per day, not rooms in house)
capacity = daily.groupby('hotel')['bookings'].transform('max')
daily['occupancy_rate'] = daily['bookings'] / capacity
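To see what the proxy does, the sketch below runs the same `transform('max')` step on a five-row toy table. The booking counts are invented; the point is only that each hotel is scaled by its own peak day.

```python
import pandas as pd

# Invented daily booking counts for two properties
toy_daily = pd.DataFrame({
    "hotel": ["City Hotel"] * 3 + ["Resort Hotel"] * 2,
    "bookings": [80, 100, 50, 20, 40],
})

# transform('max') broadcasts each hotel's peak back onto its own rows
capacity = toy_daily.groupby("hotel")["bookings"].transform("max")
toy_daily["occupancy_rate"] = toy_daily["bookings"] / capacity
print(toy_daily)
```

By construction the busiest day of each hotel scores exactly 1.0, so a single unusual peak lowers every other day's rate for that property.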
6. Feature engineering
month, dayofweek, and is_weekend capture calendar seasonality; lead_time_mean warns of last‑minute demand spikes or lulls; adr_mean often signals price‑sensitive dips in demand.
daily['month'] = daily['arrival_date'].dt.month
daily['dayofweek'] = daily['arrival_date'].dt.dayofweek
daily['is_weekend'] = daily['dayofweek'].isin([5,6]).astype(int)
features = ['hotel', 'month', 'dayofweek', 'is_weekend',
'lead_time_mean', 'adr_mean',
'week_nights', 'weekend_nights',
'special_req_avg']
target = 'occupancy_rate'
X = daily[features]
y = daily[target]
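The calendar features depend on pandas numbering Monday as 0 and Sunday as 6. A small check on four invented dates (a Friday through a Monday) confirms that `isin([5, 6])` flags Saturday and Sunday only.

```python
import pandas as pd

# Friday 6 Jan 2017 through Monday 9 Jan 2017
cal = pd.DataFrame({"arrival_date": pd.date_range("2017-01-06", periods=4, freq="D")})

cal["month"] = cal["arrival_date"].dt.month
cal["dayofweek"] = cal["arrival_date"].dt.dayofweek          # Monday=0 ... Sunday=6
cal["is_weekend"] = cal["dayofweek"].isin([5, 6]).astype(int)
print(cal)
```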
7. Pre‑processing & model pipeline
ohe = OneHotEncoder(handle_unknown='ignore')
preproc = ColumnTransformer([
('hotel_flag', ohe, ['hotel']) # encode city / resort
], remainder='passthrough')
linreg = LinearRegression()
pipe = Pipeline(steps=[('prep', preproc),
('model', linreg)])
8. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
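A shuffled split lets the model train on days that come after the days it is tested on, which can flatter a forecasting model. As one alternative, the sketch below holds out the most recent dates instead. The helper `chronological_split` is hypothetical, and the toy frame is invented for illustration.

```python
import pandas as pd

def chronological_split(frame, date_col="arrival_date", test_size=0.2):
    """Hold out the latest share of distinct dates as the test set (hypothetical helper)."""
    dates = frame[date_col].drop_duplicates().sort_values()
    n_test = max(1, round(len(dates) * test_size))
    cutoff = dates.iloc[-n_test]              # first date that belongs to the test set
    train = frame[frame[date_col] < cutoff]
    test = frame[frame[date_col] >= cutoff]
    return train, test

# Invented table: 10 days for each of two hotels
days = pd.date_range("2017-01-01", periods=10, freq="D")
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 10 + ["Resort Hotel"] * 10,
    "arrival_date": list(days) * 2,
    "occupancy_rate": [0.5] * 20,
})
train, test = chronological_split(toy)
print(len(train), "training rows,", len(test), "test rows")
```

Both hotels share the same cutoff date, so the test set always covers the same final stretch of the calendar for each property.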
9. Performance metrics
R² reveals how much of the day‑to‑day variance the model explains; MAE, expressed in occupancy points, gives managers the average size of a prediction miss. For example, an MAE of 0.04 means forecasts are off by four percentage points on average.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.3f} (absolute occupancy points)")
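To make the two metrics concrete, here is a hand calculation on four invented occupancy values, using the standard definitions MAE = mean(|y − ŷ|) and R² = 1 − SS_res / SS_tot. The numbers are illustrative only and are not results from the hotel data.

```python
import numpy as np

# Invented actual and predicted occupancy rates
y_true = np.array([0.50, 0.60, 0.70, 0.80])
y_hat = np.array([0.55, 0.60, 0.65, 0.90])

mae = np.mean(np.abs(y_true - y_hat))              # average absolute miss
ss_res = np.sum((y_true - y_hat) ** 2)             # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
r2 = 1 - ss_res / ss_tot
print(f"MAE: {mae:.3f}  R²: {r2:.3f}")
```

Here the misses of 0.05, 0, 0.05, and 0.10 average to an MAE of 0.05, and SS_res / SS_tot = 0.015 / 0.05 gives R² = 0.70.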
10. Inspect influential features
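Coefficient table. A positive weight means expected occupancy rises as that feature increases (e.g., the weekend flag), while a negative weight means occupancy falls as the feature increases (e.g., longer lead times). If lower prices lift occupancy, that appears as a negative weight on adr_mean, not a positive one. The features sit on different scales, so raw coefficient magnitudes are not directly comparable.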
# Get column names post‑encoding
encoded_hotels = (pipe.named_steps['prep']
                      .named_transformers_['hotel_flag']
                      .get_feature_names_out(['hotel']))
all_feats = list(encoded_hotels) + [
    'month', 'dayofweek', 'is_weekend',
    'lead_time_mean', 'adr_mean',
    'week_nights', 'weekend_nights', 'special_req_avg']
coef_series = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
                 .sort_values())
print("\nFactors decreasing expected occupancy:")
print(coef_series.head(6))
print("\nFactors increasing expected occupancy:")
print(coef_series.tail(6))
11. Model persistence
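Saving the pipeline with joblib freezes both the encoder and the regression weights, so tomorrow’s batch forecast encodes and scores its inputs identically. The pipeline expects the aggregated daily feature table, so steps 2 to 6 must still be applied to raw CSVs before calling predict.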
joblib.dump(pipe, "hotel_occupancy_linreg.pkl")
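A minimal round-trip sketch follows, assuming the saved file is reloaded with the same scikit-learn version that wrote it. It fits a small stand-in pipeline on invented rows, saves it to a temporary directory, reloads it, and checks that both copies score a new day identically.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented training rows using a subset of the tutorial's feature columns
toy_X = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
    "month": [1, 1, 7, 7],
    "adr_mean": [80.0, 60.0, 120.0, 150.0],
})
toy_y = np.array([0.55, 0.30, 0.90, 0.95])

toy_pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("hotel_flag", OneHotEncoder(handle_unknown="ignore"), ["hotel"])],
        remainder="passthrough")),
    ("model", LinearRegression()),
])
toy_pipe.fit(toy_X, toy_y)

# Save to a throwaway directory, then load the copy back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hotel_occupancy_linreg.pkl")
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

# New input must have the same columns the pipeline was fitted on
new_day = pd.DataFrame({"hotel": ["City Hotel"], "month": [7], "adr_mean": [110.0]})
same = np.allclose(toy_pipe.predict(new_day), restored.predict(new_day))
print("Restored pipeline matches original:", same)
```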
Summary
In fewer than 80 lines of Python, we turned raw booking logs into an explainable, day‑ahead occupancy‑rate predictor. The linear model delivers two immediate wins:
- Actionable forecasts to fine‑tune rate‑plans and staffing.
- Crystal‑clear coefficients that reveal which knobs—price, season, booking window—move the occupancy needle.
Keep this baseline as your yardstick; when you graduate to regularised regression, gradient‑boosted trees, or recurrent networks, you’ll know exactly how much extra accuracy the added complexity buys.