Hotel Occupancy Rate Prediction using Linear Regression in ML


Hotel revenue managers live and die by occupancy rate—the share of rooms filled each night. Accurate next‑day forecasts steer overbooking strategy, staffing levels, and dynamic pricing decisions.

Using the open‑access Hotel Booking Demand dataset—which logs every reservation for a 2015‑2017 city hotel and resort hotel—we build a linear‑regression baseline that predicts each property’s daily occupancy rate from booking lead‑time, seasonality, market segment mix, and other readily captured signals. A transparent model surfaces first-order demand drivers and establishes a benchmark before experimenting with more complex algorithms.

Libraries Required

pandas # data wrangling & aggregation
numpy # numerical helpers
matplotlib.pyplot # quick sanity plots
scikit‑learn # preprocessing, model, metrics
joblib # persist the trained pipeline

Dataset Link

Hotel Booking Demand

Step by Step Code Implementation

Why linear regression first?

Within normal demand bands, occupancy reacts almost linearly to seasonality (month, weekday), lead‑time shifts, and average daily rate (adr). A straight‑line fit offers interpretable weights for each driver—ideal for revenue managers’ daily stand‑ups.

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load & basic cleaning

df = pd.read_csv("hotel_bookings.csv")   # rename file after download
# keep only confirmed stays (not cancelled)
df = df[df['is_canceled'] == 0].copy()

3. Construct an arrival‑date column

df['arrival_date'] = pd.to_datetime(
        df['arrival_date_year'].astype(str)   + '-' +
        df['arrival_date_month']              + '-' +
        df['arrival_date_day_of_month'].astype(str),
        format="%Y-%B-%d")
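
To see what the date assembly does, here is a minimal sketch on a toy frame. The three column names match the dataset, but the two rows are invented for illustration. It assumes an English locale, since %B parses full month names such as "July".

```python
import pandas as pd

# Toy rows mimicking the dataset's three date columns (values are invented)
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1, 29],
})

# Build "2015-July-1"-style strings, then parse them with an explicit format
toy["arrival_date"] = pd.to_datetime(
    toy["arrival_date_year"].astype(str) + "-"
    + toy["arrival_date_month"] + "-"
    + toy["arrival_date_day_of_month"].astype(str),
    format="%Y-%B-%d",
)

print(toy["arrival_date"].dt.strftime("%Y-%m-%d").tolist())
# ['2015-07-01', '2016-02-29']
```

Passing an explicit format makes the parse fail loudly on an unexpected month spelling instead of guessing.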

4. Aggregate to daily hotel‑level demand

Collapsing tens of thousands of booking rows into daily summaries smooths out booking‑level noise and aligns the prediction target (day‑level occupancy) with typical operational decisions.

daily = (df
         .groupby(['hotel', 'arrival_date'])
         .agg(bookings=('hotel', 'size'),
              lead_time_mean=('lead_time', 'mean'),
              adr_mean=('adr', 'mean'),
              week_nights=('stays_in_week_nights', 'sum'),
              weekend_nights=('stays_in_weekend_nights', 'sum'),
              special_req_avg=('total_of_special_requests', 'mean'))
         .reset_index())
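
The named-aggregation pattern above may be easier to follow on a tiny example. The sketch below uses three invented bookings and only two of the summary columns; toy and toy_daily are illustrative names, not part of the tutorial's data.

```python
import pandas as pd

# Three invented bookings: two for the city hotel, one for the resort, same date
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "arrival_date": pd.to_datetime(["2016-03-01", "2016-03-01", "2016-03-01"]),
    "lead_time": [10, 30, 5],
    "adr": [100.0, 80.0, 120.0],
})

# One output row per (hotel, arrival_date); each keyword names a new column
toy_daily = (toy
             .groupby(["hotel", "arrival_date"])
             .agg(bookings=("lead_time", "size"),
                  lead_time_mean=("lead_time", "mean"),
                  adr_mean=("adr", "mean"))
             .reset_index())

print(toy_daily)
```

The city hotel row ends up with 2 bookings, a mean lead time of 20 days, and a mean ADR of 90.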

5. Compute a proxy capacity & occupancy rate

# proxy: each property's busiest arrival day stands in for its capacity
capacity = daily.groupby('hotel')['bookings'].transform('max')
daily['occupancy_rate'] = daily['bookings'] / capacity
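
A small sketch of how this proxy behaves, using invented daily booking counts. Because each hotel is divided by its own maximum, the busiest day of every property scores exactly 1.0, so the target is occupancy relative to the historical peak rather than relative to a true room count.

```python
import pandas as pd

# Invented daily booking counts for two properties
toy_daily = pd.DataFrame({
    "hotel": ["City Hotel"] * 3 + ["Resort Hotel"] * 2,
    "bookings": [50, 100, 75, 20, 40],
})

# transform('max') broadcasts each hotel's peak back onto its own rows
capacity = toy_daily.groupby("hotel")["bookings"].transform("max")
toy_daily["occupancy_rate"] = toy_daily["bookings"] / capacity

print(toy_daily)
```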

6. Feature engineering

month, dayofweek, and is_weekend capture calendar seasonality; lead_time_mean warns of last‑minute demand spikes or lulls; adr_mean often signals price‑sensitive dips.

daily['month']      = daily['arrival_date'].dt.month
daily['dayofweek']  = daily['arrival_date'].dt.dayofweek
daily['is_weekend'] = daily['dayofweek'].isin([5,6]).astype(int)

features   = ['hotel', 'month', 'dayofweek', 'is_weekend',
              'lead_time_mean', 'adr_mean',
              'week_nights', 'weekend_nights',
              'special_req_avg']
target     = 'occupancy_rate'

X = daily[features]
y = daily[target]
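
As a quick check on the calendar features, the sketch below runs the same expressions on four invented dates. pandas numbers weekdays from Monday = 0 to Sunday = 6, which is why [5, 6] selects Saturday and Sunday.

```python
import pandas as pd

# Friday 1 July 2016 through Monday 4 July 2016
toy = pd.DataFrame({
    "arrival_date": pd.to_datetime(
        ["2016-07-01", "2016-07-02", "2016-07-03", "2016-07-04"])
})

toy["month"] = toy["arrival_date"].dt.month
toy["dayofweek"] = toy["arrival_date"].dt.dayofweek        # Monday = 0
toy["is_weekend"] = toy["dayofweek"].isin([5, 6]).astype(int)

print(toy)
```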

7. Pre‑processing & model pipeline

ohe = OneHotEncoder(handle_unknown='ignore')

preproc = ColumnTransformer([
        ('hotel_flag', ohe, ['hotel'])     # encode city / resort
    ], remainder='passthrough')

linreg  = LinearRegression()

pipe = Pipeline(steps=[('prep', preproc),
                      ('model', linreg)])

8. Train‑test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
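
The shuffled split above mixes dates from the whole period into both sets. Since the stated goal is forecasting, a chronological hold-out may be a useful complementary check. The sketch below is an alternative, not the split used in this tutorial; it uses ten invented days, and train_part and test_part are illustrative names.

```python
import pandas as pd

# Ten invented consecutive days with made-up occupancy rates
toy_daily = pd.DataFrame({
    "arrival_date": pd.date_range("2016-01-01", periods=10, freq="D"),
    "occupancy_rate": [0.5, 0.6, 0.55, 0.7, 0.8, 0.75, 0.6, 0.65, 0.9, 0.85],
})

# Sort by date, train on the earliest 80 %, test on the most recent 20 %
toy_daily = toy_daily.sort_values("arrival_date")
cutoff = int(len(toy_daily) * 0.8)
train_part = toy_daily.iloc[:cutoff]
test_part = toy_daily.iloc[cutoff:]

print(train_part["arrival_date"].max(), "<", test_part["arrival_date"].min())
```

Every test date then falls after every training date, which is closer to how the model would be used in practice.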

9. Performance metrics

R² reveals how much day‑to‑day variance we explain; MAE in plain occupancy points tells managers the average prediction miss—e.g., ±0.04 means we’re usually within four percentage points.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.3f} (absolute occupancy points)")
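
For readers who want to see what the two metrics compute, the sketch below reproduces them by hand on four invented values and checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Invented actual and predicted occupancy rates
y_true = np.array([0.50, 0.60, 0.70, 0.80])
y_hat = np.array([0.55, 0.55, 0.75, 0.75])

# MAE: average absolute miss, in occupancy points
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual {mae_manual:.3f} | sklearn {mean_absolute_error(y_true, y_hat):.3f}")
print(f"R²  manual {r2_manual:.3f} | sklearn {r2_score(y_true, y_hat):.3f}")
```

Here every prediction misses by 0.05, so the MAE is 0.05 (five occupancy points) and R² comes out at 0.8.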

10. Inspect influential features

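The coefficient table shows the sign and size of each weight. Positive weights mark features that push predicted occupancy up as they increase (e.g., is_weekend when weekends are busier), while negative weights mark drag factors (e.g., adr_mean when higher rates coincide with emptier days, or long lead‑times during low season). Each weight is expressed per unit of its feature, so compare magnitudes with the feature's scale in mind.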

# Get column names post‑encoding
encoded_hotels = pipe.named_steps['prep']\
                     .named_transformers_['hotel_flag']\
                     .get_feature_names_out(['hotel'])
all_feats = list(encoded_hotels) + \
            ['month','dayofweek','is_weekend',
             'lead_time_mean','adr_mean',
             'week_nights','weekend_nights','special_req_avg']

coef_series = pd.Series(pipe.named_steps['model'].coef_, index=all_feats)\
                 .sort_values()

print("\nFactors decreasing expected occupancy:")
print(coef_series.head(6))

print("\nFactors increasing expected occupancy:")
print(coef_series.tail(6))
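
To show how signs and sizes should be read, the sketch below fits a linear regression on synthetic data whose true weights are known. The data-generating formula is an assumption made purely for illustration and says nothing about real hotel demand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features: a 0/1 weekend flag and a lead time between 0 and 100 days
toy_X = pd.DataFrame({
    "is_weekend": rng.integers(0, 2, size=200),
    "lead_time_mean": rng.uniform(0, 100, size=200),
})

# Noise-free target with known weights: +0.10 for weekends, -0.002 per lead-time day
toy_y = 0.5 + 0.10 * toy_X["is_weekend"] - 0.002 * toy_X["lead_time_mean"]

toy_model = LinearRegression().fit(toy_X, toy_y)
toy_coefs = pd.Series(toy_model.coef_, index=toy_X.columns).sort_values()

print(toy_coefs)
```

The fit recovers both weights. The lead-time weight looks tiny only because it is per day: across a 100-day range it moves the prediction by 0.2, twice the weekend effect.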

11. Model persistence

Saving the pipeline with joblib freezes both the encoder and the regression weights, ensuring tomorrow’s batch forecast processes the same feature columns identically.

joblib.dump(pipe, "hotel_occupancy_linreg.pkl")
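
A round trip can confirm that a reloaded pipeline predicts exactly like the original. The sketch below uses a four-row toy pipeline and a temporary file; toy_pipe, restored, and the file name are illustrative, not the tutorial's artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented training rows, just enough to fit a pipeline
toy_X = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
    "month": [1, 1, 7, 7],
})
toy_y = [0.4, 0.3, 0.9, 0.8]

toy_pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("hotel_flag", OneHotEncoder(handle_unknown="ignore"), ["hotel"])],
        remainder="passthrough")),
    ("model", LinearRegression()),
]).fit(toy_X, toy_y)

# Save to a temporary directory, then load the pipeline back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.pkl")
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

# The restored pipeline accepts the same raw columns as the original
new_day = pd.DataFrame({"hotel": ["City Hotel"], "month": [7]})
same = np.allclose(toy_pipe.predict(new_day), restored.predict(new_day))
print(same)
```

Two practical cautions: joblib files are pickle-based, so load them only from sources you trust, and load them with the same scikit-learn version that wrote them where possible.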

Summary

In fewer than 80 lines of Python, we turned raw booking logs into an explainable, day‑ahead occupancy‑rate predictor. The linear model delivers two immediate wins:

Actionable forecasts to fine‑tune rate‑plans and staffing.
Crystal‑clear coefficients that reveal which knobs—price, season, booking window—move the occupancy needle.

Keep this baseline as your yardstick; when you graduate to regularised regression, gradient‑boosted trees, or recurrent networks, you’ll know exactly how much extra accuracy the added complexity buys.
