Resort Occupancy Growth Prediction with Polynomial Regression in ML
Resort operators and revenue-management teams need to forecast week-over-week occupancy growth (%) for room inventory before mid-week rate adjustments, using only early-week metrics: prior occupancy, average lead time, booking pace (daily arrivals), promotional status, and seasonal indicators. Empirical patterns show nonlinear dynamics: small increases in lead time can sharply boost occupancy during off-peak periods but yield diminishing returns near full capacity, and promotions interact with seasonality in complex ways. A plain linear model underfits these curved responses, while a naïve high-degree polynomial overfits noise. Applying Polynomial Regression to engineered features with Ridge regularisation aims to capture smooth, interpretable growth curves and to produce occupancy-growth forecasts that can inform rate decisions.
Dataset
Step-by-Step Code Implementation
1. Libraries Required
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
2. Import Libraries & Load Data
import pandas as pd
# Load and filter for resort hotels only
df = pd.read_csv("data/hotel_bookings.csv")
df = df[df['hotel'] == 'Resort Hotel']
3. Feature Engineering & Exploratory Analysis
Data aggregation: group daily bookings to weekly totals for resort hotels; compute prior‑week bookings and growth_pct as our target.
import seaborn as sns
import matplotlib.pyplot as plt
# Build the arrival date, then derive the ISO year and week number
# (the explicit format assumes full English month names such as "July")
df['arrival_date'] = pd.to_datetime(
    df['arrival_date_year'].astype(str) + '-'
    + df['arrival_date_month'] + '-'
    + df['arrival_date_day_of_month'].astype(str),
    format='%Y-%B-%d'
)
iso = df['arrival_date'].dt.isocalendar()
df['iso_year'] = iso.year
df['week'] = iso.week
# Group to weekly occupancy and booking metrics; keying on year and week
# keeps the same week number in different years from being merged
weekly = (df.groupby(['iso_year', 'week'])
    .agg({
        'is_canceled': 'count',          # total bookings
        'lead_time': 'mean',             # avg lead time
        'arrival_date': 'count',         # proxy arrivals
        'arrival_date_month': 'first',   # season
        'is_repeated_guest': 'mean'      # repeat %
    })
    .rename(columns={'is_canceled': 'bookings'})
    .reset_index())
# Compute prior-week occupancy growth
weekly['bookings_prev'] = weekly['bookings'].shift(1)
weekly['growth_pct'] = ((weekly['bookings'] - weekly['bookings_prev'])
/ weekly['bookings_prev']) * 100
weekly.dropna(subset=['growth_pct'], inplace=True)
# Visualize nonlinear trend: lead time vs growth
sns.scatterplot(x='lead_time', y='growth_pct', data=weekly, alpha=0.6)
plt.title("Lead Time vs Occupancy Growth")
plt.xlabel("Average Lead Time (days)")
plt.ylabel("Growth (%)")
plt.show()
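To make the target concrete, here is a minimal sketch of the growth calculation on a toy series. The booking counts are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical weekly booking counts (illustrative values only)
toy = pd.DataFrame({'bookings': [100, 120, 90, 99]})

# Previous week's bookings, aligned to the current row
toy['bookings_prev'] = toy['bookings'].shift(1)

# Week-over-week growth in percent
toy['growth_pct'] = (
    (toy['bookings'] - toy['bookings_prev']) / toy['bookings_prev'] * 100
)

# The first week has no predecessor, so its growth is undefined and is dropped
toy = toy.dropna(subset=['growth_pct'])
print(toy)
```

Going from 100 to 120 bookings is a 20% rise, 120 to 90 is a 25% drop, and 90 to 99 is a 10% rise, so three of the four rows survive as training examples.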
4. Define Features & Target
Feature matrix: includes bookings_prev, lead_time, is_repeated_guest, plus one‑hot seasonal dummies for months.
# One‑hot encode month
weekly = pd.get_dummies(weekly, columns=['arrival_date_month'], drop_first=True)
feature_cols = (
    ['bookings_prev', 'lead_time', 'is_repeated_guest'] +
    [c for c in weekly.columns if c.startswith('arrival_date_month_')]
)
X = weekly[feature_cols]
y = weekly['growth_pct']
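The effect of drop_first=True is easy to miss, so the following sketch shows it on a three-row toy frame. The values are made up; only the column name arrival_date_month matches the tutorial.

```python
import pandas as pd

# Hypothetical rows, one per week (illustrative values only)
toy = pd.DataFrame({
    'lead_time': [30.0, 45.0, 60.0],
    'arrival_date_month': ['August', 'July', 'September'],
})

# One-hot encode the month; drop_first removes the alphabetically first level
encoded = pd.get_dummies(toy, columns=['arrival_date_month'], drop_first=True)

month_cols = [c for c in encoded.columns
              if c.startswith('arrival_date_month_')]
print(month_cols)
```

Here August is dropped and becomes the implicit baseline, leaving dummy columns for July and September. On the real data the baseline would likewise be whichever month name sorts first alphabetically.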
5. Build Polynomial Regression Pipeline
- StandardScaler z-scores inputs so the ℓ² penalty acts on comparably scaled terms rather than favouring features with large raw units.
- PolynomialFeatures generates squares and interactions (e.g. lead_time², bookings_prev×is_repeated_guest) to model saturation and synergy effects.
- Ridge regression (ℓ²) shrinks noisy high‑order coefficients, preventing overfitting in the expanded feature space.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge(random_state=42))
])
6. Train/Test Split & Hyperparameter Search
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
gs.fit(X_train, y_train)
print("Best params:", gs.best_params_)
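Beyond best_params_, it can help to look at how close the runner-up settings were. The sketch below runs the same kind of search on synthetic data (a noisy quadratic, invented here so the block is self-contained) and tabulates the cross-validated RMSE per combination; the names x_syn, y_syn, and search are placeholders, not part of the tutorial's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: y is quadratic in x plus noise (illustrative only)
rng = np.random.default_rng(0)
x_syn = rng.uniform(-3, 3, size=(120, 1))
y_syn = x_syn[:, 0] ** 2 + rng.normal(scale=0.5, size=120)

pipe_syn = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge()),
])
grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7),
}
search = GridSearchCV(pipe_syn, grid, cv=5,
                      scoring='neg_root_mean_squared_error')
search.fit(x_syn, y_syn)

# The scorer is negated RMSE, so flip the sign to read it as an error
results = pd.DataFrame(search.cv_results_)
results['cv_rmse'] = -results['mean_test_score']
summary = (results[['param_poly__degree', 'param_ridge__alpha', 'cv_rmse']]
           .sort_values('cv_rmse')
           .head(5))
print(summary)
```

If several combinations sit within noise of each other, preferring the lower degree or the larger α is a reasonable tie-breaker. The same cv_results_ inspection applies to the gs object fitted above.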
7. Evaluate Model
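GridSearchCV: tunes polynomial degree (1-3) and regularisation strength α (10⁻³ to 10³) via 5-fold CV, selecting the combination with the lowest cross-validated RMSE on occupancy-growth predictions. It then refits that combination on the full training set, so the calls below score the best model on the held-out test weeks.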
y_pred = gs.predict(X_test)
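# squared=False is deprecated in recent scikit-learn releases, so take the root explicitly
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f} percentage points of growth")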
print(f"Test R² : {r2:.3f}")
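As a reference for reading these two numbers, the sketch below computes RMSE and R² by hand on four made-up growth values and compares them with a predict-the-mean baseline. The arrays are illustrative and unrelated to the model's actual output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted growth, in percent (illustrative only)
y_true = np.array([2.0, -1.0, 4.0, 0.0])
y_hat = np.array([1.0, 0.0, 3.0, 1.0])

# Library versions
rmse_toy = np.sqrt(mean_squared_error(y_true, y_hat))
r2_toy = r2_score(y_true, y_hat)

# Manual versions: R² = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Baseline: always predict the mean growth
baseline_rmse = np.sqrt(np.mean((y_true - y_true.mean()) ** 2))

print(f"RMSE: {rmse_toy:.3f}  baseline RMSE: {baseline_rmse:.3f}")
print(f"R2: {r2_toy:.3f}  manual R2: {r2_manual:.3f}")
```

Every prediction here is off by exactly one point, so the RMSE is 1. An R² near zero would mean the model does no better than the mean baseline, which is worth checking before reading anything into the coefficients.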
8. Inspect Key Polynomial Coefficients
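Coefficient inspection: ranks the polynomial terms with the largest coefficient magnitudes (e.g. lead time squared, or past bookings × repeat rate), pointing revenue managers to the feature combinations most strongly associated with growth. Because inputs are standardised before expansion, magnitudes are roughly comparable across terms, but they describe associations in the fitted model, not causal effects.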
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=feature_cols)
coefs = gs.best_estimator_.named_steps['ridge'].coef_
import pandas as pd
imp = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False).head(10)
plt.figure(figsize=(8,5))
imp.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Occupancy Growth")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
Summary
This Polynomial Regression approach with Ridge regularisation:
- Models nonlinear occupancy growth, capturing lead-time and seasonal curvature through squared and interaction terms.
- Controls complexity through α tuning, avoiding spurious high-order effects.
- Yields interpretable insights: the top polynomial features point to candidate levers, such as lead-time thresholds and repeat-guest interactions, that can inform dynamic pricing and marketing decisions.