Concert Ticket Sales Growth Prediction with Polynomial Regression in ML
Live‑event promoters and venue operators need to forecast year‑over‑year growth in concert ticket sales (%) to inform budgeting and capacity planning before setting next season’s tour dates. Historical data show that annual ticket volumes depend nonlinearly on prior‑year sales (momentum or saturation), average ticket price (price elasticity), and total box‑office revenue (market demand), with diminishing returns and threshold effects. A simple linear model underfits these curves, while an unconstrained polynomial overfits noise in year‑to‑year fluctuations. By applying Polynomial Regression to engineered features with Ridge (ℓ2) regularisation, we can model smooth growth dynamics and deliver reliable, interpretable forecasts for strategic decision‑making.
Libraries Required
import pandas as pd                 # data loading & handling
import numpy as np                  # numerical operations
import matplotlib.pyplot as plt     # plotting
import seaborn as sns               # visualization
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
Dataset
Step-by-Step Code Implementation
Load Data & Initial Inspection
import pandas as pd
# Load the CSV
df = pd.read_csv("data/AnnualTicketSales.csv")
# Preview relevant columns
df.head()[['YEAR','TICKETS SOLD','TOTAL BOX OFFICE','AVERAGE TICKET PRICE']]
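Exported sales tables often store numbers as text with thousands separators or currency symbols. Whether this file does is an assumption, not something the article states. If it does, the columns need converting before any arithmetic. The sketch below uses a hypothetical helper, to_numeric_clean, on a tiny made-up frame that stands in for the real CSV.

```python
import pandas as pd

def to_numeric_clean(series: pd.Series) -> pd.Series:
    """Strip '$', ',' and whitespace, then convert to a numeric dtype (hypothetical helper)."""
    cleaned = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    # Anything that still cannot be parsed becomes NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up rows for illustration only; not taken from the real dataset
demo = pd.DataFrame({
    "YEAR": [2019, 2020],
    "TICKETS SOLD": ["1,200,000", "900,000"],
    "AVERAGE TICKET PRICE": ["$9.50", "$10.00"],
})

for col in ["TICKETS SOLD", "AVERAGE TICKET PRICE"]:
    demo[col] = to_numeric_clean(demo[col])

print(demo.dtypes)
```

If df.dtypes already reports numeric columns after loading, this step can be skipped.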
Feature Engineering & Target Definition
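We define the target, Growth_Pct, as the percentage change in TICKETS SOLD from the previous year, and use three features:
1. Tickets_Prev: prior-year tickets sold, capturing momentum or saturation.
2. TOTAL BOX OFFICE: overall market demand.
3. AVERAGE TICKET PRICE: price elasticity effects.
Note that TOTAL BOX OFFICE and AVERAGE TICKET PRICE are taken from the same year as the target. Because box-office revenue is roughly tickets sold times average price, these two columns largely determine the current year's ticket count, so a genuine forecast made before the season starts would need their prior-year (lagged) values instead.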
# Compute year-over-year growth in tickets sold
df = df.sort_values('YEAR').reset_index(drop=True)
df['Tickets_Prev'] = df['TICKETS SOLD'].shift(1)
df['Growth_Pct'] = (df['TICKETS SOLD'] - df['Tickets_Prev']) / df['Tickets_Prev'] * 100
# Drop the first year (NaN growth)
df = df.dropna(subset=['Growth_Pct'])
# Define features and target
X = df[['Tickets_Prev','TOTAL BOX OFFICE','AVERAGE TICKET PRICE']]
y = df['Growth_Pct']
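If the forecast has to be made before the season begins, only prior-year information is available. One possible variant, sketched here on made-up numbers, lags every predictor by one year. The column names BoxOffice_Prev and AvgPrice_Prev are hypothetical and do not appear in the original code.

```python
import pandas as pd

# Made-up numbers purely for illustration
demo = pd.DataFrame({
    "YEAR": [2015, 2016, 2017, 2018],
    "TICKETS SOLD": [100.0, 110.0, 99.0, 120.0],
    "TOTAL BOX OFFICE": [800.0, 935.0, 891.0, 1140.0],
    "AVERAGE TICKET PRICE": [8.0, 8.5, 9.0, 9.5],
}).sort_values("YEAR").reset_index(drop=True)

# Target: year-over-year growth in percent, same formula as above
prev = demo["TICKETS SOLD"].shift(1)
demo["Growth_Pct"] = (demo["TICKETS SOLD"] - prev) / prev * 100

# Lag every predictor by one year so only prior-year information is used
lagged = demo[["TICKETS SOLD", "TOTAL BOX OFFICE", "AVERAGE TICKET PRICE"]].shift(1)
lagged.columns = ["Tickets_Prev", "BoxOffice_Prev", "AvgPrice_Prev"]  # hypothetical names

demo = pd.concat([demo, lagged], axis=1).dropna(subset=["Growth_Pct"])
X_lagged = demo[["Tickets_Prev", "BoxOffice_Prev", "AvgPrice_Prev"]]
y_lagged = demo["Growth_Pct"]
print(X_lagged.join(y_lagged))
```

The rest of the pipeline works unchanged with X_lagged and y_lagged in place of X and y.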
Exploratory Visualization
import seaborn as sns
import matplotlib.pyplot as plt
# Scatter: prior sales vs growth
sns.scatterplot(x='Tickets_Prev', y='Growth_Pct', data=df, alpha=0.7)
plt.title("Prior Year Tickets vs Growth Rate")
plt.xlabel("Tickets Sold (prev year)")
plt.ylabel("Growth (%)")
plt.show()
Build a Polynomial Regression Pipeline
- StandardScaler standardises the raw features before expansion, so the polynomial terms are built from comparably scaled inputs and the Ridge penalty is not dominated by large-magnitude columns such as box-office revenue.
- PolynomialFeatures expands inputs into polynomial and interaction terms, modelling curvature (e.g., diminishing returns on large prior sales) and synergy (e.g., high price × box-office interactions).
- Ridge applies an ℓ2 penalty to shrink noisy high‑order coefficients, which limits overfitting.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipe = Pipeline([
('scale', StandardScaler()), # normalize scales
('poly', PolynomialFeatures(include_bias=False)),
('ridge', Ridge(random_state=42))
])
Train/Test Split & Hyperparameter Search
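We tune two hyperparameters with 5‑fold CV on the training years:
- degree (1 = linear … 3 = cubic) to capture appropriate curvature,
- alpha (10⁻³ … 10³) controlling regularisation strength.
Note that cv=5 uses unshuffled K‑fold for regressors: folds are contiguous blocks of years, but the earlier folds are validated by models trained on later years, so only the outer train/test split is strictly chronological.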
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
# Split chronologically to avoid look‑ahead bias
train_idx = df['YEAR'] < 2018
X_train, X_test = X[train_idx], X[~train_idx]
y_train, y_test = y[train_idx], y[~train_idx]
param_grid = {
'poly__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best params:", gs.best_params_)
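If the inner validation should also respect time order, scikit-learn's TimeSeriesSplit can be passed as cv. This is an alternative to the article's cv=5, not what the code above does. The sketch runs on synthetic, chronologically ordered data that stands in for X_train and y_train; the _demo names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the training years (rows assumed sorted by year)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(24, 3))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=24)

pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])

# Each fold trains on earlier rows and validates on the rows that follow
tscv = TimeSeriesSplit(n_splits=4)
gs_demo = GridSearchCV(
    pipe_demo,
    {"poly__degree": [1, 2, 3], "ridge__alpha": np.logspace(-3, 3, 7)},
    cv=tscv,
    scoring="neg_root_mean_squared_error",
)
gs_demo.fit(X_demo, y_demo)
print("Best params:", gs_demo.best_params_)
```

The trade-off is that early folds train on very few rows, so with short annual series the validation scores can be noisy.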
Evaluate Model
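import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
y_pred = gs.predict(X_test)
# Take the square root manually; the squared=False argument was deprecated in scikit-learn 1.4
rmse = np.sqrt(mean_squared_error(y_test, y_pred))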
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f}% growth")
print(f"Test R² : {r2:.3f}")
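An RMSE is easier to judge against a naive reference. One common choice, sketched here with made-up growth values standing in for y_train and y_test, is to predict the mean training growth for every test year. Because the test period covers only the years from 2018 onward, both RMSE and R² rest on very few points and should be read cautiously.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up growth percentages, for illustration only
y_train_demo = np.array([2.0, -1.0, 3.0, 0.0])
y_test_demo = np.array([1.0, -3.0])

# DummyRegressor ignores the features, so placeholder zeros are enough
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)
base_pred = baseline.predict(np.zeros((len(y_test_demo), 1)))

base_rmse = np.sqrt(mean_squared_error(y_test_demo, base_pred))
print(f"Baseline RMSE: {base_rmse:.2f}% growth")
```

A tuned model that does not clearly beat this baseline on the held-out years is not adding forecasting value.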
Inspect Key Polynomial Coefficients
Coefficient inspection reveals which nonlinear terms—such as (Tickets_Prev)² or Tickets_Prev × Average Ticket Price—most influence predicted growth, offering interpretable levers for pricing and marketing strategies.
# Get feature names after polynomial expansion
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=X.columns)
coefs = gs.best_estimator_.named_steps['ridge'].coef_
import pandas as pd
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)
# Plot top 10
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Sales Growth")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
Summary
By integrating polynomial feature engineering with Ridge regularisation in a concise pipeline, we achieve:
1. Nonlinear forecasts of ticket‑sales growth whose accuracy is checked on held‑out years (test RMSE and R²).
2. Controlled model complexity, avoiding overfitting to year‑to‑year noise.
3. Actionable insights: the top polynomial features identify key dynamics—such as momentum thresholds and price × demand interactions—guiding data‑driven