TV Production Budget Prediction with Polynomial Regression in ML
Broadcasters and streaming platforms need to estimate a new series’ per‑episode production budget (USD) early in development, using only high‑level attributes: genre, runtime, number of prominent cast members, director experience, writer experience, number of shooting locations, and intended release month. The relationship between these factors and cost is nonlinear: each additional shooting day or cast member changes cost at a diminishing rate, genre premiums vary non‑uniformly, and release timing can amplify costs. A simple linear model underfits this curvature, while an unregularised high‑degree polynomial overfits. Applying Polynomial Regression to engineered features with Ridge regularisation lets us capture smooth, nonlinear effects and produce interpretable budget forecasts to guide green‑lighting decisions.
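To make the underfitting claim concrete, here is a minimal sketch on synthetic data. The square-root cost curve, the noise level, and the helper name `holdout_rmse` are all assumptions made for illustration; none of them come from the budget dataset. It compares a straight-line fit with a degree-2 polynomial fit that uses Ridge, and covers only the underfitting half of the argument.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): cost grows like sqrt(x), i.e. with diminishing returns.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300).reshape(-1, 1)
y = 10 * np.sqrt(x).ravel() + rng.normal(0, 0.5, size=300)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)


def holdout_rmse(model):
    """Fit on the training split and return RMSE on the held-out split."""
    model.fit(x_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(x_te))))


rmse_by_model = {
    "linear": holdout_rmse(LinearRegression()),
    "degree-2 + ridge": holdout_rmse(
        make_pipeline(
            StandardScaler(),
            PolynomialFeatures(degree=2, include_bias=False),
            Ridge(alpha=1.0),
        )
    ),
}

for name, rmse in rmse_by_model.items():
    print(f"{name:>18}: held-out RMSE = {rmse:.2f}")
```

On a curve of this shape the straight line should show the larger held-out error, because it cannot bend to follow the flattening trend.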
Dataset
Step-by-Step Code Implementation
1. Libraries Required
import pandas as pd               # data loading and manipulation
import numpy as np                # numerical operations
import matplotlib.pyplot as plt   # plotting
import seaborn as sns             # enhanced visualization
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
2. Load Data & Inspect
import pandas as pd
# Load the budget dataset
df = pd.read_csv("movie_budgets.csv")
# Preview columns (rename or map to TV context later)
df.head()[['movie','production_budget','domestic_gross',
'runtime','release_date']]
3. Feature Engineering
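Feature proxies: We map movie attributes to a TV series context: runtime stands in for episode length, and cast_strength (quintile bins of domestic box‑office gross) stands in for star power. num_cast is derived from cast_strength, and num_locations is generated from the budget in millions of dollars. Both domestic_gross and num_locations carry information that would not be available before production (the latter is computed from the target), so treat these proxies as illustrative only.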
# Convert release date to month, as a categorical cost driver
df['release_month'] = pd.to_datetime(df['release_date']).dt.month
# Proxy features for TV series:
# - 'runtime' as per‑episode length
# - quintile of 'domestic_gross' (1–5) as a proxy for cast/director appeal
# - 'num_cast' derived from cast_strength; 'num_locations' from budget in $M
df['cast_strength'] = pd.qcut(df['domestic_gross'], 5, labels=False) + 1
# Caution: computed from the target itself, so this feature leaks the budget
df['num_locations'] = (df['production_budget'] / 1e6).clip(1, 20).astype(int)
df['num_cast'] = (df['cast_strength'] * 3)  # proxy: each strength point ≈ 3 leads
# Select features and target
X = df[['runtime','cast_strength','num_cast','num_locations','release_month']]
y = df['production_budget']
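To see what the binning and clipping steps produce, the sketch below applies the same two transformations to a small toy frame. The values in `toy` are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical toy values, chosen only to make the binning easy to follow.
toy = pd.DataFrame({
    "domestic_gross": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "production_budget": [0.5e6, 2.2e6, 7.9e6, 12e6, 19.9e6,
                          20e6, 45e6, 80e6, 150e6, 300e6],
})

# Quintile rank: qcut returns bin indices 0..4, so +1 gives a 1..5 score.
toy["cast_strength"] = pd.qcut(toy["domestic_gross"], 5, labels=False) + 1

# Budget in millions, clamped to [1, 20] and truncated to an integer.
toy["num_locations"] = (toy["production_budget"] / 1e6).clip(1, 20).astype(int)

print(toy)
```

Note that every budget of $20M or more collapses to the same num_locations value of 20, so this proxy carries no information above the cap.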
4. Visualise Nonlinear Patterns
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='num_locations', y='production_budget',
data=df, alpha=0.6)
plt.title("Locations vs Budget")
plt.xlabel("Number of Shooting Locations")
plt.ylabel("Production Budget (USD)")
plt.show()
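A scatter plot can be backed up with a quick numeric check. One simple heuristic, sketched here on synthetic noise-free data (an assumption, not the budget dataset), is to compare the Pearson correlation with a rank-based (Spearman-style) correlation: a rank correlation that is clearly higher suggests a monotonic but curved relationship.

```python
import numpy as np
import pandas as pd

# Synthetic, noise-free example (assumption): y is monotonic in x but strongly curved.
x = pd.Series(np.arange(1, 51), dtype=float)
y = x ** 3

pearson = x.corr(y)                 # strength of the *linear* association
spearman = x.rank().corr(y.rank())  # Pearson on ranks, i.e. monotonic association

print(f"Pearson : {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

The same two lines can be run on any numeric feature and the target; a gap between the two values is a hint, not proof, that polynomial terms may help.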
5. Build Polynomial Regression Pipeline
- PolynomialFeatures: Expands inputs into powers and interaction terms up to the chosen degree; at degree 2 this means squares and pairwise products (e.g., runtime², runtime×num_locations), capturing diminishing returns and synergies among scale drivers. Because this step follows the preprocessor in the pipeline, the release‑month dummy columns are expanded as well.
- Preprocessing: Numeric features are z‑scaled; release_month is one‑hot encoded to capture seasonality in costs.
- Ridge: ℓ2 regularisation shrinks noisy high‑order coefficients, limiting overfitting in the expanded feature space.
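In equation form, the three components above amount to the following for a degree-2 expansion. The symbols are notation chosen here for illustration: $x_j$ are the preprocessed features, $\beta$ the coefficients, and $\alpha$ the Ridge penalty strength.

```latex
% Degree-2 polynomial model over p preprocessed features
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
          + \sum_{j=1}^{p} \sum_{k=j}^{p} \beta_{jk}\, x_{ij} x_{ik}

% Ridge objective: squared error plus an L2 penalty on all coefficients
% except the intercept \beta_0
\min_{\beta} \; \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)^2
  + \alpha \Bigl( \sum_{j} \beta_j^2 + \sum_{j \le k} \beta_{jk}^2 \Bigr)
```

Larger $\alpha$ pulls the coefficients toward zero, which is why the numeric inputs are put on a common scale first: the penalty would otherwise depend on each feature's units.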
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
# Categorical: release_month; Numeric: the rest
cat_cols = ['release_month']
num_cols = ['runtime','cast_strength','num_cast','num_locations']
preprocessor = ColumnTransformer([
('num', StandardScaler(), num_cols),
('cat', OneHotEncoder(drop='first'), cat_cols)
])
pipe = Pipeline([
('prep', preprocessor),
('poly', PolynomialFeatures(include_bias=False)),
('ridge', Ridge(random_state=42))
])
6. Train/Test Split & Hyperparameter Search
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
'poly__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
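Beyond the single best setting, it is often useful to look at how every degree and alpha combination scored. The sketch below runs a smaller grid on synthetic stand-in data (an assumption, so that it runs without the CSV) and tabulates `cv_results_`. The names `demo_pipe`, `demo_gs`, and `cv_table` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): a quadratic signal plus noise.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] ** 2 + 5 * X_demo[:, 1] + rng.normal(0, 2, size=200)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])
demo_grid = {"poly__degree": [1, 2, 3], "ridge__alpha": [0.01, 1.0, 100.0]}
demo_gs = GridSearchCV(demo_pipe, demo_grid, cv=5,
                       scoring="neg_root_mean_squared_error")
demo_gs.fit(X_demo, y_demo)

# Scores are negated RMSE (higher is better), so flip the sign to read them as RMSE.
cv_table = (
    pd.DataFrame(demo_gs.cv_results_)
    .assign(cv_rmse=lambda d: -d["mean_test_score"])
    [["param_poly__degree", "param_ridge__alpha", "cv_rmse", "std_test_score"]]
    .sort_values("cv_rmse")
    .reset_index(drop=True)
)
print(cv_table)
```

If the top few rows have near-identical RMSE, preferring the lower degree or the larger alpha among them is a reasonable way to favour the simpler model.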
7. Evaluate Model
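GridSearchCV: The search in the previous step tunes polynomial degree (1–3) and regularisation strength α (10⁻³ to 10³) via 5‑fold cross‑validation on the training set, selecting the combination with the lowest validation‑fold RMSE. The selected model is now scored on the held‑out 20% test split.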
y_pred = gs.predict(X_test)
# Take the square root explicitly; the squared= argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.0f}")
print(f"Test R² : {r2:.3f}")
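An RMSE in dollars is easier to judge next to a naive reference. A minimal sketch, using synthetic stand-in data (an assumption) and scikit-learn's `DummyRegressor`, compares a fitted model against always predicting the training mean:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); swap in the real train/test split.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 3))
y_demo = 4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7
)


def rmse_of(y_true, y_hat):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_hat)))


baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)  # predicts the train mean
model = Ridge(alpha=1.0).fit(X_tr, y_tr)

baseline_rmse = rmse_of(y_te, baseline.predict(X_te))
model_rmse = rmse_of(y_te, model.predict(X_te))
baseline_r2 = r2_score(y_te, baseline.predict(X_te))

print(f"Baseline RMSE: {baseline_rmse:.2f} (R² = {baseline_r2:.3f})")
print(f"Model RMSE   : {model_rmse:.2f}")
```

A constant prediction can never score above zero R² on the test set, so the gap between the two RMSE values shows how much the features actually contribute.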
8. Interpret Key Polynomial Coefficients
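Interpretation: The largest‑magnitude coefficients (for example num_locations² or cast_strength×runtime) show which terms the fitted model weights most heavily. Because numeric inputs are standardised before expansion, magnitudes are roughly comparable across terms, but they describe associations in this dataset rather than causal effects. The plot shows absolute values only, so the direction of each effect must be read from the signed coefficients.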
# Retrieve feature names after polynomial expansion
poly = gs.best_estimator_.named_steps['poly']
# Construct input feature list post-preprocessing
# Numeric + one-hot release_month dummies
prep = gs.best_estimator_.named_steps['prep']
cat_features = prep.named_transformers_['cat'].get_feature_names_out(cat_cols)
input_features = num_cols + list(cat_features)
feat_names = poly.get_feature_names_out(input_features=input_features)
coefs = gs.best_estimator_.named_steps['ridge'].coef_
import pandas as pd
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Budget")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
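Taking absolute values before sorting discards the direction of each effect. One way to rank by magnitude while keeping the sign is sketched below; the coefficient values are hypothetical and are not results from the model above.

```python
import pandas as pd

# Hypothetical coefficients, for illustration only (not output of the fitted model).
demo_coefs = pd.Series({
    "num_locations": 4.0,
    "num_locations^2": -1.5,
    "runtime cast_strength": 0.8,
    "runtime": -2.5,
})

# Order by magnitude, then look up the original signed values in that order.
order = demo_coefs.abs().sort_values(ascending=False).index
signed_top = demo_coefs.reindex(order)

print(signed_top)
```

Read this way, a positive linear term paired with a negative squared term is the pattern that corresponds to diminishing returns.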
Summary
This Polynomial Regression pipeline offers a transparent approach to predicting per‑episode production budgets:
- Captures nonlinear scale effects (diminishing returns on locations, cast size).
- Integrates seasonality via release‑month encoding; other categorical drivers such as genre can be one‑hot encoded the same way.
- Balances complexity through Ridge regularisation, with degree and α chosen by cross‑validated RMSE, and exposes coefficients that decision‑makers can inspect.