TV Production Budget Prediction with Polynomial Regression in ML

Broadcasters and streaming platforms need to estimate a new series’ per‑episode production budget (USD) early in development, using only high‑level attributes: genre, runtime, number of prominent cast members, director experience, writer experience, number of shooting locations, and intended release month. The relationship between these factors and cost is nonlinear. The marginal cost of each additional shooting day or cast member tends to fall as a production scales, genre premiums vary non‑uniformly, and release timing can amplify costs. A simple linear model underfits these curvatures, while a naïve high‑degree polynomial overfits. Applying Polynomial Regression to engineered features with Ridge regularisation lets us capture smooth, nonlinear effects while keeping the coefficients inspectable, with the aim of producing budget forecasts that can inform green‑lighting decisions. The walkthrough below uses a movie budget dataset as a stand‑in and covers only a subset of these attributes (runtime, cast proxies, locations, and release month).

Dataset

Movie Production Budget

Step-by-Step Code Implementation

1. Libraries Required

import pandas as pd                         # data loading and manipulation  
import numpy as np                          # numerical operations  

import matplotlib.pyplot as plt             # plotting  
import seaborn as sns                       # enhanced visualization  

from sklearn.model_selection import train_test_split, GridSearchCV  
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder  
from sklearn.compose import ColumnTransformer  
from sklearn.linear_model import Ridge  
from sklearn.pipeline import Pipeline  
from sklearn.metrics import mean_squared_error, r2_score

2. Load Data & Inspect

import pandas as pd

# Load the budget dataset
df = pd.read_csv("movie_budgets.csv")

# Preview columns (rename or map to TV context later)
df.head()[['movie','production_budget','domestic_gross',
           'runtime','release_date']]

3. Feature Engineering

Feature proxies: We map movie attributes to a TV series context. runtime stands in for episode length, and cast_strength (quintile bins of domestic box‑office gross) stands in for star power. num_locations and num_cast are generated synthetically.

# Convert release date to month, as a categorical cost driver
df['release_month'] = pd.to_datetime(df['release_date']).dt.month

# Proxy features for TV series:
#   - 'runtime' as per‑episode length
#   - 'cast_strength': quintile (1-5) of 'domestic_gross', a proxy for cast/director appeal
#   - synthetic 'num_locations' and 'num_cast' derived from gross
df['cast_strength'] = pd.qcut(df['domestic_gross'], 5, labels=False) + 1
# Derived from gross rather than from production_budget: a feature computed
# from the target would leak it into the model. The 1e7 divisor is an
# arbitrary illustrative scale.
df['num_locations'] = (df['domestic_gross'] / 1e7).clip(1, 20).astype(int)
df['num_cast'] = df['cast_strength'] * 3  # proxy: each strength point ≈ 3 leads

# Select features and target
X = df[['runtime','cast_strength','num_cast','num_locations','release_month']]
y = df['production_budget']
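
One practical caveat with quantile binning: pd.qcut raises a ValueError when several bin edges coincide, which can happen if many titles share the same gross (for example, zero). The sketch below reproduces that on made-up numbers and shows one possible workaround, ranking the values first so that every bin edge is distinct. Ties are then split by row order, which is arbitrary, so treat this as an option to consider and not as part of the original recipe.

```python
import pandas as pd

# Made-up gross values with many ties at zero
gross = pd.Series([0, 0, 0, 0, 0, 0, 10, 20, 30, 40])

# Direct quantile binning fails because several quintile edges equal 0
try:
    pd.qcut(gross, 5, labels=False)
    failed = False
except ValueError as err:
    failed = True
    print("qcut failed:", err)

# Workaround: bin the ranks instead of the raw values.
# method='first' breaks ties by order of appearance, so edges are unique.
strength = pd.qcut(gross.rank(method="first"), 5, labels=False) + 1
print(strength.tolist())
```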

4. Visualise Nonlinear Patterns

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='num_locations', y='production_budget',
                data=df, alpha=0.6)
plt.title("Locations vs Budget")
plt.xlabel("Number of Shooting Locations")
plt.ylabel("Production Budget (USD)")
plt.show()

5. Build Polynomial Regression Pipeline

PolynomialFeatures: Expands the preprocessed inputs into powers and pairwise interactions (e.g., runtime², runtime×num_locations), capturing diminishing returns and synergies among scale drivers. Because it runs after preprocessing, the one‑hot month columns are expanded as well, so month×feature interactions are included.
Preprocessing: Numeric features are z‑scaled; release_month is one‑hot encoded to capture seasonality in costs.
Ridge: L2 regularisation shrinks noisy high‑order coefficients, limiting overfitting in the expanded feature space.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge

# Categorical: release_month; Numeric: the rest
cat_cols = ['release_month']
num_cols = ['runtime','cast_strength','num_cast','num_locations']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(drop='first'), cat_cols)
])

pipe = Pipeline([
    ('prep', preprocessor),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge(random_state=42))
])

6. Train/Test Split & Hyperparameter Search

from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7)
}

gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print("Best parameters:", gs.best_params_)
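
Beyond best_params_, the fitted search keeps every candidate's score in cv_results_. The sketch below runs a smaller grid on synthetic data (the data, the demo_ names, and the reduced alpha list are all assumptions for illustration) and shows how to turn the negated scores back into RMSE and rank the candidates:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a known quadratic effect
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] ** 2 + 5 * X_demo[:, 1] + rng.normal(0, 5, 200)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])
demo_gs = GridSearchCV(
    demo_pipe,
    {"poly__degree": [1, 2, 3], "ridge__alpha": [0.01, 1.0, 100.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_gs.fit(X_demo, y_demo)

# Scores are negated RMSE, so flip the sign to read them as errors
results = pd.DataFrame(demo_gs.cv_results_)
results["cv_rmse"] = -results["mean_test_score"]
table = (
    results[["param_poly__degree", "param_ridge__alpha", "cv_rmse", "std_test_score"]]
    .sort_values("cv_rmse")
    .reset_index(drop=True)
)
print(table.head())
```

Applied to gs from the step above, the same table shows whether the winning degree and α beat the runners-up by a clear margin or only by an amount within the fold-to-fold spread (std_test_score).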

7. Evaluate Model

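GridSearchCV: The search in step 6 tunes polynomial degree (1–3) and regularisation strength α (10⁻³…10³) via 5‑fold cross‑validation on the training set, selecting the combination with the lowest mean validation RMSE. The held‑out test set is used only here, for the final evaluation.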

y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE; the 'squared' argument is gone in recent scikit-learn
r2   = r2_score(y_test, y_pred)

print(f"Test RMSE: ${rmse:,.0f}")
print(f"Test R²  : {r2:.3f}")
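
An RMSE in dollars is hard to judge on its own. A common sanity check is to compare it against a baseline that always predicts the training mean. The sketch below uses made-up budgets and placeholder features (all _demo names are hypothetical) to show the mechanics:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Made-up budgets; the features are ignored by the mean baseline
y_train_demo = np.array([10e6, 20e6, 30e6, 40e6])
y_test_demo = np.array([15e6, 35e6])
X_train_demo = np.zeros((4, 1))
X_test_demo = np.zeros((2, 1))

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
base_pred = baseline.predict(X_test_demo)

base_rmse = np.sqrt(mean_squared_error(y_test_demo, base_pred))
base_r2 = r2_score(y_test_demo, base_pred)

print(f"Baseline RMSE: ${base_rmse:,.0f}")
print(f"Baseline R²  : {base_r2:.3f}")
```

With the real split, fit the baseline on X_train and y_train and compare base_rmse with the model's rmse; the polynomial pipeline is only useful to the extent that it improves on this number.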

8. Interpret Key Polynomial Coefficients

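Interpretation: The largest coefficients, such as those on num_locations² or cast_strength×runtime, indicate which terms the fitted model weights most heavily. Because the numeric inputs are standardised, their magnitudes are roughly comparable, but the proxy features are correlated (num_cast is an exact multiple of cast_strength), so Ridge spreads weight between them. Treat the ranking as a guide to where to look, not as causal evidence.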

# Retrieve feature names after polynomial expansion
poly = gs.best_estimator_.named_steps['poly']
# Construct input feature list post-preprocessing
# Numeric + one-hot release_month dummies
prep = gs.best_estimator_.named_steps['prep']
cat_features = prep.named_transformers_['cat'].get_feature_names_out(cat_cols)
input_features = num_cols + list(cat_features)

feat_names = poly.get_feature_names_out(input_features=input_features)
coefs = gs.best_estimator_.named_steps['ridge'].coef_

import pandas as pd
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)

plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Budget")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
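
The chart above ranks terms by absolute value, which hides whether a term pushes the prediction up or down. A minimal sketch of ranking by magnitude while keeping the sign follows; the coefficient values are made up for illustration, and coef_demo is a hypothetical stand-in for pd.Series(coefs, index=feat_names).

```python
import pandas as pd

# Made-up coefficients for illustration only
coef_demo = pd.Series({
    "runtime": 0.5,
    "num_locations^2": -2.0,
    "cast_strength runtime": 1.5,
})

# Rank by magnitude but keep the original sign
order = coef_demo.abs().sort_values(ascending=False).index
signed = coef_demo.reindex(order)

for name, value in signed.items():
    direction = "raises" if value > 0 else "lowers"
    print(f"{name:<24s} {value:+.2f}  ({direction} the prediction as the term grows)")
```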

Summary

This Polynomial Regression pipeline provides a transparent, robust approach to predict per‑episode production budgets:

Captures nonlinear scale effects (diminishing returns on locations, cast size).
Integrates seasonality (via release‑month encoding) and other categorical drivers.
Balances complexity through Ridge regularisation, with degree and α chosen by cross‑validated RMSE, and exposes coefficients that decision‑makers can inspect.
