Ad Placement Cost Prediction with Lasso Regression in ML
Media buyers juggle dozens of variables—audience traits, creative formats, bid strategy—yet they seldom know beforehand what a single placement on a major ad network will cost. This project builds a Lasso‑regularised linear model that:
- Predicts the placement cost (USD) for a planned ad impression bundle using features such as ad type, campaign objective, target age‑band, gender, industry vertical, day‑part, device, and estimated reach.
- Isolates the handful of drivers that truly inflate or deflate cost, because Lasso’s ℓ¹ penalty shrinks uninformative coefficients to zero—giving planners an immediate, interpretable shortlist for budget optimisation.
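The sparsity claim in the second bullet can be checked on a small synthetic example. The sketch below is not part of the project code: the data, the `alpha=0.5` setting, and all variable names are assumptions chosen for illustration. It fits ordinary least squares and Lasso to ten standardised features, of which only three actually drive the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 10 features, only the first 3 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=500)

# Standardise so the penalty acts on every feature at the same scale.
X_scaled = StandardScaler().fit_transform(X_demo)

# scikit-learn's Lasso minimises (1 / (2 n)) * ||y - Xw||^2 + alpha * ||w||_1.
ols = LinearRegression().fit(X_scaled, y_demo)
lasso = Lasso(alpha=0.5).fit(X_scaled, y_demo)

n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("Coefficients set to zero, OLS:  ", n_zero_ols)
print("Coefficients set to zero, Lasso:", n_zero_lasso)
print("Lasso coefficients:", np.round(lasso.coef_, 2))
```

Least squares assigns a small non-zero weight to every noise feature, whereas the penalised fit sets those weights to exactly zero and shrinks the three real ones slightly. That trade-off is what the α search later in the article tunes.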
Libraries Required
| Purpose | Library |
| --- | --- |
| Data wrangling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| ML workflow | scikit‑learn (Lasso, Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, GridSearchCV) |
| Evaluation | mean_squared_error, r2_score |
Dataset Link
Online Advertising Digital Marketing
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
2. Download and load the dataset
The dataset logs three months of performance data for an e-commerce brand, including audience parameters, creative type, bid strategy, impressions, clicks, conversions, and spend. The last field, spend, serves as the placement cost target.
# One–time command (needs Kaggle API token):
# kaggle datasets download -d naniruddhan/online-advertising-digital-marketing-data -p data --unzip
df = pd.read_csv("data/online_ad_performance.csv") # adjust filename if required
3. Target engineering
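This step assumes the dataset has Spend and Impressions columns and treats Spend as the price of placing the ad block. Spend is removed from the features because it is the target, and Campaign_ID because an identifier carries no predictive signal.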
y = df['Spend']  # placement cost (USD)
X = df.drop(columns=['Spend', 'Campaign_ID'])  # drop leakage / ID
4. Pre‑processing pipeline
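Categorical columns such as Age_Band, Gender, and Ad_Type are one-hot encoded, dropping the first level of each to avoid the dummy-variable trap. Numeric columns like Impressions, Clicks, and CTR are standardised to zero mean and unit variance so that the Lasso penalty treats every feature on an equal footing. Clicks and CTR are only observed after delivery, so a forecast made before launch should restrict X to fields known at planning time.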
cat_cols = X.select_dtypes(include='object').columns
num_cols = X.select_dtypes(exclude='object').columns
preprocess = ColumnTransformer([
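# sparse_output replaces the old sparse argument (renamed in scikit-learn 1.2, removed in 1.4)
('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),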
('num', StandardScaler(), num_cols)
])
5. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=df['Ad_Type'])
6. Build and tune the Lasso pipeline
Wrapping scaling and modelling in a single Pipeline prevents data leakage between folds. A log‑spaced α sweep finds the optimal balance between sparsity and fit, using five‑fold CV for robustness.
pipe = Pipeline([
('prep', preprocess),
('model', Lasso(max_iter=10_000, random_state=42))
])
param_grid = {'model__alpha': np.logspace(-3, 1, 30)} # 0.001 → 10
grid = GridSearchCV(pipe, param_grid, cv=5,
scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)
print("Best α:", grid.best_params_['model__alpha'])
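It can help to look at the whole α sweep rather than only the winner. The following sketch uses synthetic data from `make_regression` and a ten-point grid, both assumptions made to keep the example self-contained, and shows one way to tabulate cross-validated RMSE per α and count the coefficients that survive at the best setting.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ad data (assumption): 15 features, 4 informative.
X_syn, y_syn = make_regression(n_samples=300, n_features=15, n_informative=4,
                               noise=10.0, random_state=42)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(max_iter=10_000)),
])
demo_grid = GridSearchCV(demo_pipe, {"model__alpha": np.logspace(-3, 1, 10)},
                         cv=5, scoring="neg_root_mean_squared_error")
demo_grid.fit(X_syn, y_syn)

# The scorer is negated RMSE, so flip the sign to get RMSE back.
cv_table = pd.DataFrame({
    "alpha": [p["model__alpha"] for p in demo_grid.cv_results_["params"]],
    "cv_rmse": -demo_grid.cv_results_["mean_test_score"],
}).sort_values("alpha")
print(cv_table.round(3).to_string(index=False))

best_coef = demo_grid.best_estimator_.named_steps["model"].coef_
n_kept = int(np.sum(best_coef != 0))
print("Non-zero coefficients at best alpha:", n_kept, "of", best_coef.size)
```

If the best α sits at either end of the grid, widening the `np.logspace` range is a reasonable next step.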
7. Evaluate model
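RMSE expresses the typical prediction error in dollars, a scale media buyers can read directly (large misses weigh more heavily than in a plain average), while R² reports the share of variance in cost that the model explains.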
y_pred = grid.predict(X_test)
# squared=False was removed from mean_squared_error in scikit-learn 1.6, so take the root explicitly
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.2f} | R²: {r2:.3f}")
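For readers who want to see what the two metrics compute, here is a small worked example on four invented cost values (not taken from the dataset). It calculates RMSE and R² by hand and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented actual and predicted costs in USD, for illustration only.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

# RMSE: square the errors, average them, take the root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)            # 4 * 10² = 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 75² + 25² + 25² + 75² = 12500
r2_manual = 1 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print(f"RMSE: ${rmse_manual:,.2f} | R²: {r2_manual:.3f}")
# prints: RMSE: $10.00 | R²: 0.968
```

Every prediction here is off by exactly $10, so RMSE equals 10. With uneven errors, RMSE sits above the mean absolute error because squaring gives the larger misses more weight.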
8. Interpret coefficients
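The coefficient bar plot ranks the settings, for example mobile-only placement, late-night day-part, or broad interest targeting, by how strongly they move cost in either direction. Because numeric features are standardised, each numeric coefficient is the USD change per one standard deviation of that feature; each one-hot coefficient is the USD difference from the dropped reference level. Coefficients shrunk to exactly zero mark features the model found it could do without, which narrows the analyst's focus.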
ohe = grid.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([ohe_names, num_cols])
coef = grid.best_estimator_.named_steps['model'].coef_
imp = (pd.Series(coef, index=feature_names)
.sort_values(key=abs, ascending=False))
plt.figure(figsize=(9,6))
imp.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top Drivers of Ad Placement Cost (Lasso Coefficients)')
plt.xlabel('Coefficient (Δ USD)')
plt.show()
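Alongside the plot, a plain list of which features survived and which were zeroed is often handy. The sketch below works on a hand-written coefficient Series; the feature names and values are hypothetical stand-ins for the `imp` Series built above.

```python
import pandas as pd

# Hypothetical coefficients, standing in for the fitted Lasso output.
demo_coef = pd.Series({
    "Ad_Type_Video": 42.0,
    "Device_Mobile": -0.0,   # Lasso can report eliminated features as -0.0
    "Impressions": 130.5,
    "Gender_M": 0.0,
    "Day_Part_Night": -18.2,
})

# Surviving drivers, largest absolute effect first.
kept = demo_coef[demo_coef != 0].sort_values(key=abs, ascending=False)
# Features the penalty removed from the model.
dropped = demo_coef.index[demo_coef == 0].tolist()

print("Kept drivers:")
print(kept)
print("Zeroed features:", dropped)
```

A zeroed feature is not necessarily irrelevant: when two features are strongly correlated, Lasso tends to keep one and drop the other.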
Summary
In under 120 lines of code, we produced an interpretable, cross‑validated pipeline that forecasts ad placement costs before a campaign goes live and delivers a ranked list of cost drivers. Media planners can plug hypothetical settings into the model, compare projected spend scenarios, and confidently allocate the budget toward the most efficient placements—turning ad buying from guesswork into a data-driven strategy.
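The scenario comparison described in the summary could look roughly like the sketch below. Everything in it is an assumption made so the example runs on its own: the historical data is simulated from an invented cost rule, the column set is reduced to three fields, and `alpha=0.1` is fixed rather than tuned. With the real project, the fitted `grid` object would take the place of `model`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Simulated campaign history (invented cost rule, for illustration only).
rng = np.random.default_rng(42)
n = 400
hist = pd.DataFrame({
    "Ad_Type": rng.choice(["Banner", "Video"], size=n),
    "Device": rng.choice(["Desktop", "Mobile"], size=n),
    "Impressions": rng.integers(1_000, 50_000, size=n),
})
hist["Spend"] = (0.004 * hist["Impressions"]
                 + 60 * (hist["Ad_Type"] == "Video")
                 + 15 * (hist["Device"] == "Mobile")
                 + rng.normal(0, 5, size=n))

prep = ColumnTransformer(
    [("cat", OneHotEncoder(drop="first"), ["Ad_Type", "Device"]),
     ("num", StandardScaler(), ["Impressions"])],
    sparse_threshold=0,
)
model = Pipeline([("prep", prep), ("model", Lasso(alpha=0.1))])
model.fit(hist.drop(columns="Spend"), hist["Spend"])

# Two hypothetical plans that differ only in creative format.
scenarios = pd.DataFrame({
    "Ad_Type": ["Banner", "Video"],
    "Device": ["Mobile", "Mobile"],
    "Impressions": [20_000, 20_000],
})
scenarios["Predicted_Spend"] = model.predict(scenarios)
uplift = scenarios.loc[1, "Predicted_Spend"] - scenarios.loc[0, "Predicted_Spend"]

print(scenarios.round(2))
print(f"Projected extra cost of video over banner: ${uplift:,.2f}")
```

Because the model is linear, the gap between two scenarios that differ in one setting equals that setting's coefficient, which makes side-by-side comparisons easy to explain to planners.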