Retail Expansion Cost Prediction using Bayesian Regression in ML
Retail development teams and finance officers must forecast the capital cost of opening a new store before signing leases or placing equipment orders. At that stage they have only early site metrics to work with, such as trade-area population density, average household income, existing competitor count, store footprint (sq ft), and the regional construction-cost index. Expansion costs per location exhibit nonlinear effects (bulk procurement discounts vs. premium finishes in high-income areas) and are subject to uncertainty from material price swings and permit delays. A simple point estimate obscures this risk. The simulated dataset used below covers only part of that list: store footprint, number of stories, fit-out complexity, and a regional cost index. By applying Bayesian Regression, we obtain both:
1. A point estimate of forecasted expansion cost.
2. A credible interval quantifying uncertainty—enabling risk‑aware budgeting and site selection.
Libraries Required
import pandas as pd # data I/O & manipulation
import numpy as np # numerical operations
import matplotlib.pyplot as plt # plotting
import seaborn as sns # visualization
import pymc3 as pm # Bayesian modeling
import arviz as az # posterior analysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
Dataset
Step-by-Step Code Implementation
Data Loading & Feature Engineering
Log-Transformation: We model log(Expansion_Cost) to stabilise variance and enforce positivity. The log is applied later, when the likelihood is defined, so y stays in USD at this stage.
import pandas as pd
# Load simulated buildout data
df = pd.read_csv("data/construction-estimation-data.csv")
# Rename columns to the retail-expansion context:
# Project_Size_sqft -> Store_Sqft, Num_Stories -> Stories,
# Complexity_Index -> FitOut_Complexity, Region_Price_Index -> Region_Cost_Index.
# Material_Cost and Labor_Cost keep their names.
df = df.rename(columns={
    'Project_Size_sqft': 'Store_Sqft',
    'Num_Stories': 'Stories',
    'Complexity_Index': 'FitOut_Complexity',
    'Region_Price_Index': 'Region_Cost_Index',
})
# Compute total cost as proxy for expansion cost
df['Expansion_Cost'] = df['Material_Cost'] + df['Labor_Cost']
# Select features and target
features = ['Store_Sqft','Stories','FitOut_Complexity','Region_Cost_Index']
X = df[features].values
y = df['Expansion_Cost'].values # USD
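To illustrate why the log-transformation helps, here is a minimal sketch on synthetic data (not the tutorial's CSV; the cost per square foot, store sizes, and noise level are all made-up assumptions). When noise is multiplicative, the spread of cost grows with the cost level on the USD scale but stays roughly constant on the log scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stores: same multiplicative noise (sigma = 0.2 in log-space),
# but a 10x difference in typical cost. All numbers are made up.
small = 150.0 * 2_000 * rng.lognormal(0.0, 0.2, size=10_000)
large = 150.0 * 20_000 * rng.lognormal(0.0, 0.2, size=10_000)

raw_ratio = large.std() / small.std()                    # spread grows with cost
log_ratio = np.log(large).std() / np.log(small).std()    # spread is about equal

print(f"std ratio large/small, USD scale: {raw_ratio:.2f}")
print(f"std ratio large/small, log scale: {log_ratio:.2f}")

# Any real-valued prediction of log-cost maps back to a positive cost
log_predictions = np.array([-5.0, 0.0, 12.0])
costs = np.exp(log_predictions)
```

A constant-variance Normal likelihood is therefore a more plausible fit for log-cost than for raw cost, and exponentiating a prediction can never produce a negative budget.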
Train/Test Split & Standardisation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# 80/20 random split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Standardize predictors for stable MCMC sampling
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
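Standardising also changes how the coefficients read. The sketch below mirrors the scaling step in plain NumPy on made-up store sizes, then converts a hypothetical coefficient (`beta_std = 0.35` is an illustrative value, not a fitted result) into more familiar units, assuming a model of log-cost like the one fitted in the next section.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.uniform(1_000, 20_000, size=(200, 1))   # hypothetical Store_Sqft
X_test = rng.uniform(1_000, 20_000, size=(50, 1))

# The statistics StandardScaler stores as mean_ and scale_ (population std)
mean_ = X_train.mean(axis=0)
scale_ = X_train.std(axis=0)

X_train_s = (X_train - mean_) / scale_
X_test_s = (X_test - mean_) / scale_   # reuse training statistics: no leakage

# Hypothetical coefficient on the standardised feature in a log-cost model
beta_std = 0.35
pct_per_sd = np.exp(beta_std) - 1        # relative cost change per +1 SD of sq ft
beta_per_sqft = beta_std / scale_[0]     # change in log-cost per extra sq ft

print(f"+1 SD of floor area -> {pct_per_sd:.1%} higher cost")
print(f"log-cost change per sq ft: {beta_per_sqft:.2e}")
```

Fitting the scaler on the training split only, as the tutorial does, keeps test-set information out of the model inputs.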
Define & Fit Bayesian Regression Model
Priors:
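- α ∼ Normal(mean of log-cost, 2 × SD of log-cost), centred on the empirical mean log-cost.
- β ∼ Normal(0, 1) on standardized features.
- σ ∼ HalfNormal(SD of log-cost) for residual log-cost variation.
Model: log(cost) ∼ Normal(α + β·X_std, σ).
Inference: 2,000 posterior draws per chain after 1,000 tuning steps, with target_accept=0.9 to reduce the risk of divergent transitions. These settings do not guarantee convergence, so check r_hat and the effective sample sizes in the posterior summary.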
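Before fitting, it can help to see what these priors imply on the USD scale. The following is a rough prior predictive check in plain NumPy rather than PyMC3; the log-cost mean and SD (13.0 and 0.6) and the feature vector are hypothetical placeholders, not values taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical summary statistics of log(cost) in the training set
log_y_mean, log_y_std = 13.0, 0.6
n_features, n_sims = 4, 5_000

# Draw parameters from the priors listed above
alpha = rng.normal(log_y_mean, 2 * log_y_std, size=n_sims)
beta = rng.normal(0.0, 1.0, size=(n_sims, n_features))
sigma = np.abs(rng.normal(0.0, log_y_std, size=n_sims))   # HalfNormal draw

# One hypothetical store, described by its standardised features
x_std = np.array([1.0, 0.0, -0.5, 0.5])

# Simulate log-cost from the model, then move to the USD scale
log_cost = rng.normal(alpha + beta @ x_std, sigma)
cost = np.exp(log_cost)

lo, med, hi = np.percentile(cost, [3, 50, 97])
print(f"prior predictive cost: 3% ${lo:,.0f} | median ${med:,.0f} | 97% ${hi:,.0f}")
```

If the simulated range spans several orders of magnitude more than any plausible build-out, that is a hint the priors are looser than necessary; whether to tighten them is a modelling judgment.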
import numpy as np
import pymc3 as pm
# Work in log-space so the prior scales match the likelihood below
log_y_train = np.log(y_train)
with pm.Model() as expansion_model:
    # Priors reflecting broad cost uncertainty (log-cost scale)
    α = pm.Normal("α", mu=log_y_train.mean(), sigma=log_y_train.std()*2)
    β = pm.Normal("β", mu=0, sigma=1, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=log_y_train.std())
    # Linear predictor for the mean log-cost
    μ = α + pm.math.dot(X_train_s, β)
    # Likelihood: observed Expansion_Cost, modelled in log-space
    Y_obs = pm.Normal("Y_obs", mu=μ, sigma=σ, observed=log_y_train)
    # MCMC sampling
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
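Sampler settings alone do not show that the chains agree with each other. As an illustration of what a convergence diagnostic measures, here is a sketch of the classic Gelman-Rubin statistic in plain NumPy, run on synthetic chains. The function name `gelman_rubin` is hypothetical, and this is the textbook formula, not necessarily the exact variant ArviZ reports as r_hat.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic potential scale reduction factor for draws shaped (n_chains, n_draws)."""
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()        # W: average within-chain variance
    between = n_draws * chain_means.var(ddof=1)       # B: variance between chain means
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(var_hat / within))

rng = np.random.default_rng(3)
mixed = rng.normal(0.0, 1.0, size=(4, 2_000))             # four chains that agree
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])    # one chain sits elsewhere

r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
print(f"well-mixed chains: {r_mixed:.3f}")
print(f"one offset chain:  {r_stuck:.3f}")
```

Values close to 1 indicate that the chains are exploring the same distribution; values clearly above 1 suggest the draws should not be trusted yet. With the trace from the model above, `trace.posterior["α"].values` has shape (chain, draw) and could be passed to this function directly.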
Posterior Analysis & Point Predictions
- Posterior Predictive: Sampling in log‑space yields point forecasts and 94% HPD intervals, converted back by exponentiation for interpretation in USD.
- Evaluation: MAE computed on the original cost scale.
import arviz as az
from sklearn.metrics import mean_absolute_error
# Summarize posterior distributions (inspect r_hat and ess_bulk here)
print(az.summary(trace, round_to=2))
# Posterior predictive sampling (log-cost, for the training rows)
with expansion_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["Y_obs"])
# Posterior means of the intercept and coefficients (log-cost scale)
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
# Predict and exponentiate to original scale
log_pred = α_post + X_test_s.dot(β_post)
y_pred = np.exp(log_pred)
# Evaluate MAE on cost scale
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:,.2f}")
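One detail of the back-transformation is worth spelling out: exponentiating the mean log-cost gives a median-type forecast on the USD scale, which sits below the mean cost when the distribution is right-skewed. The sketch below shows the gap using synthetic draws; `mu = 13.0` and `sigma = 0.5` are assumed values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical posterior predictive draws of log-cost for a single store
mu, sigma = 13.0, 0.5
log_draws = rng.normal(mu, sigma, size=200_000)
cost_draws = np.exp(log_draws)

median_style = np.exp(log_draws.mean())         # exp of the mean log-cost
mean_style = cost_draws.mean()                  # mean taken on the USD scale
lognormal_mean = np.exp(mu + sigma**2 / 2)      # closed form for a lognormal

print(f"exp(mean log-cost): ${median_style:,.0f}")
print(f"mean cost:          ${mean_style:,.0f}")
print(f"exp(mu + sigma^2/2): ${lognormal_mean:,.0f}")
```

Which summary to report depends on the use: the median-type forecast is the natural match for an absolute-error metric such as MAE, while a portfolio-level budget that sums many stores is closer to an expected-cost question.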
Visualise Predictions & Credible Intervals
Sweeping Store_Sqft reveals how larger footprints drive cost and the uncertainty around that relationship.
import numpy as np
import matplotlib.pyplot as plt
# Sweep Store_Sqft while holding others at median
grid_vals = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid = np.median(X_train_s, axis=0)[None,:].repeat(100, axis=0)
grid[:,0] = grid_vals
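# Posterior predictive draws on the grid. pm.sample_posterior_predictive would
# reuse the training rows baked into the model, so build the draws directly
# from the posterior samples of α, β and σ.
α_s = trace.posterior["α"].values.ravel()                      # (n_samples,)
β_s = trace.posterior["β"].values.reshape(-1, grid.shape[1])   # (n_samples, n_features)
σ_s = trace.posterior["σ"].values.ravel()                      # (n_samples,)
rng = np.random.default_rng(42)
preds_log = rng.normal(α_s[:, None] + β_s @ grid.T, σ_s[:, None])  # (n_samples, 100)
# Compute mean and 94% credible interval in log-space, then exp()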
mean_log = preds_log.mean(axis=0)
hpd_log = az.hdi(preds_log, hdi_prob=0.94)
mean_cost = np.exp(mean_log)
hpd_lower = np.exp(hpd_log[:,0])
hpd_upper = np.exp(hpd_log[:,1])
# Back‐transform Store_Sqft
sqft_orig = scaler.inverse_transform(grid)[:,0]
plt.figure(figsize=(8,5))
plt.plot(sqft_orig, mean_cost, label="Posterior mean")
plt.fill_between(sqft_orig, hpd_lower, hpd_upper, alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    scaler.inverse_transform(X_test_s)[:,0],
    y_test, color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Store Size (sq ft)")
plt.ylabel("Expansion Cost (USD)")
plt.title("Bayesian Regression: Cost vs. Store Sq ft")
plt.legend()
plt.tight_layout()
plt.show()
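To make the shaded band less of a black box, here is a minimal NumPy sketch of a highest-density interval for a single set of draws. The helper name `hdi_1d` is hypothetical and assumes a unimodal sample; it illustrates the idea and is not ArviZ's implementation. The draws are synthetic.

```python
import numpy as np

def hdi_1d(samples, prob=0.94):
    """Narrowest interval holding about `prob` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples))
    n = x.size
    k = int(np.floor(prob * n))        # index distance spanned by each window
    widths = x[k:] - x[: n - k]        # width of every candidate window
    i = int(np.argmin(widths))         # start of the narrowest window
    return float(x[i]), float(x[i + k])

rng = np.random.default_rng(5)
cost = np.exp(rng.normal(13.0, 0.5, size=50_000))   # hypothetical cost draws (USD)

# Interval computed directly on the USD scale
usd_lo, usd_hi = hdi_1d(cost)

# Interval computed in log-space and exponentiated, as in the plot above
log_lo, log_hi = hdi_1d(np.log(cost))
back_lo, back_hi = np.exp(log_lo), np.exp(log_hi)

print(f"HDI on USD scale:         ${usd_lo:,.0f} to ${usd_hi:,.0f}")
print(f"exp of log-scale HDI:     ${back_lo:,.0f} to ${back_hi:,.0f}")
```

Both intervals contain about 94% of the draws, but the exponentiated log-space interval is not the narrowest one on the USD scale. It remains a valid 94% credible interval; it is simply a different choice of interval.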
Summary
This Bayesian Regression framework for Retail Expansion Cost Prediction provides:
1. Point forecasts of store build-out cost from early site metrics, checked by MAE on held-out data.
2. Credible intervals capturing uncertainty from construction‐price volatility and design complexity.
3. Actionable insights: real‑estate and finance teams can set contingencies, negotiate contractor bids, and select sites with explicit cost‐risk bounds—optimising expansion spend under uncertainty.
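As one possible way to turn predictive draws into a contingency figure, the sketch below budgets each site at a chosen quantile of its cost distribution. The 90th-percentile rule, the $600,000 cap, the function name `budget_summary`, and both sets of draws are assumptions for illustration, not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical posterior predictive cost draws (USD) for two candidate sites
site_a = np.exp(rng.normal(13.00, 0.15, size=20_000))   # pricier, well understood
site_b = np.exp(rng.normal(12.95, 0.40, size=20_000))   # cheaper, more uncertain

def budget_summary(draws, budget_quantile=0.90, cap=600_000):
    """Median forecast, quantile-based budget, contingency %, and P(cost > cap)."""
    base = float(np.median(draws))
    budget = float(np.quantile(draws, budget_quantile))
    return {
        "base": base,
        "budget": budget,
        "contingency_pct": 100 * (budget / base - 1),
        "p_over_cap": float(np.mean(draws > cap)),
    }

a = budget_summary(site_a)
b = budget_summary(site_b)
for name, s in (("A", a), ("B", b)):
    print(f"Site {name}: base ${s['base']:,.0f}, budget ${s['budget']:,.0f}, "
          f"contingency {s['contingency_pct']:.0f}%, P(over cap) {s['p_over_cap']:.1%}")
```

In this made-up comparison the site with the lower median forecast needs the larger budget once uncertainty is priced in, which is the kind of trade-off a point estimate alone would hide.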