Construction Material Cost Prediction using Bayesian Regression in ML
Project managers and cost engineers need to predict the total material cost for a new building project—before procurement—using early‑stage indicators such as project floor area, number of stories, structural complexity index, regional price index, and estimated labour cost. Material costs scale nonlinearly with size and complexity (e.g., bulk-order discounts, premium finishes) and are subject to market price volatility. A single point estimate risks budget overruns or overly conservative bids. By applying Bayesian Regression, we obtain both:
1. A point estimate of expected material cost.
2. A credible interval quantifying forecast uncertainty—enabling risk‐aware budgeting and procurement strategy.
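To make the two outputs concrete, here is a minimal sketch of how a set of predictive draws for one project can be reduced to a point estimate and an interval. The draws are synthetic (a made-up Normal distribution standing in for a real posterior predictive), `summarize` is a hypothetical helper, and the interval is a simple equal-tailed one rather than a highest-density interval.

```python
import numpy as np

# Hypothetical predictive draws for one project (USD); in practice these
# would come from the fitted model's posterior predictive distribution.
rng = np.random.default_rng(0)
draws = rng.normal(loc=1_200_000, scale=150_000, size=10_000)

def summarize(draws, prob=0.94):
    """Return the mean and an equal-tailed interval covering `prob` of the draws."""
    tail = (1.0 - prob) / 2.0
    lower, upper = np.quantile(draws, [tail, 1.0 - tail])
    return draws.mean(), lower, upper

point, lower, upper = summarize(draws)
print(f"Expected cost: ${point:,.0f}")
print(f"94% interval:  ${lower:,.0f} to ${upper:,.0f}")
```

For a roughly symmetric distribution the equal-tailed and highest-density intervals nearly coincide; they can differ noticeably when the predictive distribution is skewed.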
Libraries Required
import pandas as pd                      # data loading & handling
import numpy as np                       # numerical operations
import matplotlib.pyplot as plt          # plotting
import seaborn as sns                    # visualization
import pymc3 as pm                       # Bayesian modeling
import arviz as az                       # posterior analysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
Dataset
Step-by-Step Code Implementation
Import Libraries & Load Data
import pandas as pd
# Load dataset
df = pd.read_csv("data/construction-estimation-data.csv")
# Preview key columns
df[['Project_Size_sqft', 'Num_Stories', 'Complexity_Index',
    'Region_Price_Index', 'Labor_Cost', 'Material_Cost']].head()
Preprocessing & Train/Test Split
We z‑score all predictors, using means and standard deviations computed on the training split only, so that the priors on β operate on a common scale across features with different units.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Define predictors and target
feature_cols = [
    'Project_Size_sqft',
    'Num_Stories',
    'Complexity_Index',
    'Region_Price_Index',
    'Labor_Cost'
]
X = df[feature_cols].values
y = df['Material_Cost'].values # USD
# Split data (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Standardize numeric features for stable MCMC
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
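As an illustration of what the scaler does, the sketch below reproduces the z-score by hand with NumPy on made-up toy rows. The helpers `fit_zscore` and `apply_zscore` are hypothetical names, not part of scikit-learn. The point is that the mean and standard deviation come from the training rows only and are then reused unchanged on the test rows.

```python
import numpy as np

def fit_zscore(X_train):
    """Column means and standard deviations from the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
    return mean, std

def apply_zscore(X, mean, std):
    """Standardize X with previously fitted statistics."""
    return (X - mean) / std

# Toy rows: [floor area in sqft, number of stories]; values are made up
X_train_toy = np.array([[10_000.0, 2.0], [20_000.0, 4.0], [30_000.0, 6.0]])
X_test_toy = np.array([[40_000.0, 8.0]])

mean, std = fit_zscore(X_train_toy)
Z_train = apply_zscore(X_train_toy, mean, std)
Z_test = apply_zscore(X_test_toy, mean, std)

print(Z_train.mean(axis=0), Z_train.std(axis=0))  # training columns: mean 0, std 1
print(Z_test)                                     # test row expressed in training units
```

A test project larger than anything in the training set simply gets a z-score beyond the training range, which is a useful warning that the model is extrapolating.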
Define & Fit Bayesian Regression Model
Priors:
- α ∼ Normal(0, 1e5) allows broad intercept shifts for large‐scale USD costs.
- β ∼ Normal(0, 1e4) encodes moderate uncertainty on each predictor’s effect.
- σ ∼ HalfNormal(1e5) permits large residual variability reflecting market volatility.
Model: The linear predictor μ = α + β·X_standardized links project attributes to material cost, with observed costs modelled as Normal(μ, σ).
Inference: We draw 2,000 posterior samples per chain after 1,000 tuning steps, with target_accept=0.9 to reduce the risk of divergent transitions. Posterior predictive sampling then yields full predictive distributions for new projects.
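Written out in full, the priors and likelihood described above amount to the following specification. The notation is an assumption for illustration: z_{ij} denotes the standardized value of predictor j for project i, y_i is the observed material cost, and the second argument of each Normal is a standard deviation, matching the `sigma` arguments in the code.

```latex
\begin{aligned}
\alpha   &\sim \mathcal{N}(0,\ 10^{5}) \\
\beta_j  &\sim \mathcal{N}(0,\ 10^{4}), \qquad j = 1, \dots, 5 \\
\sigma   &\sim \mathrm{HalfNormal}(10^{5}) \\
\mu_i    &= \alpha + \sum_{j=1}^{5} \beta_j \, z_{ij} \\
y_i      &\sim \mathcal{N}(\mu_i,\ \sigma)
\end{aligned}
```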
import pymc3 as pm
with pm.Model() as model:
    # Priors
    α = pm.Normal("α", mu=0, sigma=1e5)
    β = pm.Normal("β", mu=0, sigma=1e4, shape=X_train_s.shape[1])
    σ = pm.HalfNormal("σ", sigma=1e5)
    # Linear predictor
    μ = α + pm.math.dot(X_train_s, β)
    # Likelihood
    MaterialCost = pm.Normal("MaterialCost", mu=μ, sigma=σ, observed=y_train)
    # MCMC sampling
    trace = pm.sample(
        draws=2000,
        tune=1000,
        target_accept=0.9,
        return_inferencedata=True
    )
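Convergence is usually judged by comparing chains with one another. The sketch below implements the classic Gelman-Rubin statistic in plain NumPy on synthetic chains, to show the idea behind the r_hat column that ArviZ reports in its summary. It is a simplified illustration (ArviZ uses a more refined rank-normalized, split-chain variant), and `gelman_rubin` is a hypothetical helper name.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic potential scale reduction factor for an array shaped (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()            # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)   # B: variance of chain means, scaled by n
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(1)
mixed = rng.normal(0.0, 1.0, size=(4, 2000))             # four chains exploring the same distribution
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])   # one chain offset from the rest

print(f"well-mixed chains: {gelman_rubin(mixed):.3f}")
print(f"one stuck chain:   {gelman_rubin(stuck):.3f}")
```

Values close to 1 suggest the chains agree; a value well above 1, as in the second case, indicates that at least one chain is sampling a different region and the run should not be trusted.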
Posterior Analysis & Point Predictions
Posterior means of α and β provide point forecasts; MAE on held‑out test data quantifies average error.
import arviz as az
from sklearn.metrics import mean_absolute_error
# Summarize posterior
az.summary(trace, round_to=2)
# Posterior predictive sampling (draws for the training observations)
with model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["MaterialCost"])
# Extract posterior means
α_post = trace.posterior["α"].mean().item()
β_post = trace.posterior["β"].mean(dim=["chain","draw"]).values
# Point predictions on test set
y_pred = α_post + X_test_s.dot(β_post)
# Compute MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${mae:.2f}")
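MAE scores only the point forecast. A complementary check, sketched here on synthetic data, is the empirical coverage of the predictive interval: the share of held-out costs that fall inside their 94% interval. The function name `interval_coverage` and all numbers are assumptions for illustration; with a real model, the predictive draws for the test projects would take the place of `pred_draws_toy`.

```python
import numpy as np

def interval_coverage(y_true, pred_draws, prob=0.94):
    """Share of observations inside the equal-tailed `prob` predictive interval.

    pred_draws has shape (n_draws, n_observations).
    """
    tail = (1.0 - prob) / 2.0
    lower = np.quantile(pred_draws, tail, axis=0)
    upper = np.quantile(pred_draws, 1.0 - tail, axis=0)
    inside = (y_true >= lower) & (y_true <= upper)
    return float(inside.mean())

# Synthetic stand-ins (made-up numbers): 200 projects, 4,000 predictive draws each
rng = np.random.default_rng(7)
true_mean = rng.uniform(500_000, 2_000_000, size=200)
y_obs = rng.normal(true_mean, 100_000)
pred_draws_toy = rng.normal(true_mean, 100_000, size=(4000, 200))

coverage = interval_coverage(y_obs, pred_draws_toy)
print(f"Empirical coverage of the 94% interval: {coverage:.1%}")
```

Coverage far below the nominal level points to intervals that are too narrow (overconfident); coverage near 100% on a large test set suggests they are wider than necessary.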
Visualise Predictions with Credible Intervals
By sweeping one feature (project size) and holding the others fixed at their training medians, we plot both the posterior mean cost curve and its 94% highest posterior density (HPD) interval, showing the expected trend together with the uncertainty around it.
import numpy as np
import matplotlib.pyplot as plt
# Vary Project_Size_sqft; hold others at median
size_grid = np.linspace(X_train_s[:,0].min(), X_train_s[:,0].max(), 100)
grid = np.tile(np.median(X_train_s, axis=0), (100,1))
grid[:,0] = size_grid
# The model was built on fixed training arrays, so compute predictions for the
# grid directly from the posterior draws. Flatten (chain, draw) into one axis.
post = trace.posterior
α_s = post["α"].values.reshape(-1)
β_s = post["β"].values.reshape(-1, grid.shape[1])
σ_s = post["σ"].values.reshape(-1)
# Posterior predictive draws: Normal(α + grid·β, σ), one row per posterior sample
rng = np.random.default_rng(42)
preds = rng.normal(α_s[:, None] + β_s @ grid.T, σ_s[:, None])
mean_pred = preds.mean(axis=0)
# Leading axis makes ArviZ read the array as (chain, draw, grid point)
hpd = az.hdi(preds[np.newaxis], hdi_prob=0.94)
# Convert project size back to original scale
size_orig = scaler.inverse_transform(grid)[:, 0]
plt.figure(figsize=(8,5))
plt.plot(size_orig, mean_pred, label="Posterior mean")
plt.fill_between(size_orig, hpd[:, 0], hpd[:, 1], alpha=0.3,
                 label="94% credible interval")
plt.scatter(
    X_test[:, 0], y_test,  # unscaled project size of the test projects
    color="k", alpha=0.5, label="Test data"
)
plt.xlabel("Project Size (sqft)")
plt.ylabel("Material Cost (USD)")
plt.title("Bayesian Regression: Material Cost vs. Project Size")
plt.legend()
plt.tight_layout()
plt.show()
Summary
This Bayesian Regression pipeline for construction material cost forecasting delivers:
1. Point estimates of material cost from early project specifications, with test‑set MAE as the accuracy check.
2. Credible intervals that quantify uncertainty from complexity and market variation—critical for risk‐aware budgeting.
3. Actionable insights: cost engineers can use both expected cost and uncertainty bounds to negotiate supplier contracts, set contingency reserves, and optimise procurement timing under uncertainty.
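One way to turn the predictive distribution into a contingency reserve is to budget at a chosen quantile and treat the margin over the expected cost as the reserve. The sketch below assumes an 80% confidence level and uses synthetic cost draws. Both are illustrative choices, not values from the pipeline, and `contingency_reserve` is a hypothetical helper.

```python
import numpy as np

def contingency_reserve(cost_draws, confidence=0.80):
    """Budget at the given predictive quantile, and its margin over the expected cost."""
    expected = float(np.mean(cost_draws))
    budget = float(np.quantile(cost_draws, confidence))
    return expected, budget, budget - expected

# Synthetic predictive draws for one project (USD); made-up numbers
rng = np.random.default_rng(3)
cost_draws = rng.normal(1_500_000, 200_000, size=20_000)

expected, budget, reserve = contingency_reserve(cost_draws)
print(f"Expected cost:       ${expected:,.0f}")
print(f"80% budget:          ${budget:,.0f}")
print(f"Contingency reserve: ${reserve:,.0f}")
```

Raising the confidence level increases the reserve, so the choice reflects how much overrun risk the project owner is willing to accept.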