Building Material Cost Prediction using Linear Regression in ML
Contractors, quantity surveyors, and DIY homeowners all face a common question: “How much will the materials cost?” Even a small under‑estimate can torpedo a project’s profit margin, while over‑budgeting may scare away clients.
Using a public construction-estimation dataset, we’ll build a linear-regression baseline that predicts the total material cost (USD) for a building project from blueprint-level inputs—built-up area, wall height, concrete volume, steel weight, and finishing quality. A transparent model exposes first-order cost drivers and provides a reality check before deeper, nonlinear models or parametric BIM tools are implemented.
Libraries Required
- pandas # data loading & wrangling
- numpy # numerical helpers
- matplotlib.pyplot # quick sanity plots
- scikit‑learn # preprocessing, model, metrics
- joblib # save the trained pipeline
Dataset Link
Step-by-Step Code Implementation
Why linear regression first? Within normal design envelopes, material cost tends to scale nearly linearly with physical quantities—twice the concrete roughly doubles the cement bill. A straight-line fit clearly quantifies those marginal costs.
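To make the "marginal cost" idea concrete, here is a minimal sketch on synthetic data. The unit prices below are assumptions invented for illustration, not values from the dataset. The cost is generated as a linear function of two quantities, and an ordinary least-squares fit recovers the per-unit prices.

```python
import numpy as np

# Hypothetical unit prices (assumptions for illustration only).
PRICE_PER_M3_CONCRETE = 110.0   # USD per cubic metre
PRICE_PER_KG_STEEL = 0.9        # USD per kilogram

rng = np.random.default_rng(0)
concrete_m3 = rng.uniform(20, 200, size=200)
steel_kg = rng.uniform(500, 8000, size=200)

# Cost is built as a linear function of the quantities plus small noise.
cost = (PRICE_PER_M3_CONCRETE * concrete_m3
        + PRICE_PER_KG_STEEL * steel_kg
        + rng.normal(0, 50, size=200))

# Least-squares fit; the column of ones plays the role of the intercept.
A = np.column_stack([concrete_m3, steel_kg, np.ones_like(concrete_m3)])
coef, *_ = np.linalg.lstsq(A, cost, rcond=None)

print(f"Estimated USD per m3 of concrete: {coef[0]:.1f}")
print(f"Estimated USD per kg of steel:    {coef[1]:.2f}")
```

On real project data the relationship will be noisier, but the fitted slopes are read the same way: each is the extra cost of one more unit of that quantity.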
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
df = pd.read_csv("construction_estimation.csv")
print(df.head())
Typical columns in the dataset
Area_m2, Wall_Height_m, Concrete_m3, Steel_kg, Finishing_Quality, Material_Cost_USD
3. Minimal cleaning
Standard scaling of the numeric inputs (applied inside the pipeline in step 5) puts area (m²) and steel (kg) on a comparable scale, so coefficients read as dollars per one‑standard‑deviation change, which is easy to explain to non‑statisticians.
# keep rows with all critical data present
core = ['Area_m2', 'Wall_Height_m', 'Concrete_m3',
'Steel_kg', 'Finishing_Quality', 'Material_Cost_USD']
df = df.dropna(subset=core).copy()
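The "dollars per one‑standard‑deviation change" reading of scaled coefficients can be checked with a small sketch. The data and the price rule here are synthetic assumptions. Dividing each scaled coefficient by the scaler's `scale_` attribute converts it back to dollars per original unit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
area_m2 = rng.uniform(50, 500, size=300)
steel_kg = rng.uniform(500, 8000, size=300)
X_demo = np.column_stack([area_m2, steel_kg])

# Hypothetical noiseless cost rule: 40 USD per m2 and 0.9 USD per kg.
y_demo = 40.0 * area_m2 + 0.9 * steel_kg

scaler = StandardScaler().fit(X_demo)
model = LinearRegression().fit(scaler.transform(X_demo), y_demo)

per_sigma = model.coef_                  # USD per one-standard-deviation change
per_unit = model.coef_ / scaler.scale_   # USD per m2 and USD per kg

print("USD per sigma:", per_sigma.round(0))
print("USD per unit :", per_unit.round(2))
```

The per‑σ figures are convenient for comparing features against each other, while the per‑unit figures are the ones to quote when someone asks what an extra square metre costs.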
4. Define features & label
num_cols = ['Area_m2', 'Wall_Height_m', 'Concrete_m3', 'Steel_kg']
cat_cols = ['Finishing_Quality']  # e.g., Basic / Standard / Premium
target = 'Material_Cost_USD'
X = df[num_cols + cat_cols]
y = df[target]
5. Pre‑processing & model pipeline
One‑hot encoding treats Finishing_Quality as purely categorical, avoiding any false numeric order between “Standard” and “Premium.”
preprocess = ColumnTransformer([
('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline(steps=[
('prep', preprocess),
('model', linreg)
])
6. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
7. Performance metrics
R² expresses the share of cost variance explained, while MAE tells estimators the typical dollar error. If MAE is about $3,400, bids can include that buffer with eyes open.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f}")
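For readers who want to see what the two metrics compute, the sketch below works them out by hand on a few made‑up cost figures (not results from the dataset) and compares them with the scikit‑learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted costs in USD, for illustration only.
y_true = np.array([52_000.0, 75_500.0, 61_200.0, 98_000.0, 43_300.0])
y_hat = np.array([50_100.0, 79_000.0, 60_000.0, 93_500.0, 46_000.0])

# MAE: average size of the miss, in dollars.
mae = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE by hand: ${mae:,.0f}  | sklearn: ${mean_absolute_error(y_true, y_hat):,.0f}")
print(f"R² by hand : {r2:.3f} | sklearn: {r2_score(y_true, y_hat):.3f}")
```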
8. Inspect cost drivers
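The coefficient table surfaces the main levers. For example, a large positive per‑σ coefficient for Concrete_m3 (say, $120) would flag concrete as a major budget driver, while the coefficient on the premium‑quality flag quantifies the upgrade surcharge. Because every finishing category gets its own one‑hot column alongside the model's intercept, read the category coefficients relative to one another rather than as absolute dollar amounts.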
# recover feature names after one‑hot encoding
ohe_feats = pipe.named_steps['prep']\
.named_transformers_['cat']\
.get_feature_names_out(cat_cols)
all_feats = list(ohe_feats) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_,
index=all_feats).sort_values()
# filter by sign so each heading matches what is printed
print("\nCost‑reducing factors (negative coefficients):")
print(coefs[coefs < 0])
print("\nCost‑increasing factors (positive coefficients):")
print(coefs[coefs > 0])
9. Persist the pipeline
Pipeline persistence (joblib.dump) locks preprocessing and regression weights together, ensuring tomorrow’s estimate pipeline can be called from a web form or BIM plugin with zero re‑training.
joblib.dump(pipe, "building_material_cost_linreg.pkl")
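A minimal sketch of the reload‑and‑predict side is shown below. It trains on a handful of invented rows and writes to a temporary directory so that it runs on its own. In practice you would load the file saved above, and the `new_project` values are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["Area_m2", "Wall_Height_m", "Concrete_m3", "Steel_kg"]
cat_cols = ["Finishing_Quality"]

# Invented toy rows, only so the example is self-contained.
toy = pd.DataFrame({
    "Area_m2": [120, 250, 90, 400, 310, 180, 220, 150],
    "Wall_Height_m": [3.0, 3.2, 2.8, 3.5, 3.0, 3.1, 2.9, 3.3],
    "Concrete_m3": [45, 95, 30, 160, 120, 70, 85, 55],
    "Steel_kg": [3200, 7000, 2100, 12500, 9000, 5100, 6200, 4000],
    "Finishing_Quality": ["Basic", "Standard", "Basic", "Premium",
                          "Standard", "Premium", "Standard", "Basic"],
    "Material_Cost_USD": [38000, 82000, 27000, 150000,
                          104000, 66000, 74000, 45000],
})

toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])),
    ("model", LinearRegression()),
]).fit(toy[num_cols + cat_cols], toy["Material_Cost_USD"])

# A new project must use the same column names as the training data.
new_project = pd.DataFrame([{
    "Area_m2": 200.0, "Wall_Height_m": 3.0, "Concrete_m3": 80.0,
    "Steel_kg": 5800.0, "Finishing_Quality": "Standard",
}])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "building_material_cost_linreg.pkl")
    joblib.dump(toy_pipe, model_path)
    loaded_pipe = joblib.load(model_path)

estimate = loaded_pipe.predict(new_project)
print(f"Estimated material cost: ${estimate[0]:,.0f}")
```

Because scaling and encoding live inside the saved object, the caller passes raw blueprint values and never has to repeat the preprocessing. Only load pickle files from sources you trust, and use a compatible scikit‑learn version when reloading.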
Summary
In fewer than 80 lines of Python code, we transformed raw blueprint metrics into an explainable building-material cost estimator. The linear model delivers two immediate wins:
- Fast, reproducible cost forecasts during the bidding or design phase.
- Crystal‑clear marginal costs that tell engineers exactly which quantities or finish upgrades drive the budget north.
Keep this transparent baseline as a benchmark; when you later fold in supplier price indices, regional labour multipliers, or pivot to gradient‑boosted trees, you’ll know precisely how much extra accuracy the added complexity buys.