Construction Cost Trend Prediction with Polynomial Regression in ML

Civil‑engineering and project‑management teams need to forecast the total construction cost (USD) of new projects before detailed design and contractor bids, using only high‑level planning inputs such as project size, duration, building type, location, complexity rating, and start date. Historical project datasets reveal nonlinear cost behaviours: economies of scale taper off beyond certain floor‑area thresholds, remote locations amplify costs disproportionately, and schedule compression drives steep premium surcharges. A plain linear regression cannot represent this curvature, while a naïve high‑degree polynomial overfits outlier bids. Applying Polynomial Regression to engineered features with Ridge (ℓ₂) regularisation captures smooth, nonlinear cost trends while keeping the forecasts robust and interpretable enough for early budgeting and risk assessment.

Dataset

Construction Project Cost Data

Step-by-Step Code Implementation

1. Libraries Required

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

2. Load Data & Inspect

Target & features: We predict Actual_Cost from Estimated_Cost and Duration_Mo (derived from Start_Date and End_Date in step 3), plus the categorical columns Region and Complexity.

import pandas as pd

df = pd.read_csv("construction_project_cost_data.csv")
df.head()[[
    'Project','Estimated_Cost','Actual_Cost',
    'Start_Date','End_Date','Complexity','Region'
]]

3. Feature Engineering

PolynomialFeatures augments inputs with squares and interactions (e.g., Estimated_Cost², Duration_Mo × Region_Midwest), capturing nonlinear scale and regional effects.

import numpy as np

# Calculate project duration in months
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df['End_Date']   = pd.to_datetime(df['End_Date'])
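# Recent pandas versions reject the ambiguous 'M' timedelta unit,
# so convert days to months using the average month length (365.25 / 12)
df['Duration_Mo'] = (df['End_Date'] - df['Start_Date']).dt.days / 30.4375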

# Use the actual cost (USD) as the prediction target
df['Cost'] = df['Actual_Cost']

# Column groups: categorical columns are one-hot encoded and
# numeric columns are scaled later, inside the pipeline
categorical = ['Region','Complexity']
numeric     = ['Estimated_Cost','Duration_Mo']

# Drop any rows with missing key values
df = df.dropna(subset=numeric+categorical+['Cost'])

4. Build Polynomial Regression Pipeline

Preprocessing and model:

StandardScaler z‑scores the numeric features so that the Ridge penalty treats them on a comparable scale.
OneHotEncoder converts region and complexity levels into dummy variables.
Ridge regression applies an ℓ₂ penalty (alpha) to shrink noisy high‑order coefficients and reduce overfitting.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric),
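    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    # instead of raising, which can otherwise happen inside a CV fold
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical)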
])

pipe = Pipeline([
    ('prep', preprocessor),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge(random_state=42))
])
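The shrinkage effect of alpha can be seen on a small synthetic problem. The sketch below is an illustration only: the data are randomly generated, and the names x, y_syn, and coef_norms are assumptions introduced here, not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic quadratic relationship with noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(60, 1))
y_syn = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.5, size=60)

coef_norms = {}
for alpha in [0.001, 1.0, 1000.0]:
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=3, include_bias=False),
        Ridge(alpha=alpha),
    )
    model.fit(x, y_syn)
    # Euclidean norm of the fitted coefficient vector
    coef_norms[alpha] = float(np.linalg.norm(model.named_steps["ridge"].coef_))
    print(f"alpha={alpha:<8} ||coef|| = {coef_norms[alpha]:.4f}")
```

The coefficient norm falls as alpha grows, which is the mechanism that keeps high‑order terms from chasing outlier bids.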

5. Train/Test Split & Hyperparameter Search

GridSearchCV tunes polynomial degree (1–3) and regularisation alpha (10⁻³–10³) with 5‑fold CV, optimising RMSE.

from sklearn.model_selection import train_test_split, GridSearchCV

X = df[numeric + categorical]
y = df['Cost']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-3, 3, 7)
}

gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print("Best degree:", gs.best_params_['poly__degree'])
print("Best alpha :", gs.best_params_['ridge__alpha'])
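Beyond the single best combination, it can help to look at the whole grid. The sketch below runs the same kind of search on synthetic data and tabulates cross‑validated RMSE by degree and alpha; demo_pipe, demo_grid, and the generated data are assumptions for illustration. Scikit‑learn reports this scorer as a negative number (higher is better), so the sign is flipped before display.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with one quadratic and one linear effect (illustrative only)
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(80, 2))
y_syn = X_syn[:, 0] ** 2 + X_syn[:, 1] + rng.normal(0, 0.1, size=80)

demo_pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])
demo_grid = {"poly__degree": [1, 2, 3], "ridge__alpha": [0.01, 1.0, 100.0]}

demo_gs = GridSearchCV(demo_pipe, demo_grid, cv=5,
                       scoring="neg_root_mean_squared_error")
demo_gs.fit(X_syn, y_syn)

# Flip the sign so the table shows ordinary (positive) RMSE
results = pd.DataFrame(demo_gs.cv_results_)
results["cv_rmse"] = -results["mean_test_score"]
table = results.pivot(index="param_poly__degree",
                      columns="param_ridge__alpha",
                      values="cv_rmse")
print(table.round(3))
```

Applied to the real search, the same pivot on gs.cv_results_ would show whether the chosen degree wins by a clear margin or only marginally, in which case the simpler model may be preferable.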

6. Evaluate Model

from sklearn.metrics import mean_squared_error, r2_score

y_pred = gs.predict(X_test)
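# The squared=False argument was removed in recent scikit-learn releases,
# so take the square root of the MSE explicitly
rmse = np.sqrt(mean_squared_error(y_test, y_pred))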
r2   = r2_score(y_test, y_pred)

print(f"Test RMSE: ${rmse:,.0f}")
print(f"Test R²  : {r2:.3f}")
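For reference, both metrics are simple to compute by hand, and an RMSE figure is easier to judge next to a naive baseline. The sketch below uses made‑up costs (not results from the dataset) and two hypothetical helper functions, rmse_manual and r2_manual, to compare a set of predictions against always predicting the mean.

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical costs in USD, for illustration only
actual    = [1_000_000, 2_000_000, 3_000_000, 4_000_000]
predicted = [1_100_000, 1_900_000, 3_200_000, 3_800_000]
baseline  = [float(np.mean(actual))] * len(actual)  # always predict the mean

print(f"Model    RMSE: ${rmse_manual(actual, predicted):,.0f}  R²: {r2_manual(actual, predicted):.3f}")
print(f"Baseline RMSE: ${rmse_manual(actual, baseline):,.0f}  R²: {r2_manual(actual, baseline):.3f}")
```

The mean predictor scores R² = 0 by construction, so R² measures improvement over it. On the real data, another natural baseline would be to use Estimated_Cost itself as the prediction and check that the model beats it.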

7. Interpret Key Polynomial Coefficients

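Coefficient inspection surfaces the strongest drivers in the fitted model. If the interaction between estimated cost and a high complexity level ranks near the top, for example, that points to complexity acting as a cost multiplier on larger projects. Numeric inputs were z‑scored before expansion, so coefficients describe the effect of standard‑deviation changes rather than raw units.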

import pandas as pd
import matplotlib.pyplot as plt

# Reconstruct feature names
prep = gs.best_estimator_.named_steps['prep']
num_feats = numeric
cat_feats = prep.named_transformers_['cat'] \
               .get_feature_names_out(categorical).tolist()
inputs = num_feats + cat_feats

poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=inputs)
coefs = gs.best_estimator_.named_steps['ridge'].coef_

imp = pd.Series(coefs, index=feat_names).abs() \
         .sort_values(ascending=False).head(10)

plt.figure(figsize=(8,5))
imp.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Construction Cost")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()

Summary

This Polynomial Regression pipeline with Ridge regularisation provides:

Construction cost forecasts from early planning inputs, with accuracy measured on held‑out data by RMSE and R².
Polynomial terms that capture economies of scale and premium surcharges, with model complexity controlled via α‑tuning.
Interpretable insights: the most influential polynomial features show how estimated cost, duration, region, and complexity interact nonlinearly, helping project teams refine estimates and proactively manage risk.
