Construction Cost Trend Prediction with Polynomial Regression in ML
Civil-engineering and project-management teams need to forecast the total construction cost (USD) of new projects before detailed design and contractor bids are available, using only high-level planning inputs such as project size, duration, building type, location, complexity rating, and start date. Historical project datasets reveal nonlinear cost behaviours: economies of scale taper off beyond certain floor-area thresholds, remote locations amplify costs disproportionately, and schedule compression drives steep premium surcharges. A plain linear regression cannot represent this curvature, while an unregularised high-degree polynomial tends to overfit outlier bids. Applying Polynomial Regression to engineered features with Ridge (ℓ2) regularisation lets us capture smooth, nonlinear cost trends and produce stable, interpretable forecasts for early budgeting and risk assessment.
Dataset
Construction Project Cost Data
Step-by-Step Code Implementation
1. Libraries Required
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
2. Load Data & Inspect
Target & features: We predict Actual_Cost from Estimated_Cost and Duration_Mo (project duration in months, derived from Start_Date and End_Date in step 3), plus the categorical features Region and Complexity.
import pandas as pd
df = pd.read_csv("construction_project_cost_data.csv")
df.head()[[
'Project','Estimated_Cost','Actual_Cost',
'Start_Date','End_Date','Complexity','Region'
]]
3. Feature Engineering
PolynomialFeatures augments inputs with squares and interactions (e.g., Estimated_Cost², Duration_Mo × Region_Midwest), capturing nonlinear scale and regional effects.
import numpy as np

# Parse the date columns
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df['End_Date'] = pd.to_datetime(df['End_Date'])

# Project duration in months. Dividing by np.timedelta64(1, 'M') fails in
# recent pandas/NumPy versions because a month has no fixed length, so we
# convert days using the average month length (365.25 / 12 = 30.4375 days).
df['Duration_Mo'] = (df['End_Date'] - df['Start_Date']).dt.days / 30.4375

# Target: the actual project cost (not an overrun ratio)
df['Cost'] = df['Actual_Cost']

# Categorical and numeric feature columns
categorical = ['Region', 'Complexity']
numeric = ['Estimated_Cost', 'Duration_Mo']

# Drop any rows with missing key values
df = df.dropna(subset=numeric + categorical + ['Cost'])
4. Build Polynomial Regression Pipeline
Pipeline steps:
- StandardScaler z-scores the numeric features so the Ridge penalty treats them on a common scale.
- OneHotEncoder converts region and complexity levels into dummy variables.
- Ridge regression applies an ℓ2 penalty (alpha) to shrink noisy high-order coefficients and reduce overfitting.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
preprocessor = ColumnTransformer([
('num', StandardScaler(), numeric),
('cat', OneHotEncoder(drop='first'), categorical)
])
pipe = Pipeline([
('prep', preprocessor),
('poly', PolynomialFeatures(include_bias=False)),
('ridge', Ridge(random_state=42))
])
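The shrinkage effect of the Ridge penalty can be seen in isolation with a small synthetic sketch. The data below is randomly generated for illustration and is unrelated to the construction dataset; the alpha values are arbitrary picks from the range searched later.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (assumption: purely illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Fit Ridge at increasing penalty strengths and record the coefficient norm.
norms = {}
for alpha in [0.001, 1.0, 1000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:<8} ||coef|| = {norms[alpha]:.4f}")
```

The coefficient norm falls as alpha grows: a small alpha behaves almost like ordinary least squares, while a large alpha pulls all coefficients toward zero.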
5. Train/Test Split & Hyperparameter Search
GridSearchCV tunes polynomial degree (1–3) and regularisation alpha (10⁻³–10³) with 5‑fold CV, optimising RMSE.
from sklearn.model_selection import train_test_split, GridSearchCV
X = df[numeric + categorical]
y = df['Cost']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
'poly__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best degree:", gs.best_params_['poly__degree'])
print("Best alpha :", gs.best_params_['ridge__alpha'])
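One practical reason to keep the degree grid small is that the number of generated columns grows combinatorially with the degree. The sketch below counts the terms for a polynomial expansion without the bias column; the input count of 8 is a hypothetical figure, since the real number depends on how many one-hot columns the dataset produces.

```python
from math import comb

def n_poly_terms(n_inputs: int, degree: int) -> int:
    """Count monomials of total degree 1..degree in n_inputs variables (no bias term)."""
    return comb(n_inputs + degree, degree) - 1

n_inputs = 8  # hypothetical: 2 numeric columns plus 6 one-hot columns
counts = {degree: n_poly_terms(n_inputs, degree) for degree in (1, 2, 3)}
for degree, count in counts.items():
    print(f"degree={degree}: {count} features")
```

With few historical projects, a degree-3 expansion can easily produce more columns than training rows, which is where the Ridge penalty matters most.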
6. Evaluate Model
from sklearn.metrics import mean_squared_error, r2_score
y_pred = gs.predict(X_test)
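# Take the square root explicitly: the squared=False argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))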
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: ${rmse:,.0f}")
print(f"Test R² : {r2:.3f}")
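For readers who want to see what the two metrics compute, here is a minimal plain-Python sketch. The three cost values are invented for the arithmetic and are not model output.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: typical error size in the target's own units."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual variation / total variation."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted costs (illustrative only).
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]

print(f"RMSE: {rmse(y_true, y_pred):.3f}")
print(f"R2  : {r2(y_true, y_pred):.3f}")
```

RMSE is expressed in dollars, so it can be compared directly against budget contingencies, while R² is unitless and describes the share of cost variation the model accounts for.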
7. Interpret Key Polynomial Coefficients
Coefficient inspection surfaces the strongest nonlinear drivers. It may highlight, for example, an interaction between high estimated cost and high complexity acting as a cost multiplier. Because the numeric inputs are standardised, the magnitudes are comparable across features.
import pandas as pd
import matplotlib.pyplot as plt
# Reconstruct feature names
prep = gs.best_estimator_.named_steps['prep']
num_feats = numeric
cat_feats = prep.named_transformers_['cat'] \
.get_feature_names_out(categorical).tolist()
inputs = num_feats + cat_feats
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=inputs)
coefs = gs.best_estimator_.named_steps['ridge'].coef_
imp = pd.Series(coefs, index=feat_names).abs() \
.sort_values(ascending=False).head(10)
plt.figure(figsize=(8,5))
imp.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Construction Cost")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
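The plot above ranks features by absolute coefficient, which hides whether a term pushes cost up or down. A small sketch of ranking by magnitude while keeping the sign follows; the coefficient values are hypothetical placeholders, not results from this model.

```python
import pandas as pd

# Hypothetical coefficients, for illustration only.
coefs = pd.Series({
    "Estimated_Cost": 5.0,
    "Duration_Mo": -1.5,
    "Estimated_Cost^2": 0.8,
    "Estimated_Cost Duration_Mo": -3.2,
})

# Order by absolute size, then look up the original signed values in that order.
order = coefs.abs().sort_values(ascending=False).index
top_signed = coefs.reindex(order)
print(top_signed)
```

With the fitted pipeline, the same reindexing could be applied to the series built from `coefs` and `feat_names` before plotting.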
Summary
This Polynomial Regression pipeline with Ridge regularisation delivers:
- Construction cost forecasts from early planning inputs, with accuracy measured by RMSE and R² on a held-out test set.
- Polynomial terms that capture economies of scale and premium surcharges, with model complexity controlled via α-tuning.
- Interpretable insights: the most influential polynomial features show how estimated cost, duration, region, and complexity interact nonlinearly, helping project teams refine estimates and proactively manage risk.