Property Value Growth Prediction using Polynomial Regression in ML
Real-estate analysts and investors want to forecast property-value growth (%) for residential homes from current market indicators, such as recent sale price, square footage, lot size, number of bedrooms and bathrooms, year built, and neighbourhood socio-economic scores, before making purchase or development decisions. The relationship between these predictors and price appreciation is nonlinear: additional square footage yields diminishing returns, and lot size interacts with neighbourhood factors. A simple linear model underfits these curves, while an unconstrained high-degree polynomial overfits noise. Applying Polynomial Regression with Ridge regularisation to carefully engineered features captures smooth, nonlinear dependencies while keeping the forecasts interpretable.
Libraries Required
import pandas as pd               # data manipulation
import numpy as np                # numerical operations
import matplotlib.pyplot as plt   # plotting
import seaborn as sns             # enhanced visualization
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
Dataset
House Prices – Advanced Regression Techniques (Kaggle competition dataset; the code below reads its train.csv file)
Step-by-Step Code Implementation
Load Libraries & Data
import pandas as pd
# Load training data
df = pd.read_csv("data/train.csv")
# Preview key columns
df.head()[['SalePrice','GrLivArea','LotArea','YearBuilt','OverallQual','Neighborhood']]
Feature Engineering & Target Definition
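- Target engineering: we define PriceGrowthPct as the percentage by which SalePrice exceeds ReplacementCost, a rough replacement-cost proxy set to OverallQual × 10,000. Because each row records a single sale, this is a cross-sectional premium over the baseline that stands in for appreciation, not a change measured over time.
- Polynomial features: at degree 2, PolynomialFeatures expands our five inputs into their squares and pairwise interactions (e.g., GrLivArea² and GrLivArea×OverallQual), capturing nonlinear effects such as diminishing returns and quality synergy.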
# Compute property-value growth as percentage above replacement cost proxy:
# here we define growth = (SalePrice - OverallQual*10000) / (OverallQual*10000)
df['ReplacementCost'] = df['OverallQual'] * 10000
df['PriceGrowthPct'] = (df['SalePrice'] - df['ReplacementCost']) / df['ReplacementCost'] * 100
# Select predictors known at buy-in
features = ['GrLivArea','LotArea','YearBuilt','OverallQual','OverallCond']
X = df[features]
y = df['PriceGrowthPct']
Exploratory Data Analysis
import seaborn as sns
import matplotlib.pyplot as plt
# Nonlinear trend: living area vs growth
sns.scatterplot(x='GrLivArea', y='PriceGrowthPct', data=df, alpha=0.5)
plt.title("Living Area vs Price Growth (%)")
plt.xlabel("Above‑ground Living Area (sq ft)")
plt.ylabel("Price Growth (%)")
plt.show()
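A scatter plot shows the trend visually; a quick numeric cross-check is to compare the median target across equal-count bins of the predictor. The helper below is a hypothetical convenience function (`binned_median` is not part of pandas or seaborn), and the demo data is synthetic with a deliberately concave relationship, so treat it as a sketch rather than a result from the housing data.

```python
import numpy as np
import pandas as pd

def binned_median(data, x, y, bins=5):
    """Median of y within equal-count bins of x (a rough numeric trend check)."""
    groups = pd.qcut(data[x], q=bins, duplicates='drop')
    return data.groupby(groups, observed=True)[y].median()

# Synthetic stand-in for df (assumed data following a square-root curve)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'GrLivArea': rng.uniform(500, 4000, 500)})
demo['PriceGrowthPct'] = 40 * np.sqrt(demo['GrLivArea']) + rng.normal(0, 50, 500)

trend = binned_median(demo, 'GrLivArea', 'PriceGrowthPct')
print(trend)

# Gaps between successive bin medians; gaps that shrink as area grows
# are consistent with diminishing returns
print(trend.diff())
```

On the real data the call would be `binned_median(df, 'GrLivArea', 'PriceGrowthPct')`.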
Build Polynomial Regression Pipeline
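- StandardScaler standardises the five raw inputs before they are expanded, so the polynomial terms are built from comparably scaled values and no large-unit feature (e.g., LotArea) dominates Ridge’s ℓ² penalty. The expanded terms themselves are not re-standardised in this pipeline.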
- Ridge regression applies ℓ² regularisation to shrink noisy high‑order coefficients, controlling overfitting in the expanded feature space.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipe = Pipeline([
('scale', StandardScaler()),
('poly', PolynomialFeatures(include_bias=False)),
('ridge', Ridge(random_state=42))
])
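Because the pipeline above scales only the raw inputs, squared and interaction columns can still differ in spread. One possible variant, shown here as an assumption rather than the tutorial's method, adds a second StandardScaler after the expansion so every column entering Ridge has unit variance. The function name `make_rescaled_pipeline` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def make_rescaled_pipeline(degree=2, alpha=1.0):
    """Variant pipeline: scale, expand, scale again, then fit Ridge."""
    return Pipeline([
        ('scale', StandardScaler()),
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('rescale', StandardScaler()),   # extra step, not in the pipeline above
        ('ridge', Ridge(alpha=alpha)),
    ])

# Synthetic inputs on very different scales (assumed data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * [1.0, 50.0, 1000.0]
y_demo = X_demo[:, 0] ** 2 + 0.01 * X_demo[:, 1] + rng.normal(size=200)

pipe_rescaled = make_rescaled_pipeline().fit(X_demo, y_demo)

# Slice off the final estimator to inspect what Ridge actually receives
Z = pipe_rescaled[:-1].transform(X_demo)
print(Z.shape)                     # 3 linear + 6 degree-2 terms = 9 columns
print(Z.std(axis=0).round(3))      # each column has unit standard deviation
```

The step names `poly` and `ridge` are unchanged, so a parameter grid keyed on `poly__degree` and `ridge__alpha` would work with this variant as well.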
Train/Test Split & Hyperparameter Search
GridSearchCV tunes polynomial degree (1–3) and regularisation strength α (10⁻³ to 10³) via 5‑fold CV, selecting the model that minimises RMSE on held‑out folds.
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
'poly__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
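Beyond the single best setting, it can help to see how every degree and α combination scored. The sketch below flattens `cv_results_` into a small table with RMSE as a positive number; `summarise_grid` is a hypothetical helper, and the demo fits a reduced grid on synthetic data so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def summarise_grid(search):
    """Tabulate cross-validated RMSE per (degree, alpha), best first."""
    res = pd.DataFrame(search.cv_results_)
    out = pd.DataFrame({
        'degree': res['param_poly__degree'],
        'alpha': res['param_ridge__alpha'],
        # scoring is negated RMSE, so flip the sign back
        'cv_rmse': -res['mean_test_score'],
        'cv_rmse_std': res['std_test_score'],
    })
    return out.sort_values('cv_rmse').reset_index(drop=True)

# Small synthetic problem and reduced grid (assumed, for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=3,
                                 noise=10.0, random_state=0)
demo_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge()),
])
demo_gs = GridSearchCV(
    demo_pipe,
    {'poly__degree': [1, 2], 'ridge__alpha': [0.1, 10.0]},
    cv=5,
    scoring='neg_root_mean_squared_error',
)
demo_gs.fit(X_demo, y_demo)

table = summarise_grid(demo_gs)
print(table)
```

With the search fitted above, `summarise_grid(gs)` would list all 21 combinations; settings whose RMSE differs by less than the fold-to-fold spread are hard to tell apart.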
Evaluate Model
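import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
y_pred = gs.predict(X_test)
# Take the square root explicitly: the squared=False argument is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)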
print(f"Test RMSE: {rmse:.2f} % growth")
print(f"Test R² : {r2:.3f}")
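An RMSE figure is easier to judge against a naive reference. A minimal sketch, assuming the reference is "always predict the training-set mean": `baseline_rmse` is a hypothetical helper built on scikit-learn's DummyRegressor, and the numbers in the demo are made up to keep the arithmetic checkable.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

def baseline_rmse(y_train, y_test):
    """RMSE obtained by always predicting the training-set mean."""
    dummy = DummyRegressor(strategy='mean')
    # DummyRegressor ignores the inputs, so placeholder columns suffice
    dummy.fit(np.zeros((len(y_train), 1)), y_train)
    pred = dummy.predict(np.zeros((len(y_test), 1)))
    return float(np.sqrt(mean_squared_error(y_test, pred)))

# Made-up targets: the training mean is 175, so the test errors are -50 and +50
y_tr = np.array([100.0, 150.0, 200.0, 250.0])
y_te = np.array([125.0, 225.0])
print(baseline_rmse(y_tr, y_te))   # 50.0
```

With the split from earlier, `rmse / baseline_rmse(y_train, y_test)` gives the fraction of the naive error that remains; values well below 1 indicate the model adds real predictive value.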
Inspect Key Polynomial Coefficients
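Coefficient inspection ranks the polynomial terms by absolute coefficient size, indicating which features and combinations (e.g., large living area in high-quality homes) carry the most weight in the fitted model. Because the expanded terms are correlated and are not individually standardised, treat the ranking as a guide rather than a strict measure of importance.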
# Retrieve polynomial feature names
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=features)
# Retrieve Ridge coefficients
coefs = gs.best_estimator_.named_steps['ridge'].coef_
import pandas as pd
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)
# Plot top 10
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Driving Price Growth")
plt.xlabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
Summary
By integrating polynomial feature engineering with Ridge regularisation in a concise pipeline, we achieve:
1. Nonlinear forecasts of property-value growth, with accuracy quantified by RMSE and R² on a held-out test set.
2. Controlled model complexity, avoiding overfitting while capturing critical curvature and interaction effects.
3. Interpretable insights: the top polynomial features highlight which combinations of size, quality, and age most influence appreciation, supporting data‑driven investment strategies.