Property Tax Value Prediction using Linear Regression in ML
County assessors in the United States publish a “property tax assessed value” for every parcel. Banks, insurers, and investors rely on that figure when pricing mortgages, estimating risk, or bundling real‑estate‑backed securities—yet many counties refresh the assessment only once a year.
This mini-project builds a linear-regression baseline that predicts a home’s current assessed value (in US dollars) from publicly available features such as living area, lot size, number of bedrooms, year built, and neighbourhood ZIP code. A fast, transparent model like this flags mis‑priced parcels for manual review and offers a yardstick before deploying heavier ML pipelines.
Libraries Required
- pandas # data wrangling
- numpy # vector maths
- matplotlib.pyplot # sanity plots
- seaborn # quick EDA heatmaps (optional)
- scikit‑learn # preprocessing, model, metrics
- joblib # save the trained pipeline
Dataset Link
Zillow Prize – Home Value Prediction
Step-by-Step Implementation
Why linear regression? Tax assessors typically apply additive adjustments to a base land value—bedrooms add value, age subtracts—making a linear form a sensible first approximation.
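The additive intuition can be checked on synthetic data. The sketch below is only an illustration: the base value and the per-bedroom and per-year adjustments are made-up numbers, not real assessor rules. It builds values from an additive rule and confirms that ordinary least squares recovers those adjustments as its intercept and coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
bedrooms = rng.integers(1, 6, size=n)   # 1 to 5 bedrooms
age = rng.integers(0, 80, size=n)       # building age in years

# Hypothetical additive assessment rule (illustrative numbers only)
BASE, PER_BEDROOM, PER_YEAR_OF_AGE = 150_000, 25_000, -1_000
value = BASE + PER_BEDROOM * bedrooms + PER_YEAR_OF_AGE * age

# Fit a plain linear model on the two features
X_demo = np.column_stack([bedrooms, age])
demo_model = LinearRegression().fit(X_demo, value)

# The intercept plays the role of the base value,
# and each coefficient is one additive adjustment
print(demo_model.intercept_, demo_model.coef_)
```

Real assessments are noisier than this toy rule, so the fitted coefficients on the Zillow data should be read as averages rather than exact adjustments.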
1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load data
- One-hot encoding the ZIP codes (applied in step 5) removes any false numeric ordering between postal zones; each zone gets its own coefficient, which represents its local market premium.
# taxvaluedollarcnt (total assessed value) is the target.
# The dataset has no 'propertytaxvaluedollarcnt' column, and the structure/land
# values and taxamount are left out: they are components of, or computed from,
# the assessed value and would leak the target into the features.
cols = ['parcelid', 'yearbuilt', 'bedroomcnt', 'bathroomcnt',
        'calculatedfinishedsquarefeet', 'lotsizesquarefeet',
        'taxvaluedollarcnt', 'regionidzip']
df = pd.read_csv('properties_2016.csv', usecols=cols, low_memory=False)
3. Minimal cleaning
Outlier trimming drops rows whose tax value falls in the top or bottom 1%, so rare data-entry errors ($100 M ranches or $0 listings) do not skew the fit.
# rename for clarity
df = df.rename(columns={'taxvaluedollarcnt': 'target_tax_value'})
# drop rows with a missing target or feature (LinearRegression rejects NaN)
df = df.dropna(subset=['target_tax_value',
                       'yearbuilt',
                       'bedroomcnt',
                       'bathroomcnt',
                       'calculatedfinishedsquarefeet',
                       'lotsizesquarefeet',
                       'regionidzip'])
# trim extreme outliers (optional, keeps model stable)
low, high = df['target_tax_value'].quantile([0.01, 0.99])
df = df[df['target_tax_value'].between(low, high)]
4. Feature & label definition
num_cols = ['yearbuilt',
            'bedroomcnt', 'bathroomcnt',
            'calculatedfinishedsquarefeet', 'lotsizesquarefeet']
cat_cols = ['regionidzip']  # ZIP as categorical
X = df[num_cols + cat_cols]
y = df['target_tax_value']
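To see what the quantile trimming in step 3 does, here is a small sketch on a toy series. The values are hypothetical: 98 plausible assessments plus one $0 entry and one $100 M entry standing in for data-entry errors.

```python
import pandas as pd

# Hypothetical toy values: one $0 entry, 98 plausible assessments, one $100 M entry
values = pd.Series(
    [0.0] + [200_000.0 + 1_000.0 * i for i in range(98)] + [100_000_000.0]
)

# Same logic as step 3: keep only rows between the 1st and 99th percentiles
low, high = values.quantile([0.01, 0.99])
trimmed = values[values.between(low, high)]

print(len(values), len(trimmed))
print(trimmed.min(), trimmed.max())
```

Only the two extreme entries are dropped; the plausible values pass through unchanged. Note that this filters rows rather than capping values, so the trimmed parcels are excluded from both training and evaluation.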
5. Pre‑processing + model pipeline
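# Sparse output is the default; the old `sparse` argument was renamed
# `sparse_output` in scikit-learn 1.2 and later removed
ohe = OneHotEncoder(handle_unknown='ignore')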
preproc = ColumnTransformer(
transformers=[('zip_ohe', ohe, cat_cols)],
remainder='passthrough'
)
lin_reg = LinearRegression(n_jobs=-1)
pipe = Pipeline(steps=[('prep', preproc),
('model', lin_reg)])
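The effect of handle_unknown='ignore' is easiest to see on a toy frame. The following sketch uses made-up column names, ZIP codes, and prices (none of them come from the Zillow data) and mirrors the structure of the pipeline above: a ZIP that was never seen during fitting is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy parcels; ZIP codes and prices are made up
train = pd.DataFrame({
    'sqft': [1000, 1500, 2000, 1200, 1800, 2200],
    'zip':  ['90001', '90001', '90001', '90210', '90210', '90210'],
})
price = [200_000, 250_000, 300_000, 620_000, 680_000, 720_000]

toy_pipe = Pipeline([
    ('prep', ColumnTransformer(
        [('zip_ohe', OneHotEncoder(handle_unknown='ignore'), ['zip'])],
        remainder='passthrough')),
    ('model', LinearRegression()),
])
toy_pipe.fit(train, price)

# A parcel in a ZIP that was not present at fit time
unseen = pd.DataFrame({'sqft': [1600], 'zip': ['99999']})
row = toy_pipe.named_steps['prep'].transform(unseen)
row = row.toarray() if hasattr(row, 'toarray') else row  # densify if sparse

print(row)                       # ZIP columns first, then the passthrough sqft
print(toy_pipe.predict(unseen))  # prediction falls back to intercept + sqft term
```

For an unseen ZIP the model therefore predicts without any neighbourhood premium, which is worth keeping in mind when scoring parcels from zones absent from the training data.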
6. Train/test split & training
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)
7. Evaluation
R² vs MAE: R² is the share of variance in assessed value that the model explains, while MAE is the average absolute error in dollars; together the two metrics give a realistic picture of how useful the model is.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f}")
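Both metrics are simple enough to compute by hand. The sketch below uses four hypothetical assessed values and predictions (not taken from the dataset) and checks the manual formulas against scikit-learn's r2_score and mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical assessed values and predictions, in dollars
y_true = np.array([200_000.0, 300_000.0, 400_000.0, 500_000.0])
y_hat = np.array([210_000.0, 290_000.0, 430_000.0, 480_000.0])

# MAE: mean of the absolute dollar errors
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual: {mae_manual:,.0f}  sklearn: {mean_absolute_error(y_true, y_hat):,.0f}")
print(f"R2  manual: {r2_manual:.3f}  sklearn: {r2_score(y_true, y_hat):.3f}")
```

Here the errors are $10,000, $10,000, $30,000 and $20,000, so the MAE is $17,500, while R² comes out at 0.97: a high R² can coexist with dollar errors that still matter for an individual parcel.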
8. Coefficient inspection (top drivers)
Sparse matrix: the one-hot ZIP block has one column per distinct ZIP code but only a single "1" per row, so scikit-learn stores it as a sparse matrix to keep memory in check.
# Recover encoded feature names
zip_labels = pipe.named_steps['prep'].named_transformers_['zip_ohe']\
.get_feature_names_out(['zip'])
feature_names = list(zip_labels) + num_cols
coefs = pd.Series(pipe.named_steps['model'].coef_, index=feature_names)
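# Show ten largest absolute effects. Coefficients are in dollars per unit of
# each unscaled feature, so ZIP dummies and per-square-foot slopes are not
# directly comparable in magnitude.
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10))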
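The memory argument for the sparse one-hot block can be made concrete. This is a rough sketch on randomly generated zone ids (10,000 hypothetical parcels spread over up to 500 zones, not the real ZIP distribution) that compares the bytes held by the sparse matrix with what a dense array of the same shape would need.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Hypothetical zone ids: 10,000 parcels spread over up to 500 zones
zips = rng.integers(0, 500, size=(10_000, 1))

# OneHotEncoder returns a SciPy sparse matrix by default
encoded = OneHotEncoder(handle_unknown='ignore').fit_transform(zips).tocsr()

# Each row stores exactly one non-zero entry
nnz_per_row = np.diff(encoded.indptr)

# Bytes for a dense array of the same shape vs. the three CSR arrays
dense_bytes = encoded.shape[0] * encoded.shape[1] * encoded.dtype.itemsize
sparse_bytes = encoded.data.nbytes + encoded.indices.nbytes + encoded.indptr.nbytes

print(encoded.shape, nnz_per_row.min(), nnz_per_row.max())
print(f"dense: {dense_bytes:,} bytes, sparse: {sparse_bytes:,} bytes")
```

The gap widens with the number of distinct zones, since the dense size grows with every new column while the sparse size depends only on the number of rows.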
9. Persist pipeline
Pipeline persistence safeguards preprocessing order—vital when tomorrow’s batch scoring script loads unseen parcels.
joblib.dump(pipe, 'property_tax_value_linreg.pkl')
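A later scoring script would load the saved file and call predict directly, without rebuilding the preprocessing. The sketch below illustrates the round trip with a small stand-in pipeline and a temporary file; the toy data, the 'sqft' column, and the file name toy_linreg.pkl are all hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the fitted pipeline from the steps above
toy = pd.DataFrame({'sqft': [1000.0, 1500.0, 2000.0, 2500.0]})
toy_value = [200_000.0, 300_000.0, 400_000.0, 500_000.0]
fitted = Pipeline([('scale', StandardScaler()),
                   ('model', LinearRegression())]).fit(toy, toy_value)

# Save and reload, as a separate batch-scoring script would
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_linreg.pkl')
    joblib.dump(fitted, path)
    reloaded = joblib.load(path)

# The reloaded object carries both the scaler and the model
new_batch = pd.DataFrame({'sqft': [1800.0]})
print(fitted.predict(new_batch), reloaded.predict(new_batch))
```

Pickled pipelines are generally tied to the scikit-learn version that produced them, so it is prudent to record that version alongside the file.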
Summary
With nothing more exotic than pandas and scikit‑learn, you now have a working, explainable baseline for predicting property‑tax assessments across millions of U.S. homes. The model’s coefficients double as a heat‑map of value drivers—unlocking quick audits for over‑ or under‑assessed parcels. Keep this pipeline as a benchmark: when you later roll in gradient‑boosted trees, neighbourhood median income, or satellite imagery, you can quantify precisely how much each layer tightens the dollar‑error band.