Property Tax Value Prediction using Linear Regression in ML

County assessors in the United States publish a “property tax assessed value” for every parcel. Banks, insurers, and investors rely on that figure when pricing mortgages, estimating risk, or bundling real‑estate‑backed securities—yet many counties refresh the assessment only once a year.

This mini-project builds a linear-regression baseline that predicts a home’s current assessed value (in US dollars) from publicly available features such as living area, lot size, number of bedrooms, year built, and neighbourhood ZIP code. A fast, transparent model like this flags mis‑priced parcels for manual review and offers a yardstick before deploying heavier ML pipelines.

Libraries Required

pandas # data wrangling
numpy # vector maths
matplotlib.pyplot # sanity plots
seaborn # quick EDA heatmaps (optional)
scikit‑learn # preprocessing, model, metrics
joblib # save the trained pipeline

Dataset Link

Zillow Prize – Home Value Prediction

Step-by-Step Implementation

Why linear regression? Tax assessors typically apply additive adjustments to a base land value—bedrooms add value, age subtracts—making a linear form a sensible first approximation.
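
As a rough illustration of that additive form, the sketch below prices one hypothetical home as a base value plus per-feature adjustments. Every dollar figure, and the names `base_value`, `adjustments`, and `home`, are invented for illustration; none come from an assessor's schedule or from the dataset.

```python
# Hypothetical additive valuation: every number here is made up for illustration.
base_value = 150_000          # assumed base land value in USD
adjustments = {               # assumed USD change per unit of each feature
    "living_area_sqft": 120,  # per square foot
    "bedrooms": 8_000,        # per bedroom
    "age_years": -500,        # per year of age
}
home = {"living_area_sqft": 1_800, "bedrooms": 3, "age_years": 40}

# A linear model has exactly this shape: intercept + sum(coefficient * feature).
estimate = base_value + sum(adjustments[k] * home[k] for k in adjustments)
print(f"Estimated assessed value: ${estimate:,}")
```

Linear regression learns the intercept and the per-feature adjustments from data instead of taking them from a hand-written table.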

1. Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load data

One-hot encoding the ZIP codes (applied in step 5) removes any false numeric ordering between postal zones; each zone gets its own coefficient, which captures the local market premium.

cols = ['parcelid', 'yearbuilt', 'bedroomcnt', 'bathroomcnt',
        'calculatedfinishedsquarefeet', 'lotsizesquarefeet',
        'taxvaluedollarcnt', 'regionidzip']

# properties_2016.csv has no 'propertytaxvaluedollarcnt' column;
# the total assessed value is stored in 'taxvaluedollarcnt'
df = pd.read_csv('properties_2016.csv', usecols=cols, low_memory=False)

3. Minimal cleaning

Percentile trimming drops rows whose tax value falls in the top or bottom 1 %, so rare typos ($100 M ranches or $0 listings) do not skew the fit.

# rename for clarity
df = df.rename(columns={'taxvaluedollarcnt': 'target_tax_value'})

# drop rows with a missing target or a missing feature
# (LinearRegression cannot fit on NaN values)
df = df.dropna(subset=['target_tax_value',
                       'yearbuilt',
                       'calculatedfinishedsquarefeet',
                       'lotsizesquarefeet',
                       'bedroomcnt',
                       'bathroomcnt',
                       'regionidzip'])

# trim extreme outliers (optional, keeps model stable)
df = df[df['target_tax_value'].between(df['target_tax_value'].quantile(0.01),
                                       df['target_tax_value'].quantile(0.99))]

4. Feature & label definition

# The structure and land values are components of the target, and the tax
# amount is derived from it, so all three are left out to avoid target leakage.
num_cols = ['yearbuilt',
            'bedroomcnt', 'bathroomcnt',
            'calculatedfinishedsquarefeet', 'lotsizesquarefeet']

cat_cols = ['regionidzip']   # ZIP as categorical

X = df[num_cols + cat_cols]
y = df['target_tax_value']
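
Before moving on to the pipeline, the percentile filter from step 3 can be sanity-checked on a toy series. The sketch below uses made-up values, and 10 %/90 % cut-offs only because the series is tiny. It contrasts trimming (dropping rows, as step 3 does) with clipping (capping values while keeping every row).

```python
import pandas as pd

# Toy assessed values with one implausibly low and one implausibly high entry.
values = pd.Series([0, 180_000, 210_000, 250_000, 300_000, 320_000,
                    410_000, 450_000, 520_000, 100_000_000])

# 10th/90th percentiles here; the tutorial uses the 1st/99th on the full data.
low, high = values.quantile(0.10), values.quantile(0.90)

trimmed = values[values.between(low, high)]   # drops rows outside the band
clipped = values.clip(lower=low, upper=high)  # keeps every row, caps the extremes

print(len(values), len(trimmed), len(clipped))
```

Trimming shrinks the dataset, so it is worth printing the row count before and after to confirm that only a small tail was removed.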

5. Pre‑processing + model pipeline

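# sparse output is the default; the old `sparse=` argument was renamed to
# `sparse_output=` in scikit-learn 1.2 and no longer exists in current releases
ohe = OneHotEncoder(handle_unknown='ignore')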

preproc = ColumnTransformer(
        transformers=[('zip_ohe', ohe, cat_cols)],
        remainder='passthrough'
)

lin_reg = LinearRegression(n_jobs=-1)

pipe = Pipeline(steps=[('prep', preproc),
                      ('model', lin_reg)])
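
One way to gain confidence in this pipeline shape is to fit it on synthetic data whose true coefficients are known. The sketch below is such a check under stated assumptions: the columns `sqft` and `zone`, the `premium` table, and all numbers are hypothetical and unrelated to the Zillow data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "sqft": rng.uniform(500, 4000, n),
    "zone": rng.choice(["A", "B", "C"], n),
})
# Assumed ground truth: $100 per square foot plus a fixed premium per zone.
premium = {"A": 0.0, "B": 50_000.0, "C": 120_000.0}
toy["value"] = 100.0 * toy["sqft"] + toy["zone"].map(premium) + rng.normal(0, 1_000, n)

toy_pipe = Pipeline(steps=[
    ("prep", ColumnTransformer(
        transformers=[("zone_ohe", OneHotEncoder(handle_unknown="ignore"), ["zone"])],
        remainder="passthrough")),
    ("model", LinearRegression()),
])
toy_pipe.fit(toy[["sqft", "zone"]], toy["value"])

# Coefficient order: one-hot columns (A, B, C) first, then the passthrough column.
coefs = toy_pipe.named_steps["model"].coef_
sqft_coef = coefs[-1]
print(f"Recovered $/sqft: {sqft_coef:.1f}")
print(f"Recovered B-vs-A premium: {coefs[1] - coefs[0]:,.0f}")
```

Because a full one-hot block is collinear with the intercept, individual zone coefficients are only meaningful relative to one another; the differences between zones are what the model pins down.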

6. Train/test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

pipe.fit(X_train, y_train)

7. Evaluation

R² vs MAE: R² is the share of variance in assessed value that the model explains, while MAE is the average absolute error in dollars, which is easier to weigh against a real-world cost. Read together, the two metrics give a realistic picture of how useful the model is.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f}")
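
To make the two metrics concrete, the sketch below computes both by hand on four made-up predictions and compares the results with scikit-learn's functions. The values in `y_true` and `y_hat` are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up assessed values and predictions, in USD.
y_true = np.array([200_000, 300_000, 400_000, 500_000], dtype=float)
y_hat = np.array([210_000, 280_000, 430_000, 480_000], dtype=float)

# MAE: average size of the miss, in dollars.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE: ${mae_manual:,.0f}  R²: {r2_manual:.3f}")
print(mean_absolute_error(y_true, y_hat), r2_score(y_true, y_hat))
```

Here the average miss is $20,000 while R² is 0.964: a high R² can coexist with dollar errors that matter for an individual parcel.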

8. Coefficient inspection (top drivers)

Sparse matrix: the one-hot ZIP block has one column per distinct ZIP code but only a single "1" per row, so scikit-learn stores it as a sparse matrix to keep memory in check.

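# Recover encoded feature names: one-hot columns come first, passthrough columns after.
# get_feature_names_out() takes no argument here; passing a name other than
# the fitted column ('regionidzip') raises a ValueError.
zip_labels = (pipe.named_steps['prep']
                  .named_transformers_['zip_ohe']
                  .get_feature_names_out())
feature_names = list(zip_labels) + num_cols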

coefs = pd.Series(pipe.named_steps['model'].coef_, index=feature_names)
# Show ten largest absolute effects
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index)[:10])
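
The memory argument for the sparse one-hot block can be checked with a quick sketch. The 100,000 rows and 400 zone codes below are arbitrary assumptions chosen for illustration, not counts from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# 100,000 rows drawn from 400 hypothetical zone codes.
codes = rng.integers(0, 400, size=(100_000, 1))

sparse_block = OneHotEncoder(handle_unknown="ignore").fit_transform(codes)
n_rows, n_cols = sparse_block.shape

# CSR storage keeps only the non-zero values plus two index arrays.
sparse_bytes = (sparse_block.data.nbytes + sparse_block.indices.nbytes
                + sparse_block.indptr.nbytes)
# Size the same block would need as a dense array of the same dtype.
dense_bytes = n_rows * n_cols * sparse_block.dtype.itemsize

print(f"stored non-zeros: {sparse_block.nnz:,}")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB, dense: {dense_bytes / 1e6:.1f} MB")
```

The gap widens as the number of distinct codes grows, because the dense size scales with the column count while the sparse size scales with the row count.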

9. Persist pipeline

Persisting the whole pipeline, not just the model, keeps the fitted encoder and the column order together with the coefficients, which matters when tomorrow's batch-scoring script loads unseen parcels.

joblib.dump(pipe, 'property_tax_value_linreg.pkl')
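
A minimal sketch of the round trip a scoring script would perform. It uses a tiny stand-in pipeline so that it runs on its own; the columns `sqft` and `zone`, the file name `demo_linreg.pkl`, and all values are placeholders rather than part of the tutorial's dataset.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in for the fitted pipeline; column names and values are placeholders.
train = pd.DataFrame({
    "sqft": [900.0, 1500.0, 2100.0, 2800.0],
    "zone": ["A", "B", "A", "B"],
})
target = pd.Series([190_000.0, 330_000.0, 410_000.0, 590_000.0])

demo_pipe = Pipeline(steps=[
    ("prep", ColumnTransformer(
        transformers=[("zone_ohe", OneHotEncoder(handle_unknown="ignore"), ["zone"])],
        remainder="passthrough")),
    ("model", LinearRegression()),
])
demo_pipe.fit(train, target)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_linreg.pkl")
    joblib.dump(demo_pipe, path)
    reloaded = joblib.load(path)  # only load pickle files from sources you trust

# New parcels need the same column names as the training frame; "Z" is an unseen zone.
new_parcels = pd.DataFrame({"sqft": [1200.0, 2500.0], "zone": ["B", "Z"]})
original_pred = demo_pipe.predict(new_parcels)
reloaded_pred = reloaded.predict(new_parcels)
print(reloaded_pred.round(0))
```

The reloaded object applies the same encoding and column order as the original, so the scoring script only needs to supply a DataFrame with matching column names. Loading the file with the same scikit-learn version that saved it avoids compatibility warnings.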

Summary

With nothing more exotic than pandas and scikit-learn, you now have a working, explainable baseline for predicting property-tax assessments across the millions of parcels in the Zillow data. The model's coefficients double as a ranked list of value drivers, which supports quick audits for over- or under-assessed parcels. Keep this pipeline as a benchmark: when you later add gradient-boosted trees, neighbourhood median income, or satellite imagery, you can measure how much each addition reduces the dollar error (MAE) on the same test split.
