Noise Pollution Prediction using Linear Regression in ML


Urban-planning teams, public-health officials, and smart-city start-ups all need a quick way to forecast outdoor noise levels so they can design quiet zones, optimise traffic flow, and schedule construction work.

In this tutorial, we build a linear‑regression baseline that predicts a monitoring station’s average A‑weighted sound level (LAeq, in dB) from readily available covariates: city, land‑use zone (commercial / residential / industrial), month of year, day type (the Day.Type column, whose sample values are Day / Night), and concurrent meteorological conditions (temperature and wind speed). The fitted coefficients show which factors are associated with higher or lower decibel levels, giving engineers a transparent yardstick before they deploy spatio-temporal or deep-audio models.

Libraries Required

pandas # tabular wrangling
numpy # numerical helpers
matplotlib.pyplot # quick sanity plots (optional)
scikit‑learn # preprocessing, model, metrics
joblib # persist the trained pipeline

Dataset Link

Noise Monitoring Data in India

Step-by-Step Code Implementation

Why linear regression? As a first approximation, average LAeq can be modelled as the sum of a location effect and a weather effect. City and Area.Type act as proxies for traffic volume, which the dataset does not record directly, and temperature and wind speed carry the meteorological part. A straight-line fit expresses each effect as a coefficient in decibels, which city engineers can read directly.
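Written out, the model fitted in the steps below has roughly the following additive form. The symbols are notation introduced here for illustration: the indicator terms are the one-hot columns, and z(·) is the standardized value of a numeric column. There are no interaction terms, so a City × Area.Type effect is not part of this baseline.

```latex
\widehat{L}_{\mathrm{Aeq}} = \beta_0
  + \sum_{c} \beta_{c}\,\mathbf{1}[\text{City}=c]
  + \sum_{a} \beta_{a}\,\mathbf{1}[\text{Area.Type}=a]
  + \sum_{d} \beta_{d}\,\mathbf{1}[\text{Day.Type}=d]
  + \beta_{M}\,z(\text{Month})
  + \beta_{T}\,z(\text{Temperature.C})
  + \beta_{W}\,z(\text{WindSpeed.mps})
```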

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the Data

We’ll use the open “Noise Monitoring Data in India (2011‑2018)” corpus, which aggregates monthly LAeq values for 70+ stations in seven large cities.

df = pd.read_csv("noise_data.csv")
print(df.head())

Key columns

Column	Sample values
City	Delhi / Kolkata …
Area.Type	Residential / Commercial / Industrial
Month	1 … 12
Day.Type	Day / Night
Temperature.C	22.4
WindSpeed.mps	3.1
LAeq.dB	71.3 (target)
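Before going further, it can help to confirm that the file really contains the columns listed above. The snippet below is a minimal sketch: the helper name missing_columns and the small demo frame are made up for illustration and stand in for the real CSV.

```python
import pandas as pd

# Columns the rest of the tutorial relies on (copied from the table above).
EXPECTED = ['City', 'Area.Type', 'Month', 'Day.Type',
            'Temperature.C', 'WindSpeed.mps', 'LAeq.dB']

def missing_columns(frame, expected=EXPECTED):
    """Return the expected column names that are absent from `frame`."""
    return [c for c in expected if c not in frame.columns]

# Tiny made-up frame standing in for noise_data.csv;
# WindSpeed.mps is left out on purpose to show the check firing.
demo = pd.DataFrame({
    'City': ['Delhi', 'Kolkata'],
    'Area.Type': ['Residential', 'Commercial'],
    'Month': [1, 7],
    'Day.Type': ['Day', 'Night'],
    'Temperature.C': [22.4, 30.1],
    'LAeq.dB': [71.3, 68.0],
})

print(missing_columns(demo))
```

On the real data you would call missing_columns(df) right after read_csv and stop if the list is not empty.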

3. Minimal cleaning & feature block

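Standard scaling (applied inside the pipeline in step 4) puts the month index, temperature, and wind speed on unit variance, so their coefficients are directly comparable (dB per 1 σ change). Note that Month enters as a plain number from 1 to 12, so the model fits a single linear trend across the year and treats December and January as far apart.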

core = ['LAeq.dB','City','Area.Type','Month','Day.Type',
        'Temperature.C','WindSpeed.mps']
df   = df.dropna(subset=core).copy()

num_cols = ['Month','Temperature.C','WindSpeed.mps']
cat_cols = ['City','Area.Type','Day.Type']
target   = 'LAeq.dB'

X = df[num_cols + cat_cols]
y = df[target]
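One optional variation, not used in the rest of this tutorial, is to encode the month as a point on a circle so that December and January end up as neighbours. The sketch below assumes only a Month column with values 1 to 12; the helper name and the Month.sin / Month.cos column names are hypothetical.

```python
import numpy as np
import pandas as pd

def add_cyclical_month(frame, month_col='Month'):
    """Return a copy with the month mapped onto the unit circle (sin/cos pair)."""
    out = frame.copy()
    angle = 2 * np.pi * (out[month_col] - 1) / 12
    out['Month.sin'] = np.sin(angle)
    out['Month.cos'] = np.cos(angle)
    return out

enc = add_cyclical_month(pd.DataFrame({'Month': [1, 6, 12]}))
pts = enc[['Month.sin', 'Month.cos']].to_numpy()

# Distances in the encoded space: January vs December, January vs June
d_jan_dec = np.linalg.norm(pts[0] - pts[2])
d_jan_jun = np.linalg.norm(pts[0] - pts[1])
print(f"Jan-Dec: {d_jan_dec:.3f}   Jan-Jun: {d_jan_jun:.3f}")
```

If you try this, the two new columns would replace Month in num_cols; whether it improves the fit is an empirical question for your data.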

4. Pre‑processing & linear‑regression pipeline

One-hot encoding prevents any artificial numeric order between cities or land-use zones while assigning each class a distinct dB offset.

preproc = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline([
        ('prep',  preproc),
        ('model', linreg)
])

5. Train‑test split & training

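Each row is a monthly aggregate, so a shuffled split is a reasonable starting point. Keep in mind that a random split puts neighbouring months from the same stations on both sides, so the test score describes interpolation within the observed period rather than forecasting of future months.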

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
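If the goal is to judge forecasting of future months, one alternative is to hold out the most recent rows instead of a random sample. The sketch below assumes a Year column, which is not among the key columns listed in this tutorial and is therefore hypothetical here; the function name and demo values are also made up.

```python
import pandas as pd

def chronological_split(frame, year_col='Year', month_col='Month', test_frac=0.2):
    """Hold out the most recent `test_frac` of rows instead of a random sample."""
    ordered = frame.sort_values([year_col, month_col], kind='stable')
    cut = int(round(len(ordered) * (1 - test_frac)))
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Made-up rows; 'Year' is an assumed column
demo = pd.DataFrame({
    'Year':  [2011] * 5 + [2012] * 5,
    'Month': [1, 2, 3, 4, 5] * 2,
    'LAeq.dB': [70.1, 69.8, 71.0, 72.3, 71.5, 70.4, 70.0, 71.2, 72.8, 71.9],
})
train_part, test_part = chronological_split(demo)
print(len(train_part), len(test_part))
print(test_part[['Year', 'Month']].values.tolist())
```

A score from this kind of split is usually lower than the shuffled one, and closer to what a dashboard predicting next month would see.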

6. Evaluation

Performance metrics: R² is the fraction of variance in the held-out LAeq values that the model explains; MAE in dB tells regulators the typical size of a prediction error (for example, an MAE of 2.3 dB means predictions miss by 2.3 dB on average).

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} dB")
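For readers who want to see what the two metrics compute, here is a small worked example on four made-up values, checked against the scikit-learn functions used above. The numbers are invented and are not results from the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

y_true = np.array([70.0, 72.0, 68.0, 74.0])   # made-up LAeq values in dB
y_hat  = np.array([71.0, 71.0, 69.0, 73.0])   # made-up predictions

# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)             # 4.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # 20.0
r2_manual = 1 - ss_res / ss_tot

# MAE = mean of the absolute errors, in the same unit as the target (dB)
mae_manual = np.mean(np.abs(y_true - y_hat))

print(f"R² manual {r2_manual:.3f} | sklearn {r2_score(y_true, y_hat):.3f}")
print(f"MAE manual {mae_manual:.2f} | sklearn {mean_absolute_error(y_true, y_hat):.2f}")
```

An R² of 0 corresponds to always predicting the mean of the test targets, which is a useful floor to keep in mind when reading the score.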

7. Interpret loudness drivers

The coefficient table highlights candidate levers: if, for example, Area.Type_Industrial came out ≈ 8 dB above the other zones and a one‑σ increase in wind speed subtracted ≈ 1 dB, you would know where mitigation (barriers, zoning) matters most.

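# Recover the one-hot column names from the fitted encoder
ohe_feats = (pipe.named_steps['prep']
                 .named_transformers_['cat']
                 .get_feature_names_out(cat_cols))

# 'cat' precedes 'num' in preproc, so coefficients come as one-hot columns, then numerics
all_feats = list(ohe_feats) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
        .sort_values())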

print("\nNoise‑reducing factors (negative coefficients):")
print(coef.head(6))

print("\nNoise‑increasing factors (positive coefficients):")
print(coef.tail(6))

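Because numerics are z‑scored, each numeric coefficient reads as the dB change for a one‑σ shift in that feature. The one‑hot coefficients need more care: OneHotEncoder is used here without the drop option, so every level gets its own column and there is no reference level. Together with the intercept these columns are collinear, so the individual values are not uniquely determined; only differences between levels of the same variable (for example, Industrial minus Residential) should be read as dB offsets.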
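A per‑σ coefficient can be converted back to raw units by dividing it by the standard deviation the scaler learned (its scale_ attribute). The sketch below demonstrates this on synthetic data where the true slope is known to be 0.5 dB per °C; the data and the pipe_demo name are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
temp = rng.uniform(10, 40, size=200).reshape(-1, 1)   # synthetic temperatures in °C
laeq = 60 + 0.5 * temp.ravel()                         # synthetic rule: +0.5 dB per °C

pipe_demo = Pipeline([('scale', StandardScaler()),
                      ('model', LinearRegression())])
pipe_demo.fit(temp, laeq)

coef_per_sigma = pipe_demo.named_steps['model'].coef_[0]   # dB per 1 σ of temperature
sigma = pipe_demo.named_steps['scale'].scale_[0]           # σ of temperature in °C
coef_per_degree = coef_per_sigma / sigma                   # dB per °C

print(f"{coef_per_sigma:.3f} dB per σ, σ = {sigma:.3f} °C, "
      f"{coef_per_degree:.3f} dB per °C")
```

In the tutorial's pipeline the scaler sits inside the ColumnTransformer, so the equivalent lookup would be pipe.named_steps['prep'].named_transformers_['num'].scale_, with entries in the order of num_cols.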

8. Persist the trained pipeline

Joblib persistence bundles the preprocessing steps and the fitted coefficients in one file, so tomorrow’s dashboard can load the .pkl file, feed in new weather, calendar, and zone data, and return a decibel forecast in milliseconds.

joblib.dump(pipe, "noise_pollution_linreg.pkl")

Summary

With roughly 50 lines of Python, we transformed raw monitoring logs into an explainable noise‑pollution forecaster:

Instant LAeq estimates help planners schedule roadwork and enforce zoning rules.
Transparent dB levers show exactly how land use, seasonality, and weather affect ambient sound levels.

Keep this linear baseline as your yardstick; when you pivot to spatio‑temporal kriging, gradient‑boosted trees, or deep spectrogram models, you’ll know precisely how much additional predictive punch each layer of complexity delivers.
