Noise Pollution Prediction using Linear Regression in ML


Urban-planning teams, public-health officials, and smart-city start-ups all need a quick way to forecast outdoor noise levels so they can design quiet zones, optimise traffic flow, and schedule construction work.

In this tutorial, we build a linear‑regression baseline that predicts a monitoring station’s average A‑weighted sound level (LAeq, in dB) from readily available covariates: city, land‑use zone (commercial / residential / industrial), month of year, day type (day vs night, as recorded in the Day.Type column), and concurrent meteorological conditions (temperature and wind speed). The fitted coefficients show which factors are associated with higher or lower decibel levels, giving engineers a transparent yardstick before they deploy spatio-temporal or deep-audio models.

Libraries Required

  • pandas # tabular wrangling
  • numpy # numerical helpers
  • matplotlib.pyplot # quick sanity plots (optional)
  • scikit‑learn # preprocessing, model, metrics
  • joblib # persist the trained pipeline

Dataset Link

Noise Monitoring Data in India

Step-by-Step Code Implementation

Why linear regression? The model assumes that each factor shifts the average LAeq by a fixed number of decibels: City and Area.Type act as rough proxies for traffic and activity levels, while temperature and wind speed capture the weather. That additive, straight-line assumption is only an approximation, but it expresses every effect directly in decibels, so city engineers can read the coefficients at a glance.

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the Data

We’ll use the open “Noise Monitoring Data in India (2011‑2018)” corpus, which aggregates monthly LAeq values for 70+ stations in seven large cities.

df = pd.read_csv("noise_data.csv")
print(df.head())

Key columns

column           sample values
City             Delhi / Kolkata …
Area.Type        Residential / Commercial / Industrial
Month            1 … 12
Day.Type         Day / Night
Temperature.C    22.4
WindSpeed.mps    3.1
LAeq.dB          71.3 (target)
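Before going further, it can help to confirm that the loaded file really contains these columns. The sketch below is one minimal way to do that; the helper name missing_columns and the tiny demo frame are illustrative assumptions, not part of the dataset or of pandas.

```python
import pandas as pd

# Column names the tutorial relies on (taken from the "Key columns" list).
EXPECTED = ["City", "Area.Type", "Month", "Day.Type",
            "Temperature.C", "WindSpeed.mps", "LAeq.dB"]

def missing_columns(frame, expected=EXPECTED):
    """Return the expected column names that are absent from `frame`."""
    return [c for c in expected if c not in frame.columns]

# Tiny stand-in frame with made-up values so the sketch runs without the CSV.
demo = pd.DataFrame({
    "City": ["Delhi"], "Area.Type": ["Residential"], "Month": [1],
    "Day.Type": ["Day"], "Temperature.C": [22.4], "LAeq.dB": [71.3],
})

# WindSpeed.mps was left out of the demo frame on purpose.
print(missing_columns(demo))
```

On the real data you would call missing_columns(df) right after pd.read_csv and stop early if the list is not empty.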

3. Minimal cleaning & feature block

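Standard scaling rescales the month index, temperature, and wind speed to zero mean and unit variance, so their coefficients are directly comparable (dB per 1 σ change). Note that Month enters as a plain numeric index, so the model can only fit a straight-line trend from January to December, not a seasonal cycle.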

core = ['LAeq.dB','City','Area.Type','Month','Day.Type',
        'Temperature.C','WindSpeed.mps']
df   = df.dropna(subset=core).copy()

num_cols = ['Month','Temperature.C','WindSpeed.mps']
cat_cols = ['City','Area.Type','Day.Type']
target   = 'LAeq.dB'

X = df[num_cols + cat_cols]
y = df[target]
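To make the scaling step concrete, here is a small sketch of what StandardScaler does to a single numeric column. The temperature values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up temperatures in °C; scikit-learn expects a 2-D array (rows x columns).
temps = np.array([[18.0], [22.0], [26.0], [30.0], [34.0]])

scaler = StandardScaler().fit(temps)
z = scaler.transform(temps)

# The same z-scores by hand: subtract the mean, divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (temps - temps.mean()) / temps.std()

print("mean :", scaler.mean_[0])
print("sigma:", scaler.scale_[0])
print("z    :", z.ravel())
```

After scaling, a value of +1 means "one standard deviation above the average temperature", which is the unit the regression coefficient for that column is expressed in.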

4. Pre‑processing & linear‑regression pipeline

One-hot encoding prevents any artificial numeric order between cities or land-use zones while assigning each class a distinct dB offset.

preproc = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline([
        ('prep',  preproc),
        ('model', linreg)
])

5. Train‑test split & training

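Because rows are monthly aggregates rather than a continuous time series, a shuffled split is acceptable for a baseline. Keep in mind that it places different months from the same station in both train and test, so the scores may be somewhat optimistic compared with a split by station or by year.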

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
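A single 80/20 split gives only one estimate of the error. As an optional check, the sketch below runs 5-fold cross-validation on a pipeline of the same shape. It uses synthetic stand-in data (the zone offsets, temperature slope, and noise level are assumptions chosen for the demo), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: zone offset + small temperature slope + noise.
rng = np.random.default_rng(0)
n = 200
zones = rng.choice(["Residential", "Commercial", "Industrial"], size=n)
temp = rng.uniform(10, 40, size=n)
offset = pd.Series(zones).map(
    {"Residential": 0.0, "Commercial": 5.0, "Industrial": 8.0}).to_numpy()
y_demo = 55 + offset + 0.1 * temp + rng.normal(0, 1.0, size=n)
X_demo = pd.DataFrame({"Area.Type": zones, "Temperature.C": temp})

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Area.Type"]),
        ("num", StandardScaler(), ["Temperature.C"]),
    ])),
    ("model", LinearRegression()),
])

# Shuffled 5-fold CV; scikit-learn reports MAE as a negative score.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mae = cross_val_score(demo_pipe, X_demo, y_demo, cv=cv,
                          scoring="neg_mean_absolute_error")
fold_mae = -neg_mae
print("MAE per fold:", fold_mae.round(2))
print("Mean MAE    :", round(fold_mae.mean(), 2), "dB")
```

With the tutorial's own objects, the equivalent call would be cross_val_score(pipe, X, y, cv=cv, scoring="neg_mean_absolute_error"). A large spread between folds is a hint that one split alone is not a reliable estimate.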

6. Evaluation

Performance metrics: R² is the share of variance in the held-out LAeq values that the model explains (1.0 is perfect, 0 is no better than always predicting the mean); MAE in dB tells regulators the typical size of a prediction error (e.g., about 2.3 dB).

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} dB")
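To show what the two metrics measure, the sketch below recomputes them by hand on four made-up readings and compares the result with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Four made-up LAeq readings and predictions, in dB.
y_true = np.array([70.0, 65.0, 72.0, 60.0])
y_hat = np.array([69.0, 66.0, 70.0, 62.0])

# MAE: average absolute miss, here (1 + 1 + 2 + 2) / 4 = 1.5 dB.
mae = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (squared error of the model / squared error of the mean).
ss_res = np.sum((y_true - y_hat) ** 2)          # 10.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 86.75
r2 = 1 - ss_res / ss_tot                        # about 0.885

print(f"MAE by hand: {mae:.2f} dB, sklearn: {mean_absolute_error(y_true, y_hat):.2f} dB")
print(f"R²  by hand: {r2:.3f},    sklearn: {r2_score(y_true, y_hat):.3f}")
```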

7. Interpret loudness drivers

The coefficient table highlights actionable levers: if Area.Type_Industrial adds ≈ 8 dB and high wind subtracts ≈ 1 dB, you know where mitigation (barriers, zoning) matters most.

ohe_feats = pipe.named_steps['prep']\
                .named_transformers_['cat']\
                .get_feature_names_out(cat_cols)

all_feats = list(ohe_feats) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
        .sort_values())

# Filter by sign so each list only shows coefficients that match its label.
print("\nNoise‑reducing factors (negative coefficients):")
print(coef[coef < 0].head(6))

print("\nNoise‑increasing factors (positive coefficients):")
print(coef[coef > 0].tail(6))

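Because numerics are z‑scored, each numeric coefficient reads as the dB change for a one‑σ shift in that feature. The one‑hot coefficients are per‑level dB offsets, but since the encoder above keeps every level (no category is dropped), there is no single reference level: compare levels of the same feature by the difference between their coefficients rather than reading any one value as an offset from a baseline.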
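If you would rather report a numeric effect in original units (dB per °C instead of dB per σ), divide the coefficient by the scaler's standard deviation. The sketch below demonstrates this on made-up, noise-free data where the true slope is 0.5 dB per °C by construction; the step names "scale" and "model" belong to this demo pipeline only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data with an exact slope of 0.5 dB per °C.
temp = np.array([[10.0], [20.0], [30.0], [40.0]])
laeq = 60.0 + 0.5 * temp.ravel()

demo = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
]).fit(temp, laeq)

sigma = demo.named_steps["scale"].scale_[0]         # std dev of temperature
coef_per_sigma = demo.named_steps["model"].coef_[0] # dB per one-σ shift
coef_per_degree = coef_per_sigma / sigma            # dB per °C

print(f"σ = {sigma:.2f} °C")
print(f"{coef_per_sigma:.2f} dB per σ  ->  {coef_per_degree:.2f} dB per °C")
```

In the tutorial's pipeline the matching scaler would be pipe.named_steps['prep'].named_transformers_['num'], whose scale_ array follows the order of num_cols.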

8. Persist the trained pipeline

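Joblib persistence bundles the preprocessing steps and the fitted coefficients in one file, so a dashboard can later load the .pkl file, feed in new weather, calendar, and zone data, and return a decibel forecast in milliseconds. Load it with the same scikit-learn version used for training, since pickled estimators are not guaranteed to work across versions.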

joblib.dump(pipe, "noise_pollution_linreg.pkl")

Summary

With roughly 50 lines of Python, we transformed raw monitoring logs into an explainable noise‑pollution forecaster:

  • Instant LAeq estimates help planners schedule roadwork and enforce zoning rules.
  • Transparent dB levers show exactly how land use, seasonality, and weather affect ambient sound levels.

Keep this linear baseline as your yardstick; when you pivot to spatio‑temporal kriging, gradient‑boosted trees, or deep spectrogram models, you’ll know precisely how much additional predictive punch each layer of complexity delivers.


ProjectGurukul Team

ProjectGurukul Team specializes in creating project-based learning resources for programming, Java, Python, Android, AI, web development, and machine learning. Our mission is to help learners build practical skills through engaging, hands-on projects. We also offer free major and minor projects with source code for engineering students.
