Noise Pollution Prediction using Linear Regression in ML
Urban-planning teams, public-health officials, and smart-city start-ups all need a quick way to forecast outdoor noise levels so they can design quiet zones, optimise traffic flow, and schedule construction work.
In this tutorial, we build a linear‑regression baseline that predicts a monitoring station’s average A‑weighted sound level (dB LAeq) from readily available covariates: city, land‑use zone (commercial / residential / industrial), month of year, day type (day vs night, as coded in the Day.Type column), and concurrent meteorological conditions (temperature and wind speed). The fitted coefficients show which factors raise or lower decibel levels, giving engineers a transparent yardstick before they deploy spatio-temporal or deep-audio models.
Libraries Required
- pandas # tabular wrangling
- numpy # numerical helpers
- matplotlib.pyplot # quick sanity plots (optional)
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained pipeline
Dataset Link
Noise Monitoring Data in India
Step-by-Step Code Implementation
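Why linear regression? Within normal urban conditions, monthly average LAeq varies fairly smoothly with the drivers in our feature set: City and Area.Type act as proxies for traffic volume, while temperature and wind speed capture meteorological effects. A straight-line fit expresses each effect directly in decibels, so city engineers can interpret the coefficients at a glance. Note that the model built here is purely additive; it does not include a City × Area.Type interaction term.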
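Written out, the additive model implied by the feature list looks roughly like the equation below. The symbols are our own notation, not taken from the dataset documentation: α, γ, and δ are the one-hot offsets for city, zone, and day type, and each z is a standardised numeric feature.

```latex
\hat{L}_{\mathrm{Aeq}}
  = \beta_0
  + \alpha_{\mathrm{city}}
  + \gamma_{\mathrm{zone}}
  + \delta_{\mathrm{day}}
  + \beta_M z_M + \beta_T z_T + \beta_W z_W ,
\qquad
z_j = \frac{x_j - \mu_j}{\sigma_j}
```

Here M, T, and W stand for Month, Temperature.C, and WindSpeed.mps, and μ_j and σ_j are the training-set mean and standard deviation of feature j.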
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the Data
We’ll use the open “Noise Monitoring Data in India (2011‑2018)” corpus, which aggregates monthly LAeq values for 70+ stations in seven large cities.
df = pd.read_csv("noise_data.csv")
print(df.head())
Key columns
| column | sample values |
| --- | --- |
| City | Delhi / Kolkata … |
| Area.Type | Residential / Commercial / Industrial |
| Month | 1 … 12 |
| Day.Type | Day / Night |
| Temperature.C | 22.4 |
| WindSpeed.mps | 3.1 |
| LAeq.dB | 71.3 (target) |
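Before cleaning, it can help to confirm that the loaded frame actually contains these columns. The sketch below is one way to do that; `check_schema` is a hypothetical helper, and the small `demo` frame is made up to stand in for the result of `pd.read_csv("noise_data.csv")`.

```python
import pandas as pd

# Column names copied from the "Key columns" table in this tutorial.
EXPECTED = ["City", "Area.Type", "Month", "Day.Type",
            "Temperature.C", "WindSpeed.mps", "LAeq.dB"]

def check_schema(frame, expected=EXPECTED):
    """Return the expected column names that are missing from `frame`."""
    return [col for col in expected if col not in frame.columns]

# Made-up rows, used only to illustrate the check (not real measurements).
demo = pd.DataFrame({
    "City": ["Delhi", "Kolkata"],
    "Area.Type": ["Residential", "Industrial"],
    "Month": [1, 7],
    "Day.Type": ["Day", "Night"],
    "Temperature.C": [22.4, 30.1],
    "WindSpeed.mps": [3.1, 1.8],
    "LAeq.dB": [71.3, 64.0],
})

missing = check_schema(demo)
print("Missing columns:", missing)
```

If the list is non-empty, the CSV probably uses different header names and the column lists in the next step need adjusting.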
3. Minimal cleaning & feature block
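Standard scaling puts the month index, temperature, and wind speed on a common scale (zero mean, unit variance), so their coefficients are directly comparable (dB per 1 σ change). Note that feeding Month in as a number assumes a linear trend from January to December.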
core = ['LAeq.dB', 'City', 'Area.Type', 'Month', 'Day.Type',
        'Temperature.C', 'WindSpeed.mps']
df = df.dropna(subset=core).copy()
num_cols = ['Month','Temperature.C','WindSpeed.mps']
cat_cols = ['City','Area.Type','Day.Type']
target = 'LAeq.dB'
X = df[num_cols + cat_cols]
y = df[target]
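A "dB per 1 σ" coefficient can be converted back to physical units by dividing it by the scaler's standard deviation. The sketch below checks this on made-up temperature data (the numbers are illustrative assumptions, not values from the dataset): the rescaled coefficient matches the slope of a fit on the raw, unscaled feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up temperatures (deg C) and a made-up LAeq target with some noise.
temp_c = rng.normal(25.0, 6.0, size=200).reshape(-1, 1)
laeq = 60.0 + 0.3 * temp_c.ravel() + rng.normal(0.0, 1.0, size=200)

# Fit once on the z-scored feature and once on the raw feature.
scaler = StandardScaler().fit(temp_c)
model_z = LinearRegression().fit(scaler.transform(temp_c), laeq)
model_raw = LinearRegression().fit(temp_c, laeq)

db_per_sigma = model_z.coef_[0]                   # dB per one-sigma change
db_per_degree = db_per_sigma / scaler.scale_[0]   # dB per deg C

print(f"dB per sigma : {db_per_sigma:.3f}")
print(f"dB per deg C : {db_per_degree:.3f}")
```

The same division applies to each numeric column in the pipeline, using the corresponding entry of the fitted scaler's `scale_` attribute.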
4. Pre‑processing & linear‑regression pipeline
One-hot encoding prevents any artificial numeric order between cities or land-use zones while assigning each class a distinct dB offset.
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline([
    ('prep', preproc),
    ('model', linreg)
])
5. Train‑test split & training
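Rows are monthly aggregates, so a shuffled random split is acceptable for a baseline. Keep in mind that it mixes earlier and later months across the train and test sets, so it measures interpolation rather than true forward forecasting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)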
pipe.fit(X_train, y_train)
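A single 80/20 split gives only one estimate of the error. As an optional complement, k-fold cross-validation averages over several splits. The sketch below runs it on a made-up frame that reuses the tutorial's column names; `make_demo_frame` is a hypothetical helper and its numbers are not real measurements.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["Month", "Temperature.C", "WindSpeed.mps"]
cat_cols = ["City", "Area.Type", "Day.Type"]

def make_demo_frame(n=300, seed=0):
    """Build a made-up frame with the tutorial's column names."""
    rng = np.random.default_rng(seed)
    zones = rng.choice(["Residential", "Commercial", "Industrial"], size=n)
    offset = pd.Series(zones).map(
        {"Residential": 0.0, "Commercial": 5.0, "Industrial": 10.0}).to_numpy()
    temp = rng.normal(25.0, 6.0, size=n)
    frame = pd.DataFrame({
        "City": rng.choice(["Delhi", "Kolkata"], size=n),
        "Area.Type": zones,
        "Day.Type": rng.choice(["Day", "Night"], size=n),
        "Month": rng.integers(1, 13, size=n),
        "Temperature.C": temp,
        "WindSpeed.mps": rng.uniform(0.0, 8.0, size=n),
    })
    frame["LAeq.dB"] = 60.0 + offset + 0.2 * temp + rng.normal(0.0, 1.0, size=n)
    return frame

demo = make_demo_frame()
pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])),
    ("model", LinearRegression()),
])

# Five shuffled folds; scikit-learn reports MAE as a negative score.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, demo[num_cols + cat_cols], demo["LAeq.dB"],
                         cv=cv, scoring="neg_mean_absolute_error")
mae_per_fold = -scores
print(f"MAE per fold: {np.round(mae_per_fold, 2)}")
print(f"Mean MAE    : {mae_per_fold.mean():.2f} dB")
```

Because preprocessing lives inside the pipeline, the scaler and encoder are refit on each training fold, which avoids leaking test-fold statistics.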
6. Evaluation
Performance metrics: R² is the fraction of the variance in LAeq across the held-out rows that the model explains; MAE in dB tells regulators the typical size of a prediction error (for example, an MAE of 2.3 dB means predictions are off by about 2.3 dB on average).
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} dB")
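An MAE figure is easier to judge next to a trivial reference. One common choice, sketched below on made-up data, is scikit-learn's `DummyRegressor`, which always predicts the training mean. The frame and its values are illustrative assumptions, and only two of the tutorial's features are used to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400

# Made-up data: zone offsets plus a temperature effect plus noise.
zones = rng.choice(["Residential", "Commercial", "Industrial"], size=n)
offset = pd.Series(zones).map(
    {"Residential": 0.0, "Commercial": 5.0, "Industrial": 10.0}).to_numpy()
temp = rng.normal(25.0, 6.0, size=n)
X = pd.DataFrame({"Area.Type": zones, "Temperature.C": temp})
y = 60.0 + offset + 0.2 * temp + rng.normal(0.0, 1.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Area.Type"]),
        ("num", StandardScaler(), ["Temperature.C"]),
    ])),
    ("model", LinearRegression()),
]).fit(X_train, y_train)

# Reference model: ignores the features and predicts the training mean.
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)

mae_model = mean_absolute_error(y_test, model.predict(X_test))
mae_dummy = mean_absolute_error(y_test, dummy.predict(X_test))
print(f"Linear model MAE : {mae_model:.2f} dB")
print(f"Mean-only MAE    : {mae_dummy:.2f} dB")
```

If the linear model barely beats the mean-only reference on the real data, the covariates carry little signal and the coefficient table should be read with caution.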
7. Interpret loudness drivers
The coefficient table highlights actionable levers: if Area.Type_Industrial adds ≈ 8 dB and high wind subtracts ≈ 1 dB, you know where mitigation (barriers, zoning) matters most.
# Feature order follows the ColumnTransformer: one-hot columns first, then numerics.
ohe_feats = (pipe.named_steps['prep']
             .named_transformers_['cat']
             .get_feature_names_out(cat_cols))
all_feats = list(ohe_feats) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
        .sort_values())
print("\nNoise‑reducing factors (negative coefficients):")
print(coef.head(6))
print("\nNoise‑increasing factors (positive coefficients):")
print(coef.tail(6))
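Because numerics are z‑scored, each numeric coefficient reads as the dB change for a one‑σ shift in that feature. The one‑hot coefficients are dB offsets, but since the encoder keeps every category (no level is dropped), there is no explicit reference level and the individual offsets are not uniquely determined alongside the intercept. Compare levels of the same feature by their difference (for example, Industrial minus Residential).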
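Whether an explicit reference level exists depends on the encoder settings. As a hedged illustration on made-up data, the sketch below fits one model with every level kept and another with `drop="first"`, which makes the alphabetically first level the reference. The difference between two coefficients in the first model matches the single coefficient in the second.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
n = 300

# Made-up zones, temperatures, and target (not real measurements).
zones = rng.choice(["Commercial", "Industrial", "Residential"],
                   size=n).reshape(-1, 1)
temp = rng.normal(25.0, 6.0, size=(n, 1))
true_offset = {"Commercial": 5.0, "Industrial": 10.0, "Residential": 0.0}
y = (60.0 + np.array([true_offset[z] for z in zones.ravel()])
     + 0.2 * temp.ravel() + rng.normal(0.0, 1.0, size=n))

# Model 1: full one-hot encoding, every level kept.
enc_full = OneHotEncoder()
X_full = np.hstack([enc_full.fit_transform(zones).toarray(), temp])
coef_full = LinearRegression().fit(X_full, y).coef_

# Model 2: drop the first level ("Commercial"), which becomes the reference.
enc_ref = OneHotEncoder(drop="first")
X_ref = np.hstack([enc_ref.fit_transform(zones).toarray(), temp])
coef_ref = LinearRegression().fit(X_ref, y).coef_

levels = list(enc_full.categories_[0])
diff_full = (coef_full[levels.index("Industrial")]
             - coef_full[levels.index("Commercial")])
industrial_vs_ref = coef_ref[0]  # Industrial relative to Commercial

print(f"Industrial - Commercial (full one-hot): {diff_full:.3f} dB")
print(f"Industrial coefficient (drop='first') : {industrial_vs_ref:.3f} dB")
```

Either encoding yields the same predictions; the choice only changes how the individual coefficients should be read.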
8. Persist the trained pipeline
Joblib persistence bundles the preprocessing steps and the fitted coefficients in one file, so a dashboard can later load the .pkl file, feed in new weather, calendar, and zone data, and return a decibel estimate in milliseconds.
joblib.dump(pipe, "noise_pollution_linreg.pkl")
Summary
With fewer than 50 lines of Python, we turned monthly monitoring data into an explainable noise‑pollution baseline:
- Instant LAeq estimates help planners schedule roadwork and enforce zoning rules.
- Transparent dB levers show exactly how land use, seasonality, and weather affect ambient sound levels.
Keep this linear baseline as your yardstick; when you pivot to spatio‑temporal kriging, gradient‑boosted trees, or deep spectrogram models, you’ll know precisely how much additional predictive punch each layer of complexity delivers.