Air Quality Prediction using Quantile Regression in ML
Environmental planners and public health officials need to understand not just the average concentration of pollutants like PM₂.₅, but also the variability under different meteorological conditions—anticipating both clean‐air days (10th percentile) and hazardous events (90th percentile). In this project, we will predict the 10th, 50th, and 90th quantiles of PM₂.₅ concentration (µg/m³) based on co‑pollutants and weather features (PM₁₀, NO₂, O₃, SO₂, CO, temperature, humidity, wind speed). By fitting separate linear quantile regression models, we’ll reveal how drivers influence lower‑, median‑, and upper‑tail pollution levels—informing targeted interventions for air‑quality management.
Libraries Required
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss
Dataset
Step-by-Step Code Implementation
Load & Inspect Data
We import a 10,000‑row global air‑quality dataset containing hourly readings of PM₂.₅, co‑pollutants (PM₁₀, NO₂, O₃, SO₂, CO), and meteorological variables (temperature, humidity, wind speed). Initial .info() and .describe() calls verify data types, ranges, and missingness.
# Load the Global Air Quality dataset (10,000 records)
df = pd.read_csv("global_air_quality_data_10000.csv")
# Inspect structure and key columns
print(df.head())
print(df.info())
print(df.describe())
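Because .describe() only hints at missingness through its count row, an explicit per-column tally can make the check easier to read. The following is a minimal sketch on a tiny stand-in frame with made-up values (the name toy and its contents are assumptions for illustration); on the real data the same two calls would be applied to df.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (made-up values); the real df comes from the CSV above
toy = pd.DataFrame({
    "PM2.5": [12.0, np.nan, 48.5, 30.1],
    "Humidity": [55.0, 60.0, np.nan, np.nan],
    "WindSpeed": [3.2, 4.1, 2.8, 5.0],
})

missing = toy.isna().sum()           # count of missing values per column
share = toy.isna().mean().round(2)   # fraction of rows missing per column
print(pd.DataFrame({"n_missing": missing, "share": share}))
```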
Preprocessing & Feature Selection
- We drop any records missing core pollutant or weather measurements to ensure clean modeling.
- We rename PM2.5 to PM25 and WindSpeed to Wind: a column name containing a dot cannot be used as a bare variable name in a formula, and the shorter names keep the formula readable.
- The predictor list, features, comprises the five co‑pollutants and three weather variables; the response is hourly PM₂.₅ concentration.
# Drop any rows with missing values in core features
df = df.dropna(subset=[
    'PM2.5', 'PM10', 'NO2', 'O3', 'SO2', 'CO',
    'Temperature', 'Humidity', 'WindSpeed'
])
# Rename columns for ease of use
df = df.rename(columns={
    'PM2.5': 'PM25',
    'WindSpeed': 'Wind'
})
# Define predictor list and response
features = ['PM10','NO2','O3','SO2','CO','Temperature','Humidity','Wind']
data = df[features + ['PM25']]
Train/Test Split
We randomly reserve 20% of the data for out‑of‑sample evaluation, so we can check whether our quantile models generalize to unseen conditions.
# Hold out 20% for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)
Fit Quantile Regression Models
For each quantile (10th, 50th, 90th percentiles):
- We build a formula string linking PM25 to all predictors.
- We fit a QuantReg model at that percentile using statsmodels.
- We then extract and print the coefficient table, revealing how each feature’s effect on PM₂.₅ differs across the pollution distribution (e.g., humidity might suppress high‑end PM₂.₅ more than median levels).
quantiles = [0.1, 0.5, 0.9]
results = {}
formula = "PM25 ~ " + " + ".join(features)
for q in quantiles:
    model = smf.quantreg(formula, train)
    res = model.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])  # show coefficient table only
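To make concrete what each fit targets: quantile regression picks coefficients that minimize the average pinball (check) loss at level q. The sketch below is not part of the pipeline above. It assumes synthetic right‑skewed data and a constant‑only model with no predictors, purely for illustration, and shows that the constant minimizing that loss lands on the empirical q‑th quantile. The helper name pinball_loss is hypothetical.

```python
import numpy as np

def pinball_loss(y, pred, q):
    """Average pinball (check) loss of predictions `pred` at quantile level q."""
    diff = np.asarray(y) - np.asarray(pred)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Synthetic right-skewed sample standing in for PM2.5 readings (illustrative only)
rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=20.0, size=2000)

# Grid-search the best constant prediction for each quantile level
candidates = np.linspace(y.min(), y.max(), 2001)
for q in (0.1, 0.5, 0.9):
    losses = [pinball_loss(y, c, q) for c in candidates]
    best = candidates[int(np.argmin(losses))]
    print(f"q={q}: loss-minimizing constant={best:.1f}, "
          f"empirical quantile={np.quantile(y, q):.1f}")
```

With predictors in the model, the same loss is minimized over a coefficient vector rather than a single constant, which is why the fitted slopes can differ from one quantile to the next.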
Evaluation with Pinball Loss
- We predict quantile‑specific PM₂.₅ values on the held‑out test set.
- We compute pinball loss for each quantile. Pinball loss is a proper scoring rule for quantile forecasts: it averages asymmetrically weighted penalties for under‑ and over‑prediction at each percentile. Lower losses indicate better quantile forecasts.
for q, res in results.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['PM25'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
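Pinball loss folds calibration and sharpness into a single number, so a direct coverage check can be a useful companion: for a well‑calibrated level‑q forecast, roughly a fraction q of held‑out observations should fall at or below the prediction. Because the three models are fitted separately, it may also be worth counting rows where the predicted quantiles come out in the wrong order. The sketch below runs on toy arrays with made‑up numbers; the helper names coverage and crossing_rate are hypothetical, and in the pipeline above the inputs would be test['PM25'] and the per‑quantile predictions.

```python
import numpy as np

def coverage(y_true, y_pred):
    """Fraction of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))

def crossing_rate(preds_by_q):
    """Fraction of rows whose predicted quantiles decrease as q increases."""
    levels = sorted(preds_by_q)
    stacked = np.column_stack([np.asarray(preds_by_q[q]) for q in levels])
    return float(np.mean(np.any(np.diff(stacked, axis=1) < 0, axis=1)))

# Toy example (made-up numbers, not results from the dataset)
y_true = np.array([12.0, 30.0, 45.0, 60.0, 85.0])
preds = {
    0.1: np.array([10.0, 20.0, 25.0, 30.0, 40.0]),
    0.5: np.array([15.0, 35.0, 40.0, 55.0, 70.0]),
    0.9: np.array([30.0, 60.0, 38.0, 90.0, 110.0]),  # third row dips below the median
}
for q, p in preds.items():
    print(f"q={q}: empirical coverage = {coverage(y_true, p):.2f}")
print(f"crossing rate = {crossing_rate(preds):.2f}")
```

Coverage far from q at a given level suggests that quantile is biased, while a non‑zero crossing rate flags rows where the three forecasts are mutually inconsistent.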
Summary
By applying quantile regression to air‑quality data, we capture the heterogeneous impacts of meteorology and co‑pollutants across the PM₂.₅ distribution. For instance, wind speed may strongly reduce extreme PM₂.₅ spikes (e.g., the 90th percentile) but have less effect on median levels.
Modeling the 10th, 50th, and 90th percentiles separately yields distribution‑aware forecasts. These let environmental agencies prepare for best‑case clean‑air periods, typical conditions, and worst‑case pollution events, and so target air‑quality alerts, emission controls, and public health advisories more precisely.