Air Quality Prediction using Quantile Regression in ML


Environmental planners and public health officials need to understand not just the average concentration of pollutants like PM₂.₅, but also how concentrations vary under different meteorological conditions, from clean‑air days (around the 10th percentile) to hazardous events (around the 90th percentile). In this project, we predict the 10th, 50th, and 90th percentiles of PM₂.₅ concentration (µg/m³) from co‑pollutants and weather features (PM₁₀, NO₂, O₃, SO₂, CO, temperature, humidity, wind speed). By fitting a separate linear quantile regression model for each quantile, we see how each driver relates to lower‑tail, median, and upper‑tail pollution levels, which can inform targeted interventions for air‑quality management.

Libraries Required

import pandas as pd  
import numpy as np  
import statsmodels.formula.api as smf     # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss

Dataset

Global Air Quality Dataset

Step-by-Step Code Implementation

Load & Inspect Data

We import a 10,000‑row global air‑quality dataset containing hourly readings of PM₂.₅, co‑pollutants (PM₁₀, NO₂, O₃, SO₂, CO), and meteorological variables (temperature, humidity, wind speed). Initial .info() and .describe() calls verify data types, ranges, and missingness.

# Load the Global Air Quality dataset (10,000 records)
df = pd.read_csv("global_air_quality_data_10000.csv")

# Inspect structure and key columns
print(df.head())
df.info()   # info() prints its own report and returns None, so no print() is needed
print(df.describe())
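
The missingness check mentioned above can also be made explicit with a per-column count. The following is a minimal sketch on a made-up four-row frame; the helper name missing_summary and the toy values are assumptions for illustration and are not part of the original project.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real CSV (values are invented for illustration)
toy = pd.DataFrame({
    "PM2.5": [12.0, np.nan, 48.5, 30.1],
    "PM10": [20.0, 35.2, np.nan, 41.0],
    "WindSpeed": [3.1, 2.4, 5.0, np.nan],
    "Temperature": [18.0, 21.5, 19.2, 25.0],
})

def missing_summary(frame):
    """Return per-column missing counts and percentages, worst columns first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100 * counts / len(frame),
    })
    return summary.sort_values("n_missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

On the real data, the same helper could be called as missing_summary(df) before deciding which rows to drop.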

Preprocessing & Feature Selection

We drop any records missing core pollutant or weather measurements to ensure clean modeling.
We rename PM2.5 to PM25 because a name containing a dot cannot be used directly in a statsmodels formula, and shorten WindSpeed to Wind for brevity.
The predictor list, features, comprises the five co‑pollutants and three weather variables; the response is hourly PM₂.₅ concentration.

# Drop any rows with missing values in core features
df = df.dropna(subset=[
    'PM2.5','PM10','NO2','O3','SO2','CO',
    'Temperature','Humidity','WindSpeed'
])

# Rename columns for ease of use
df = df.rename(columns={
    'PM2.5': 'PM25',
    'WindSpeed': 'Wind'
})

# Define predictor list and response
features = ['PM10','NO2','O3','SO2','CO','Temperature','Humidity','Wind']
data = df[features + ['PM25']]

Train/Test Split

We randomly reserve 20% of the data for out‑of‑sample evaluation, so we can check that the quantile models generalize to unseen conditions.

# Hold out 20% for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)

Fit Quantile Regression Models

For each quantile (10th, 50th, 90th percentiles):

We build a formula string linking PM25 to all predictors.
We fit a QuantReg model at that percentile using statsmodels.
We then extract and print the coefficient table, revealing how each feature’s effect on PM₂.₅ differs across the pollution distribution (e.g., humidity might suppress high‑end PM₂.₅ more than median levels).

quantiles = [0.1, 0.5, 0.9]
results   = {}
formula   = "PM25 ~ " + " + ".join(features)

for q in quantiles:
    model = smf.quantreg(formula, train)
    res   = model.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])   # show coefficient table only
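
Reading three separate coefficient tables makes it hard to see how an effect changes across quantiles, so it can help to line the coefficients up side by side. The sketch below does this on synthetic data, not on the air-quality CSV: the data-generating numbers, the two-predictor formula, and the names demo, fits, and coef_table are all assumptions chosen so that the PM10 slope grows from the lower to the upper tail.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption): noise spread grows with PM10,
# so the PM10 slope should be larger at the 90th percentile than at the 10th.
rng = np.random.default_rng(0)
n = 2000
wind = rng.uniform(0, 10, n)
pm10 = rng.uniform(10, 100, n)
noise = rng.normal(0, 1, n) * (0.1 * pm10)
pm25 = 5 + 0.5 * pm10 - 1.0 * wind + noise
demo = pd.DataFrame({"PM25": pm25, "PM10": pm10, "Wind": wind})

# Fit one quantile regression per quantile, as in the loop above
fits = {
    q: smf.quantreg("PM25 ~ PM10 + Wind", demo).fit(q=q)
    for q in (0.1, 0.5, 0.9)
}

# One column per quantile, one row per coefficient
coef_table = pd.DataFrame({f"q={q}": fit.params for q, fit in fits.items()})
print(coef_table.round(3))
```

With the fitted models from the project, the equivalent table would be pd.DataFrame({q: res.params for q, res in results.items()}).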

Evaluation with Pinball Loss

We predict quantile‑specific PM₂.₅ values on the held‑out test set.
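We compute the pinball loss for each quantile. Pinball loss is a proper scoring rule for quantile forecasts: it averages asymmetric penalties for under‑ and over‑prediction, weighted according to the target quantile. Lower losses indicate better quantile forecasts, though losses are best compared between models at the same quantile rather than across quantiles.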

for q, res in results.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['PM25'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
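
To make the metric concrete, the pinball loss can be written out by hand. The sketch below uses only NumPy and four made-up observations; the function names pinball_loss and coverage are hypothetical helpers, and the coverage check (the share of observations at or below the predicted quantile, which should be close to q for a well-calibrated model) is an extra diagnostic that the original project does not compute.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss: under-prediction costs q per unit, over-prediction costs (1 - q)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def coverage(y_true, y_pred):
    """Fraction of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))

# Made-up observations and predictions (assumption, for illustration only)
y = np.array([10.0, 20.0, 30.0, 40.0])
pred = np.array([12.0, 25.0, 28.0, 50.0])

loss_q90 = pinball_loss(y, pred, q=0.9)
loss_q10 = pinball_loss(y, pred, q=0.1)
cov = coverage(y, pred)

print(f"pinball loss at q=0.9: {loss_q90:.3f}")
print(f"pinball loss at q=0.1: {loss_q10:.3f}")
print(f"coverage: {cov:.2f}")
```

These predictions mostly sit above the observations, so they are penalized lightly as a 90th-percentile forecast and heavily as a 10th-percentile forecast.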

Summary

By applying quantile regression to air‑quality data, we capture the heterogeneous impacts of meteorology and co‑pollutants across the PM₂.₅ distribution. For instance, wind speed may strongly reduce extreme PM₂.₅ spikes (e.g., the 90th percentile) but have less effect on median levels.

Modelling the 10th, 50th, and 90th percentiles separately yields distribution‑aware forecasts. Environmental agencies can use them to prepare for best‑case clean‑air periods, typical conditions, and worst‑case pollution events, and to target air‑quality alerts, emission controls, and public health advisories accordingly.
