Emission Level Prediction using Linear Regression in ML
Governments and automakers track tailpipe CO₂ emissions to set efficiency targets, levy taxes, and guide R&D. Given a car’s readily available specifications—engine size, number of cylinders, vehicle class, transmission type, and fuel type—we want to predict its certified CO₂ emission level (g km‑¹).
A transparent linear-regression baseline tells engineers exactly how strongly each specification nudges emissions and serves as a reality check before rolling out more complex, non-linear models.
Libraries Required
- pandas # tabular wrangling
- numpy # numeric helpers
- matplotlib.pyplot # quick EDA plots
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained pipeline
Dataset Link
CO2 Emission of Vehicles in Canada
Step-by-Step Code Implementation
Why linear regression? Within normal engine‑design ranges, tailpipe CO₂ scales roughly linearly with displacement and cylinder count. Transmission and fuel type can then be modeled as additive categorical offsets on top of that trend, much like per‑category emission factors.
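Written out, the model this reasoning suggests looks roughly like the following. The symbols are notation introduced here for illustration: ŷ is the predicted CO₂ in g/km, x_eng and x_cyl are engine size and cylinder count, and the sum runs over every vehicle-class, transmission, and fuel-type level.

```latex
\hat{y} \;=\; \beta_0
  + \beta_{\text{eng}}\, x_{\text{eng}}
  + \beta_{\text{cyl}}\, x_{\text{cyl}}
  + \sum_{k} \gamma_k \,\mathbb{1}[\text{vehicle has category } k]
```

Each γ_k is the additive offset for one category level, and the two β terms carry the linear trend in the numeric specs.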
1. Import Libraries
import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.preprocessing import OneHotEncoder, StandardScaler from sklearn.compose import ColumnTransformer from sklearn.pipeline import Pipeline from sklearn.linear_model import LinearRegression from sklearn.metrics import r2_score, mean_absolute_error import joblib
2. Load data
(Download the CSV, unzip, then point to the path.)
df = pd.read_csv("CO2_Emissions_Canada.csv")
print(df.head())
Expected key columns
| column | sample contents |
| Make | Ford, BMW … |
| Vehicle Class | SUV, Compact … |
| Engine Size(L) | 3.5 |
| Cylinders | 6 |
| Transmission | A6, M5 … |
| Fuel Type | X, Z, D, E … |
| CO2 Emissions(g/km) | 256 |
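Before going further, it can help to confirm that the loaded frame really contains these columns, since header spellings vary between copies of the dataset. A minimal sketch, using a toy frame in place of the real CSV; `REQUIRED_COLS` and `missing_columns` are hypothetical names introduced here, not part of pandas:

```python
import pandas as pd

# Columns the rest of this tutorial relies on (names taken from the table above).
REQUIRED_COLS = [
    "Vehicle Class", "Engine Size(L)", "Cylinders",
    "Transmission", "Fuel Type", "CO2 Emissions(g/km)",
]

def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLS) -> list:
    """Return the required column names that are absent from `frame`."""
    return [c for c in required if c not in frame.columns]

# Toy frame standing in for the real CSV; "Fuel Type" is deliberately left out.
toy = pd.DataFrame({
    "Make": ["Ford"], "Vehicle Class": ["SUV"], "Engine Size(L)": [3.5],
    "Cylinders": [6], "Transmission": ["A6"], "CO2 Emissions(g/km)": [256],
})
print(missing_columns(toy))
```

Calling `missing_columns(df)` right after `pd.read_csv` surfaces a renamed or missing header before it turns into a `KeyError` further down.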
3. Minimal cleaning
df = df.dropna(subset=['Engine Size(L)', 'Cylinders',
                       'Vehicle Class', 'Transmission',
                       'Fuel Type', 'CO2 Emissions(g/km)'])
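To see what `dropna(subset=...)` does and does not remove, here is a small sketch on made-up rows. Only gaps in the listed columns cause a row to be dropped; a gap in an unlisted column such as `Make` is tolerated.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Engine Size(L)": [2.0, np.nan, 3.5],
    "Cylinders": [4, 6, 6],
    "Make": ["Ford", "BMW", None],  # not in the subset, so a gap here is kept
})
needed = ["Engine Size(L)", "Cylinders"]

before = len(toy)
clean = toy.dropna(subset=needed)
print(f"dropped {before - len(clean)} of {before} rows")
```

Logging the dropped-row count like this on the real data is a cheap way to notice if cleaning discards more than expected.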
4. Define features & label
Standard scaling puts engine size and cylinders on comparable variance, so coefficients read as g km‑¹ per 1 σ change—easy to discuss in engineering reviews.
num_cols = ['Engine Size(L)', 'Cylinders']
cat_cols = ['Vehicle Class', 'Transmission', 'Fuel Type']
target = 'CO2 Emissions(g/km)'
X = df[num_cols + cat_cols]
y = df[target]
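The "per 1 σ" reading can be converted back to physical units by dividing a coefficient by the scaler's standard deviation. The sketch below shows this on synthetic data with an assumed slope of 35 g/km per litre; the numbers are illustrative and do not come from the Canadian dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
engine = rng.uniform(1.0, 6.0, size=200)  # synthetic engine sizes in litres
co2 = 120.0 + 35.0 * engine               # assumed exact linear relation

scaler = StandardScaler()
z = scaler.fit_transform(engine.reshape(-1, 1))  # standardized engine size
model = LinearRegression().fit(z, co2)

per_sigma = model.coef_[0]                # g/km per one standard deviation
per_litre = per_sigma / scaler.scale_[0]  # back to g/km per litre
print(f"per sigma: {per_sigma:.2f} g/km, per litre: {per_litre:.2f} g/km")
```

In the full pipeline the same division applies to the two numeric coefficients, using the `scale_` attribute of the fitted `StandardScaler`.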
5. Pre‑processing & model pipeline
One‑hot encoding treats class, gearbox, and fuel as nominal categories rather than imposing a fake numeric order.
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline([
    ('prep', preproc),
    ('model', linreg)
])
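The `handle_unknown='ignore'` setting matters at prediction time: a category the encoder never saw during fitting is encoded as all zeros instead of raising an error. A small sketch, where the fuel code "H" is a made-up value used only to trigger that case:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["X"], ["Z"], ["D"]]))  # fuel codes seen during fitting

known = enc.transform(np.array([["Z"]])).toarray()
unseen = enc.transform(np.array([["H"]])).toarray()  # "H" was never seen

print(enc.categories_[0])  # learned categories, sorted alphabetically
print(known)               # a single 1 in the "Z" column
print(unseen)              # all zeros: no category offset is applied
```

For the regression this means an unseen category contributes no offset at all, so predictions for such vehicles rest on the intercept and the remaining features alone and should be treated with caution.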
6. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)
7. Performance metrics
R² tells us the share of emission variance captured; MAE shows the typical absolute error—e.g., ±12 g km‑¹.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.1f} g/km")
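Both metrics are easy to recompute by hand, which makes their meaning concrete. The sketch below uses four made-up emission values and checks the manual formulas against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true and predicted emissions in g/km.
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_hat = np.array([210.0, 240.0, 310.0, 330.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot
mae_manual = np.mean(np.abs(y_true - y_hat))

print(f"R2  manual={r2_manual:.3f}  sklearn={r2_score(y_true, y_hat):.3f}")
print(f"MAE manual={mae_manual:.1f}  sklearn={mean_absolute_error(y_true, y_hat):.1f}")
```

R² compares the model's squared error with that of always predicting the mean, while MAE stays in the target's own units (g/km).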
8. Inspect emission drivers
The coefficient table surfaces the main levers: a large positive weight for “SUV” or “Automatic‑8‑speed” quantifies that category's CO₂ penalty relative to the other categories, arming product planners with hard numbers.
# recover one‑hot names
ohe_feats = (
    pipe.named_steps['prep']
        .named_transformers_['cat']
        .get_feature_names_out(cat_cols)
)
# one-hot features come first because 'cat' precedes 'num' in the ColumnTransformer
all_feats = list(ohe_feats) + num_cols
coef = pd.Series(pipe.named_steps['model'].coef_, index=all_feats).sort_values()
print("\nSpecs that lower emissions (negative coeffs):")
print(coef.head(10))
print("\nSpecs that raise emissions (positive coeffs):")
print(coef.tail(10))
9. Persist the pipeline
Pipeline persistence freezes both the encoder and regression weights, allowing tomorrow’s design prototype to be scored instantly without retraining.
joblib.dump(pipe, "co2_emission_linreg.pkl")
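Reloading and scoring a new design might look like the sketch below. It trains a tiny stand-in pipeline on made-up rows so that it runs on its own; the file name `toy_pipeline.pkl` and the prototype's specs are illustrative assumptions.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows standing in for the real dataset.
train = pd.DataFrame({
    "Engine Size(L)": [1.6, 2.0, 3.5, 5.0],
    "Fuel Type": ["X", "X", "Z", "Z"],
    "CO2 Emissions(g/km)": [150, 180, 260, 340],
})
pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Fuel Type"]),
        ("num", StandardScaler(), ["Engine Size(L)"]),
    ])),
    ("model", LinearRegression()),
])
pipe.fit(train[["Engine Size(L)", "Fuel Type"]], train["CO2 Emissions(g/km)"])

# Save to a temporary folder, then load it back as a fresh object.
path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.pkl")
joblib.dump(pipe, path)
loaded = joblib.load(path)

# A new design is passed as a DataFrame with the same column names used in training.
prototype = pd.DataFrame({"Engine Size(L)": [2.4], "Fuel Type": ["X"]})
print(loaded.predict(prototype))
```

Two practical notes: only load pickle files from sources you trust, and load them with the same scikit-learn version that saved them, since compatibility across versions is not guaranteed.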
Summary
With fewer than a hundred lines of Python, we converted raw vehicle specs into an explainable CO₂‑emission estimator.
The linear model:
- Delivers fast, reproducible emission forecasts for early‑stage design or fleet compliance checks.
- Reveals transparent marginal impacts—showing exactly how much each litre of engine size or each gearbox type nudges emissions up or down.
Keep this interpretable baseline as your benchmark; when you later incorporate hybrid‑electric indicators, curb weight, or switch to gradient‑boosted trees, you’ll know precisely how much real predictive power the extra complexity adds.
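One way to make that benchmark comparison concrete is to cross-validate a mean-only predictor, the linear model, and a more complex model on identical folds. The sketch below uses synthetic data with an assumed linear relation plus noise, so the printed numbers are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(1.0, 6.0, size=(300, 1))                      # synthetic engine sizes
y = 120.0 + 35.0 * X[:, 0] + rng.normal(0.0, 10.0, size=300)  # assumed relation plus noise

models = {
    "mean-only": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "boosted trees": GradientBoostingRegressor(random_state=0),
}

mae = {}
for name, model in models.items():
    # scikit-learn returns negated MAE so that higher is better; flip the sign back.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    mae[name] = -scores.mean()
    print(f"{name:14s} MAE = {mae[name]:.1f} g/km")
```

On the real data, the gap between the "linear" and "boosted trees" rows is the predictive power bought by the extra complexity.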