Emission Level Prediction using Linear Regression in ML


Governments and automakers track tailpipe CO₂ emissions to set efficiency targets, levy taxes, and guide R&D. Given a car’s readily available specifications (engine size, number of cylinders, vehicle class, transmission type, and fuel type), we want to predict its certified CO₂ emission level in g/km.

A transparent linear-regression baseline tells engineers exactly how strongly each specification nudges emissions and serves as a reality check before rolling out more complex, non-linear models.

Libraries Required

pandas # tabular wrangling
numpy # numeric helpers
matplotlib.pyplot # quick EDA plots
scikit‑learn # preprocessing, model, metrics
joblib # persist the trained pipeline

Dataset Link

CO2 Emission of Vehicles in Canada

Step-by-Step Code Implementation

Why linear regression? Within normal engine-design ranges, tailpipe CO₂ scales roughly linearly with displacement and cylinder count. Treating transmission and fuel type as additive categorical offsets is a simple first approximation of how fuel-specific emission factors enter regulatory calculations.
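Written out, the additive structure described above looks roughly like the expression below. The symbols are notation introduced here for illustration only: z denotes a standardized numeric input, and each indicator is 1 when the vehicle belongs to that category and 0 otherwise.

```latex
\hat{y} \;=\; \beta_0
  \;+\; \beta_{\text{eng}}\, z_{\text{eng}}
  \;+\; \beta_{\text{cyl}}\, z_{\text{cyl}}
  \;+\; \sum_{c} \gamma_c\, \mathbb{1}[\text{class} = c]
  \;+\; \sum_{t} \delta_t\, \mathbb{1}[\text{transmission} = t]
  \;+\; \sum_{f} \phi_f\, \mathbb{1}[\text{fuel} = f]
```

Here ŷ is the predicted emission level in g/km. Every term contributes independently, which is what makes the coefficients readable one at a time.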

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load data

(Download the archive, unzip it, then point read_csv to the path of the extracted CSV.)

df = pd.read_csv("CO2_Emissions_Canada.csv")
print(df.head())

Expected key columns

column	sample contents
Make	Ford, BMW …
Vehicle Class	SUV, Compact …
Engine Size(L)	3.5
Cylinders	6
Transmission	A6, M5 …
Fuel Type	X, Z, D, E …
CO2 Emissions(g/km)	256
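Header spellings can differ between copies of the dataset, so it may be worth confirming that the columns listed above are present before going further. The sketch below is one way to do that; `missing_columns` is a hypothetical helper, and the small frame is made-up data standing in for the real CSV.

```python
import pandas as pd

# Columns the rest of the walkthrough relies on.
REQUIRED = ['Engine Size(L)', 'Cylinders', 'Vehicle Class',
            'Transmission', 'Fuel Type', 'CO2 Emissions(g/km)']

def missing_columns(frame, required=REQUIRED):
    """Return the required column names that are absent from `frame`."""
    return [c for c in required if c not in frame.columns]

# Made-up stand-in for the real CSV, deliberately lacking 'Fuel Type'.
demo = pd.DataFrame({
    'Engine Size(L)': [2.0, 3.5],
    'Cylinders': [4, 6],
    'Vehicle Class': ['COMPACT', 'SUV - SMALL'],
    'Transmission': ['M6', 'AS6'],
    'CO2 Emissions(g/km)': [196, 255],
})

print(missing_columns(demo))
```

On the real data, calling missing_columns(df) right after read_csv should return an empty list if the headers match the table.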

3. Minimal cleaning

df = df.dropna(subset=['Engine Size(L)', 'Cylinders',
                       'Vehicle Class', 'Transmission',
                       'Fuel Type', 'CO2 Emissions(g/km)'])
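Before dropping rows, it can be useful to see how much is actually missing and whether the file contains repeated rows. The sketch below uses a tiny made-up frame. Removing duplicates is an optional modelling choice added here for illustration, not part of the cleaning step above.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing engine size and one exact duplicate.
demo = pd.DataFrame({
    'Engine Size(L)': [2.0, 2.0, np.nan, 3.5],
    'Cylinders': [4, 4, 6, 6],
    'CO2 Emissions(g/km)': [196, 196, 240, 255],
})

na_counts = demo.isna().sum()            # missing values per column
n_dupes = int(demo.duplicated().sum())   # identical rows after the first copy

# Drop incomplete rows, then (optionally) exact duplicates.
cleaned = demo.dropna().drop_duplicates()

print(na_counts)
print("duplicate rows:", n_dupes)
print("rows kept:", len(cleaned))
```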

4. Define features & label

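Standard scaling (applied inside the pipeline in the next step) rescales engine size and cylinder count to zero mean and unit variance, so their coefficients read as g/km per one-standard-deviation (1 σ) change, which is easy to discuss in engineering reviews.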

num_cols = ['Engine Size(L)', 'Cylinders']
cat_cols = ['Vehicle Class', 'Transmission', 'Fuel Type']
target   = 'CO2 Emissions(g/km)'

X = df[num_cols + cat_cols]
y = df[target]
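To make the per-σ reading concrete, the sketch below fits a one-feature model on made-up numbers that lie exactly on the line y = 100 + 40·x, then converts the scaled coefficient back to g/km per litre by dividing by the scaler's standard deviation. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up engine sizes (L) and emissions (g/km) on the line y = 100 + 40 * x.
engine = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
co2 = 100 + 40 * engine.ravel()

scaler = StandardScaler().fit(engine)
z = scaler.transform(engine)               # z = (x - mean) / std

model = LinearRegression().fit(z, co2)
per_sigma = model.coef_[0]                 # g/km per 1 sigma of engine size
per_litre = per_sigma / scaler.scale_[0]   # back to g/km per litre

print(f"per sigma: {per_sigma:.2f} g/km")
print(f"per litre: {per_litre:.2f} g/km")
```

The same division works on the fitted pipeline: the scaler's scale_ attribute holds one standard deviation per numeric column, in the order of num_cols.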

5. Pre‑processing & model pipeline

One‑hot encoding treats class, gearbox, and fuel as nominal categories rather than imposing a fake numeric order.

preproc = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline([
        ('prep',  preproc),
        ('model', linreg)
])

6. Train‑test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

pipe.fit(X_train, y_train)

7. Performance metrics

R² is the share of emission variance the model explains on the held-out test set; MAE is the average absolute error in g/km (for example, an MAE of 12 means predictions miss by 12 g/km on average).

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.1f} g/km")
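For readers who want to see what the two metrics compute, the sketch below derives them by hand on four made-up values and compares the results with scikit-learn's functions. The numbers are illustrative only and are not results from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true and predicted emissions (g/km).
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_hat = np.array([210.0, 240.0, 310.0, 340.0])

# MAE: mean of the absolute residuals.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual {mae_manual:.1f} | sklearn {mean_absolute_error(y_true, y_hat):.1f}")
print(f"R2  manual {r2_manual:.3f} | sklearn {r2_score(y_true, y_hat):.3f}")
```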

8. Inspect emission drivers

The coefficient table surfaces the main levers: a large positive weight for “SUV” or “Automatic‑8‑speed” quantifies the CO₂ penalty associated with that specification, arming product planners with concrete numbers.

# recover the one-hot column names from the fitted encoder
ohe_feats = (pipe.named_steps['prep']
                 .named_transformers_['cat']
                 .get_feature_names_out(cat_cols))

# order matches the ColumnTransformer: categorical block first, then numeric
all_feats = list(ohe_feats) + num_cols
coef      = pd.Series(pipe.named_steps['model'].coef_,
                      index=all_feats).sort_values()

print("\nSpecs that lower emissions (negative coeffs):")
print(coef.head(10))

print("\nSpecs that raise emissions (positive coeffs):")
print(coef.tail(10))

9. Persist the pipeline

Pipeline persistence freezes the encoder, the scaler, and the regression weights together, allowing tomorrow’s design prototype to be scored instantly without retraining.

joblib.dump(pipe, "co2_emission_linreg.pkl")
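A saved pipeline is reloaded with joblib.load and scored on a one-row DataFrame that carries the same column names used in training. The sketch below is a self-contained round trip on made-up data: the toy rows, the temporary file name, and the new_car values are all illustrative assumptions.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ['Engine Size(L)', 'Cylinders']
cat_cols = ['Vehicle Class', 'Transmission', 'Fuel Type']

# Made-up training rows standing in for the real dataset.
toy_X = pd.DataFrame({
    'Engine Size(L)': [1.6, 2.0, 3.5, 5.0],
    'Cylinders': [4, 4, 6, 8],
    'Vehicle Class': ['COMPACT', 'COMPACT', 'SUV - SMALL', 'PICKUP TRUCK - STANDARD'],
    'Transmission': ['M6', 'AS6', 'AS6', 'A8'],
    'Fuel Type': ['X', 'X', 'Z', 'X'],
})
toy_y = [180, 200, 255, 330]

toy_pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(), num_cols),
    ])),
    ('model', LinearRegression()),
]).fit(toy_X, toy_y)

# Save to a temporary location, then reload as a fresh object.
path = os.path.join(tempfile.mkdtemp(), 'toy_co2_linreg.pkl')
joblib.dump(toy_pipe, path)
reloaded = joblib.load(path)

# A hypothetical new design; 'AV' was not seen in training and is ignored.
new_car = pd.DataFrame([{
    'Engine Size(L)': 2.5, 'Cylinders': 4,
    'Vehicle Class': 'COMPACT', 'Transmission': 'AV', 'Fuel Type': 'X',
}])
pred = reloaded.predict(new_car)
print(f"Predicted CO2: {pred[0]:.1f} g/km")
```

Two practical notes: pickle-based files should only be loaded from sources you trust, and a saved pipeline is safest to load with the same scikit-learn version that created it.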

Summary

With fewer than a hundred lines of Python, we converted raw vehicle specs into an explainable CO₂‑emission estimator.

The linear model:

Delivers fast, reproducible emission forecasts for early‑stage design or fleet compliance checks.
Reveals transparent marginal impacts, showing how much each standard-deviation step in engine size (convertible to g/km per litre) or each gearbox type nudges emissions up or down.

Keep this interpretable baseline as your benchmark; when you later incorporate hybrid‑electric indicators, curb weight, or switch to gradient‑boosted trees, you’ll know precisely how much real predictive power the extra complexity adds.
