Fuel Consumption Prediction using Linear Regression in ML
Vehicle engineers and policy makers rely on accurate fuel‑consumption estimates to design efficient engines, set emissions targets, and inform buyers. Given basic engine specifications (size, cylinders), vehicle class, transmission type, and fuel type, we want to predict combined fuel consumption in L / 100 km for new light‑duty vehicles using an interpretable linear‑regression baseline. A transparent model reveals which specs drive efficiency and offers a benchmark before moving to more complex algorithms.
Libraries Required
- pandas # tabular wrangling
- numpy # numeric helpers
- matplotlib.pyplot # quick EDA charts
- seaborn # tidy correlation heatmaps (optional)
- scikit‑learn # train/test split, model, metrics
- joblib # save trained model
Dataset Link
Canadian Fuel Consumption & CO₂
Step by Step Code Implementation
Why linear regression?
Inside typical design limits, combined fuel use scales roughly linearly with engine displacement and cylinder count. A linear model exposes these first‑order correlations and provides coefficients that engineers can read in minutes.
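To make "coefficients that engineers can read" concrete, here is a minimal sketch that fits a straight line to a handful of made-up readings. The numbers are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical, made-up readings for illustration only (not from the dataset):
# engine displacement in litres and combined consumption in L/100 km.
engine_size = np.array([1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
consumption = np.array([6.9, 8.0, 9.1, 9.9, 11.1, 12.0])

# Ordinary least-squares fit of: consumption ≈ slope * engine_size + intercept
slope, intercept = np.polyfit(engine_size, consumption, deg=1)

print(f"slope     : {slope:.2f} L/100 km per extra litre of displacement")
print(f"intercept : {intercept:.2f} L/100 km")
```

The slope reads directly as "extra L/100 km per extra litre of displacement", which is the kind of first-order summary the full model provides for every feature.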
Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # optional
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
Load data
df = pd.read_csv("FuelConsumptionCo2.csv")
print(df.head())
print(df.shape)
Minimal cleaning
# drop rows with blanks in any column the model uses
df = df.dropna(subset=['Engine Size(L)', 'Cylinders',
                       'Fuel Type', 'Vehicle Class', 'Transmission',
                       'Fuel Consumption Comb (L/100 km)']).copy()
Define features & label
target = 'Fuel Consumption Comb (L/100 km)'
num_cols = ['Engine Size(L)', 'Cylinders']
cat_cols = ['Fuel Type', 'Vehicle Class', 'Transmission']
X = df[num_cols + cat_cols]
y = df[target]
Pre‑processing & model pipeline
- One‑hot encoding avoids treating labels like “SUV” or “Automatic” as ordinal numbers—each becomes its own binary column, letting the model learn a clean adjustment per category.
- ColumnTransformer + Pipeline bundles preprocessing and model into one object, preventing training/inference mismatches and easing export to production.
# one‑hot encode the categorical columns
ohe = OneHotEncoder(handle_unknown='ignore')
preproc = ColumnTransformer([
    ('cat', ohe, cat_cols)
], remainder='passthrough')  # numeric columns pass through unchanged
lin_model = LinearRegression()
pipeline = Pipeline(steps=[
    ('prep', preproc),
    ('model', lin_model)
])
Train/test split and training
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
Evaluation
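MAE in L / 100 km is intuitive: an MAE of about 0.5 means predictions miss the measured value by half a litre per 100 km on average, which is handy when deciding whether the model is “good enough” for showroom labels. R² complements it by reporting the share of variance in consumption that the model explains.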
y_pred = pipeline.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} L/100 km")
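To show what the MAE figure measures, this sketch computes it by hand on four made-up predictions and checks the result against scikit-learn. The values are hypothetical and only illustrate the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted values in L/100 km, for illustration only.
y_true = np.array([8.0, 10.5, 7.2, 12.0])
y_hat = np.array([8.5, 10.0, 7.0, 13.0])

abs_errors = np.abs(y_true - y_hat)  # per-vehicle miss, sign ignored
mae_manual = abs_errors.mean()       # average miss across vehicles
mae_sklearn = mean_absolute_error(y_true, y_hat)

print(f"manual MAE  : {mae_manual:.2f} L/100 km")
print(f"sklearn MAE : {mae_sklearn:.2f} L/100 km")
```

Both lines report the same value because mean_absolute_error is the mean of the absolute differences, expressed in the same unit as the target.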
Inspect coefficients (top drivers)
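The coefficient table pinpoints which transmissions, fuel types, or classes are associated with lower (negative) or higher (positive) predicted consumption, holding the other features fixed. Because every category level gets its own column, the one‑hot coefficients are best compared against each other within a feature rather than read as absolute effects. Product teams can use this to prioritise efficiency tweaks.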
# recover feature names from OneHotEncoder
ohe_feats = pipeline.named_steps['prep']\
.named_transformers_['cat']\
.get_feature_names_out(cat_cols)
all_feats = list(ohe_feats) + num_cols
coefs = pd.Series(pipeline.named_steps['model'].coef_,
index=all_feats).sort_values()
print("Most efficient specs (negative coefficients):")
print(coefs.head(10))
print("\nGas‑guzzling specs (positive coefficients):")
print(coefs.tail(10))
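One way to sanity-check how the coefficients are read is to rebuild a single prediction by hand as the intercept plus the sum of coefficient times encoded feature. The sketch below does this on a tiny toy pipeline; the data values are hypothetical, and only the column names follow the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "Engine Size(L)": [1.6, 2.0, 2.0, 3.5, 5.0, 3.0],
    "Fuel Type": ["X", "X", "Z", "Z", "Z", "D"],
    "y": [7.0, 8.1, 8.6, 11.2, 14.0, 8.9],
})
X_toy = toy[["Engine Size(L)", "Fuel Type"]]

prep = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Fuel Type"])],
    remainder="passthrough",  # the numeric column is appended after the one-hot columns
)
pipe = Pipeline([("prep", prep), ("model", LinearRegression())])
pipe.fit(X_toy, toy["y"])

# Encode one row exactly as the pipeline does internally.
row = X_toy.iloc[[0]]
encoded = pipe.named_steps["prep"].transform(row)
encoded = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)

# prediction = intercept + sum(coefficient * encoded feature value)
model = pipe.named_steps["model"]
manual = model.intercept_ + float(encoded[0] @ model.coef_)

print(f"manual   : {manual:.4f}")
print(f"pipeline : {pipe.predict(row)[0]:.4f}")
```

The two numbers agree, which confirms that each one-hot coefficient acts as an additive adjustment applied only when that category is present.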
Persist the trained pipeline
joblib.dump(pipeline, "fuel_consumption_linreg.pkl")
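A saved pipeline can later be restored with joblib.load and applied to new vehicles. The sketch below is self-contained, so it first trains a small stand-in pipeline on hypothetical rows and writes it to a temporary directory; in the real workflow you would load "fuel_consumption_linreg.pkl" instead. The vehicle values are made up for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

num_cols = ["Engine Size(L)", "Cylinders"]
cat_cols = ["Fuel Type", "Vehicle Class", "Transmission"]

# Hypothetical training rows standing in for the real dataset.
toy = pd.DataFrame({
    "Engine Size(L)": [1.6, 2.0, 2.4, 3.5, 5.3, 3.0],
    "Cylinders": [4, 4, 4, 6, 8, 6],
    "Fuel Type": ["X", "X", "Z", "Z", "X", "D"],
    "Vehicle Class": ["COMPACT", "COMPACT", "MID-SIZE",
                      "SUV - SMALL", "PICKUP TRUCK - STANDARD", "SUV - SMALL"],
    "Transmission": ["M6", "AS6", "AS6", "AS8", "A6", "AS8"],
})
y_toy = [7.0, 7.8, 8.9, 10.9, 13.5, 9.0]

pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
        remainder="passthrough")),
    ("model", LinearRegression()),
])
pipe.fit(toy[num_cols + cat_cols], y_toy)

# A new vehicle with the same columns, in the same order, as the training frame.
new_vehicle = pd.DataFrame([{
    "Engine Size(L)": 2.0,
    "Cylinders": 4,
    "Fuel Type": "X",
    "Vehicle Class": "COMPACT",
    "Transmission": "AV",  # unseen transmission; encoded as all zeros
}])[num_cols + cat_cols]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_linreg.pkl")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

original_pred = pipe.predict(new_vehicle)
restored_pred = restored.predict(new_vehicle)
print(f"Predicted consumption: {restored_pred[0]:.2f} L/100 km")
```

Because preprocessing and model travel together in one object, the restored pipeline accepts raw columns and produces the same prediction as the in-memory one. Loading should use a scikit-learn version compatible with the one that saved the file.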
Summary
This walkthrough shows how to turn a publicly available vehicle-spec dataset into a transparent fuel-consumption estimator with little more than pandas and scikit-learn. After minimal cleaning and one‑hot encoding, the linear model captures core efficiency drivers (engine size, cylinder count, vehicle class) while giving planners a concrete error figure, the MAE in L / 100 km, that they can act on. The same preprocessing can feed more flexible models (Ridge, Gradient Boosting, XGBoost) later, but starting with this interpretable baseline grounds the project in solid engineering insight.