Retail Revenue Prediction with Ridge & Lasso Mixed Regression in ML


A retail e‑commerce firm wants to forecast the total invoice revenue of each incoming order before the payment is processed. Features such as country, basket size, number of unique SKUs, season, and average item price are known as soon as the cart is finalised. However, strong multicollinearity among the volume and monetary variables (e.g., total quantity, basket size, and unit price) can destabilise an ordinary least‑squares model. Pure Lasso tends to keep one variable from a correlated group somewhat arbitrarily and may over‑shrink the others, while pure Ridge keeps every noisy term.
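To make the instability concrete, here is a minimal sketch on synthetic data (the `qty` and `basket` columns are hypothetical stand-ins, not taken from the retail file). It compares the condition number of the Gram matrix XᵀX with and without an ℓ2 penalty term; a very large condition number is what makes least-squares coefficients swing wildly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical synthetic features: a quantity column and a "basket" column
# that is almost a copy of it (strong collinearity).
qty = rng.normal(size=n)
basket = qty + 0.01 * rng.normal(size=n)
X = np.column_stack([qty, basket])

gram = X.T @ X
cond_ols = np.linalg.cond(gram)

# Adding lam * I to the Gram matrix is the effect of a Ridge-style l2 penalty.
lam = 1.0
cond_ridge = np.linalg.cond(gram + lam * np.eye(2))

print(f"Condition number without penalty: {cond_ols:,.0f}")
print(f"Condition number with l2 penalty: {cond_ridge:,.0f}")
```

The penalised matrix should be far better conditioned, which is the stabilising effect the ℓ2 part of Elastic Net relies on.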

Elastic Net mixes Lasso’s ℓ₁ sparsity with Ridge’s ℓ₂ stability, so the fitted model can drop weak features, which keeps it interpretable, while staying robust to collinearity.
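For reference, the objective minimised by scikit‑learn's ElasticNet can be written as below. The symbols are notation chosen here rather than taken from the tutorial: n is the number of training rows, w the coefficient vector, α the `alpha` parameter, and ρ the `l1_ratio` parameter.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the Lasso (ℓ₁) term, and ρ = 0 leaves only the Ridge‑style (ℓ₂) term.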

Libraries Required

Role	Library
Data wrangling	pandas, numpy
Date handling	pandas datetime accessor (.dt, built in)
Visuals	matplotlib, seaborn
Modelling pipeline	scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Evaluation	mean_squared_error, r2_score

Dataset Link

Online Retail II (UCI)

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

2. Download and load the dataset

The dataset covers two years of e‑commerce transactions (more than 1 million rows). After cleaning, each invoice becomes one row, enriched with basket size and temporal flags.

# One‑time terminal (needs Kaggle API token):
# kaggle datasets download -d jillwang87/online-retail-ii -p data --unzip

# The file online_retail_II.xlsx has two sheets (2009‑2010 & 2010‑2011). Read both & stack.
df1 = pd.read_excel("data/online_retail_II.xlsx", sheet_name="Year 2009-2010")
df2 = pd.read_excel("data/online_retail_II.xlsx", sheet_name="Year 2010-2011")
data = pd.concat([df1, df2], ignore_index=True)

3. Clean + feature engineering

# Basic cleaning: drop returns (Invoice starts with 'C') & missing cust IDs
data = data[~data['Invoice'].astype(str).str.startswith('C')]
data = data.dropna(subset=['Customer ID'])

# Revenue per invoice line
data['LineRevenue'] = data['Quantity'] * data['Price']

# Aggregate to invoice level (prediction granularity)
agg = data.groupby(['Invoice', 'InvoiceDate', 'Country', 'Customer ID']).agg({
    'Quantity': 'sum',
    'StockCode': 'nunique',
    'LineRevenue': 'sum',
    'Price': 'mean'
}).reset_index().rename(columns={
    'Quantity': 'TotalQty',
    'StockCode': 'UniqueItems',
    'Price': 'AvgUnitPrice',
    'LineRevenue': 'Revenue'      # target y
})

# Temporal features derived from the invoice timestamp
agg['InvoiceDate'] = pd.to_datetime(agg['InvoiceDate'])
agg['Month']  = agg['InvoiceDate'].dt.month
agg['Dow']    = agg['InvoiceDate'].dt.dayofweek
agg['IsWeekend'] = agg['Dow'].isin([5,6]).astype(int)
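As a sanity check on the aggregation logic, the sketch below applies the same steps to a tiny hand-made table. The rows are hypothetical and only reuse the column names of the real file; it also uses pandas named aggregation instead of the dict-plus-rename form above, which is a stylistic alternative rather than the tutorial's own code.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the real file
toy = pd.DataFrame({
    "Invoice":     ["1001", "1001", "1002", "C1003"],
    "StockCode":   ["A", "B", "A", "A"],
    "Quantity":    [2, 1, 5, -2],
    "Price":       [3.0, 10.0, 3.0, 3.0],
    "Customer ID": [17850.0, 17850.0, 13047.0, 13047.0],
})

# Drop cancellations (Invoice starts with 'C'), then compute line revenue
toy = toy[~toy["Invoice"].astype(str).str.startswith("C")].copy()
toy["LineRevenue"] = toy["Quantity"] * toy["Price"]

# One row per invoice
toy_agg = (
    toy.groupby("Invoice")
       .agg(TotalQty=("Quantity", "sum"),
            UniqueItems=("StockCode", "nunique"),
            AvgUnitPrice=("Price", "mean"),
            Revenue=("LineRevenue", "sum"))
       .reset_index()
)
print(toy_agg)
```

Invoice 1001 has two lines (2 × £3 and 1 × £10), so its aggregated Revenue should be £16 with two unique items, and the cancelled invoice should be gone.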

4. Define features & target

The target is revenue per invoice, i.e. the sum of Quantity × Price over the invoice's lines.

y = agg['Revenue']

X = agg[['TotalQty', 'UniqueItems', 'AvgUnitPrice',
         'Country', 'Month', 'IsWeekend']]

5. Pre‑processing & Elastic Net pipeline

TotalQty, UniqueItems, and AvgUnitPrice are correlated; Elastic Net’s blend mitigates instability while still zeroing weak dummies among ~40 country one‑hots.

num_cols = ['TotalQty', 'UniqueItems', 'AvgUnitPrice', 'Month', 'IsWeekend']
cat_cols = ['Country']

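preprocess = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    # handle_unknown='ignore' keeps CV folds from failing on countries unseen in training
    # (combining it with drop='first' needs scikit-learn >= 1.0)
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols)
])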

pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=10000, random_state=42))
])
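To illustrate the "zeroing weak terms" behaviour, here is a small sketch on synthetic data, not on the retail set. The design is an assumption made for the demo: column 1 nearly duplicates column 0, and columns 3 to 9 are pure noise. Ridge is expected to keep every coefficient non-zero, while Elastic Net should set at least some noise coefficients exactly to zero.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge

rng = np.random.default_rng(42)
n = 400

# Synthetic design: 10 columns, column 1 is almost a copy of column 0
X_demo = rng.normal(size=(n, 10))
X_demo[:, 1] = X_demo[:, 0] + 0.05 * rng.normal(size=n)

# Only columns 0 and 2 drive the target; columns 3..9 are noise
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
enet = ElasticNet(alpha=0.2, l1_ratio=0.5, max_iter=10000).fit(X_demo, y_demo)

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_enet = int(np.sum(enet.coef_ == 0))
print("Ridge coefficients set to zero:      ", n_zero_ridge)
print("Elastic Net coefficients set to zero:", n_zero_enet)
```

The alpha and l1_ratio values here are arbitrary picks for the demo; on the real data they are chosen by the grid search.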

6. Train/test split & hyper‑parameter search

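Because scaling, encoding, and the model sit in one Pipeline, each cross‑validation fold fits the pre‑processing on its own training portion only, which prevents leakage. GridSearchCV scans 135 hyper‑parameter pairs (15 α values × 9 l1_ratio values), which means 675 fits with 5‑fold cross‑validation plus one final refit.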

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

param_grid = {
    'enet__alpha': np.logspace(-2, 2, 15),        # overall penalty strength
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)    # mix: 0.1≈Ridge‑heavy, 0.9≈Lasso‑heavy
}

search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error',
                      n_jobs=-1, verbose=1)
search.fit(X_train, y_train)

print("Best α:", search.best_params_['enet__alpha'])
print("Best l1_ratio:", search.best_params_['enet__l1_ratio'])
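Beyond the single best pair, it can help to look at how close the runner-up settings are. The sketch below shows one way to tabulate `cv_results_`; it runs on a small synthetic problem with a reduced grid (all names prefixed `toy_` are hypothetical) so that it is self-contained, but the same DataFrame steps should apply to the fitted `search` object.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([4.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)

# Reduced grid: 4 alphas x 3 l1_ratios = 12 candidates
toy_grid = {"alpha": np.logspace(-2, 1, 4), "l1_ratio": [0.1, 0.5, 0.9]}
toy_search = GridSearchCV(ElasticNet(max_iter=10000), toy_grid, cv=5,
                          scoring="neg_root_mean_squared_error")
toy_search.fit(X_toy, y_toy)

# Scores are negated RMSE, so flip the sign and sort ascending
results = (
    pd.DataFrame(toy_search.cv_results_)
      [["param_alpha", "param_l1_ratio", "mean_test_score", "std_test_score"]]
      .assign(rmse=lambda d: -d["mean_test_score"])
      .sort_values("rmse")
)
print(results.head())
```

If several candidates sit within one `std_test_score` of the best RMSE, the sparser one (higher l1_ratio or higher alpha) may be the more practical choice.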

7. Evaluate on the hold‑out set

RMSE in pounds sterling gives an intuitive error band; R² shows the share of variance explained.

y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False is deprecated in recent scikit-learn
r2   = r2_score(y_test, y_pred)

print(f"Hold‑out RMSE: £{rmse:,.2f} | R²: {r2:.3f}")
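An RMSE figure is easier to judge next to a naive baseline. The sketch below compares a model's RMSE with that of always predicting the training mean. The revenue numbers are made up for illustration; on the real data, `y_train`, `y_test`, and `y_pred` would take the place of the toy arrays.

```python
import numpy as np

def rmse_score(y_true, y_hat):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

# Hypothetical invoice revenues (£)
toy_train = np.array([120.0, 80.0, 300.0, 150.0, 95.0])
toy_test = np.array([110.0, 260.0, 90.0])
toy_pred = np.array([130.0, 240.0, 100.0])   # made-up model predictions

# Baseline: always predict the mean revenue seen in training
baseline_pred = np.full_like(toy_test, toy_train.mean())

model_rmse = rmse_score(toy_test, toy_pred)
baseline_rmse = rmse_score(toy_test, baseline_pred)
print(f"Model RMSE:    £{model_rmse:,.2f}")
print(f"Baseline RMSE: £{baseline_rmse:,.2f}")
```

A model whose RMSE is not clearly below the baseline adds little, whatever its R² looks like.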

8. Inspect top coefficients

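The coefficient bar‑plot reveals the biggest drivers (e.g., TotalQty, AvgUnitPrice, high‑value countries), guiding up‑sell tactics and pricing tweaks. Because the numeric features are standardised, their coefficients read as the change in £ revenue per one standard deviation of the feature; each country coefficient is relative to the dropped baseline country.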

# Retrieve feature names after one‑hot encoding
ohe_names = search.best_estimator_.named_steps['prep'] \
               .named_transformers_['cat'].get_feature_names_out(cat_cols)
feature_names = np.hstack([num_cols, ohe_names])

coefs = search.best_estimator_.named_steps['enet'].coef_
imp   = pd.Series(coefs, index=feature_names).sort_values(key=abs, ascending=False)

plt.figure(figsize=(8,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net Coefficients – Retail Revenue Drivers')
plt.xlabel('Coefficient (Δ £ revenue per 1 SD, or per country dummy)')
plt.show()
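Coefficients fitted on standardised inputs can be converted back to original units by dividing by the scaler's `scale_` attribute. The following is a sketch on synthetic data with hypothetical `qty` and `price` columns and made-up true effects of £2.50 per unit and £10 per £1 of price; it is not a result from the retail data.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
qty = rng.integers(1, 200, size=300).astype(float)
price = rng.uniform(0.5, 20.0, size=300)
X_toy = np.column_stack([qty, price])
y_toy = 2.5 * qty + 10.0 * price + rng.normal(scale=5.0, size=300)

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000)),
])
toy_pipe.fit(X_toy, y_toy)

# Coefficients as fitted: £ per one standard deviation of each feature
std_coefs = toy_pipe.named_steps["enet"].coef_
# Divide by each feature's standard deviation to get £ per original unit
raw_coefs = std_coefs / toy_pipe.named_steps["scale"].scale_

print("Per-SD coefficients:  ", np.round(std_coefs, 2))
print("Per-unit coefficients:", np.round(raw_coefs, 2))
```

The per-unit values should land close to the made-up true effects, slightly shrunk by the penalty. In the tutorial's pipeline the scaler sits inside the ColumnTransformer, so it would be reached via `named_transformers_['num']` instead.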

Summary

In just a few succinct steps, we produced an Elastic Net model that:

Predicts invoice‑level revenue before checkout payment.
Balances Ridge & Lasso, yielding a stable yet sparse model with easy‑to‑explain coefficients.
Refreshes with one .fit() when new sales logs arrive, thanks to the end‑to‑end Pipeline.

Armed with these predictions, retail analysts can tailor promotions, trigger fraud checks on abnormally high projected orders, and optimise inventory for the most lucrative country–season combinations—all before the cash register rings.
