Retail Revenue Prediction with Ridge & Lasso Mixed Regression in ML
A retail e‑commerce firm wants to forecast the total invoice revenue it will earn from each incoming order before the payment is processed. Features such as country, basket size, number of unique SKUs, season, and average item price are known as soon as the cart is finalised. However, the monetary variables (quantity, unit price, basket size) are strongly correlated with one another, and this multicollinearity can destabilise an ordinary least‑squares model. Pure Lasso tends to keep one of several correlated predictors more or less arbitrarily and shrink the rest to zero, while pure Ridge keeps every noisy term.
Elastic Net, which mixes Lasso’s ℓ₁ sparsity with Ridge’s ℓ₂ stability, is a practical compromise: it can zero out weak predictors while keeping correlated ones stable, giving a model that is easier to interpret and less sensitive to collinearity.
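For reference, the objective that scikit‑learn's ElasticNet minimises can be written as below. The symbols are chosen here for illustration: n is the number of training rows, w the coefficient vector, α the `alpha` parameter, and ρ the `l1_ratio` parameter.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the ℓ₁ term (Lasso), and ρ = 0 leaves only the squared ℓ₂ term (a Ridge‑style penalty); values in between blend the two.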
Libraries Required
| Role | Library |
| --- | --- |
| Data wrangling | pandas, numpy |
| Excel reading | openpyxl (engine used by pd.read_excel for .xlsx files) |
| Date handling | pandas datetime accessors (.dt, built‑in) |
| Visuals | matplotlib, seaborn |
| Modelling pipeline | scikit‑learn: ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Evaluation | scikit‑learn metrics: mean_squared_error, r2_score |
Dataset Link
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
2. Download and load the dataset
The Online Retail II file covers two years of e‑commerce activity with more than 1 million transaction lines. After cleaning, each row represents one invoice, enriched with basket size and calendar flags.
# One‑time terminal (needs Kaggle API token):
# kaggle datasets download -d jillwang87/online-retail-ii -p data --unzip
# The file online_retail_II.xlsx has two sheets (2009‑2010 & 2010‑2011). Read both & stack.
df1 = pd.read_excel("data/online_retail_II.xlsx", sheet_name="Year 2009-2010")
df2 = pd.read_excel("data/online_retail_II.xlsx", sheet_name="Year 2010-2011")
data = pd.concat([df1, df2], ignore_index=True)
3. Clean + feature engineering
# Basic cleaning: drop returns (Invoice starts with 'C') & missing cust IDs
data = data[~data['Invoice'].astype(str).str.startswith('C')]
data = data.dropna(subset=['Customer ID'])
# Revenue per invoice line
data['LineRevenue'] = data['Quantity'] * data['Price']
# Aggregate to invoice level (prediction granularity)
agg = data.groupby(['Invoice', 'InvoiceDate', 'Country', 'Customer ID']).agg({
    'Quantity': 'sum',
    'StockCode': 'nunique',
    'LineRevenue': 'sum',
    'Price': 'mean'
}).reset_index().rename(columns={
    'Quantity': 'TotalQty',
    'StockCode': 'UniqueItems',
    'Price': 'AvgUnitPrice',
    'LineRevenue': 'Revenue'  # target y
})
# Temporal features: month, day of week, weekend flag
agg['InvoiceDate'] = pd.to_datetime(agg['InvoiceDate'])
agg['Month'] = agg['InvoiceDate'].dt.month
agg['Dow'] = agg['InvoiceDate'].dt.dayofweek
agg['IsWeekend'] = agg['Dow'].isin([5,6]).astype(int)
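To make the cleaning and aggregation above easier to follow, here is a minimal sketch on a hand‑made toy frame. The four rows, the invoice numbers, and the names `toy` and `toy_agg` are invented for illustration; only the column names follow the dataset. It uses named aggregation, which is an alternative spelling of the same groupby logic.

```python
import pandas as pd

# Hypothetical invoice lines; 'C1003' mimics a cancelled (returned) invoice
toy = pd.DataFrame({
    "Invoice": ["1001", "1001", "1002", "C1003"],
    "InvoiceDate": pd.to_datetime([
        "2010-12-04 10:00", "2010-12-04 10:00",
        "2010-12-06 09:30", "2010-12-06 11:00",
    ]),
    "Country": ["France", "France", "Spain", "Spain"],
    "Customer ID": [1.0, 1.0, 2.0, 2.0],
    "StockCode": ["A", "B", "A", "A"],
    "Quantity": [2, 1, 5, -5],
    "Price": [3.0, 10.0, 3.0, 3.0],
})

# Drop returns, then compute revenue per invoice line
toy = toy[~toy["Invoice"].astype(str).str.startswith("C")].copy()
toy["LineRevenue"] = toy["Quantity"] * toy["Price"]

# One row per invoice
toy_agg = (
    toy.groupby(["Invoice", "InvoiceDate", "Country", "Customer ID"])
    .agg(
        TotalQty=("Quantity", "sum"),
        UniqueItems=("StockCode", "nunique"),
        AvgUnitPrice=("Price", "mean"),
        Revenue=("LineRevenue", "sum"),
    )
    .reset_index()
)

# Calendar flag: Saturday (5) or Sunday (6)
toy_agg["IsWeekend"] = toy_agg["InvoiceDate"].dt.dayofweek.isin([5, 6]).astype(int)

print(toy_agg[["Invoice", "TotalQty", "UniqueItems", "AvgUnitPrice", "Revenue", "IsWeekend"]])
```

Invoice 1001 ends up with three units across two distinct items and a revenue of 16.0, while the cancelled invoice disappears before aggregation.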
4. Define features & target
The target is revenue per invoice, i.e. the sum of Quantity × Price over the invoice's lines.
y = agg['Revenue']
X = agg[['TotalQty', 'UniqueItems', 'AvgUnitPrice',
'Country', 'Month', 'IsWeekend']]
5. Pre‑processing & Elastic Net pipeline
TotalQty, UniqueItems, and AvgUnitPrice are correlated; Elastic Net’s blend mitigates instability while still zeroing weak dummies among ~40 country one‑hots.
num_cols = ['TotalQty', 'UniqueItems', 'AvgUnitPrice', 'Month', 'IsWeekend']
cat_cols = ['Country']
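preprocess = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    # handle_unknown='ignore' encodes a country unseen during fitting as all zeros
    # instead of raising an error; rare countries can be missing from a CV fold.
    # No category is dropped: the penalty already copes with the redundant dummy.
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])
pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=10000, random_state=42))
])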
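The instability that correlated predictors cause can be seen on synthetic data. The sketch below is not part of the retail pipeline: the two near‑duplicate predictors, the helper `coef_spread`, and the penalty settings (`alpha=0.1`, `l1_ratio=0.5`) are assumptions chosen only to illustrate the effect. It refits each model on bootstrap resamples and compares how much the coefficients move.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
n = 200

# Two nearly identical predictors (stand-ins for correlated basket features)
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(size=n)


def coef_spread(model, n_boot=50):
    """Standard deviation of each coefficient across bootstrap refits."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        coefs.append(model.fit(X_demo[idx], y_demo[idx]).coef_.copy())
    return np.std(coefs, axis=0)


ols_spread = coef_spread(LinearRegression())
enet_spread = coef_spread(ElasticNet(alpha=0.1, l1_ratio=0.5))

print("OLS coefficient spread:        ", ols_spread.round(3))
print("Elastic Net coefficient spread:", enet_spread.round(3))
```

On data like this the least‑squares coefficients swing widely from one resample to the next, whereas the penalised coefficients stay close together.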
6. Train/test split & hyper‑parameter search
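Because scaling, encoding, and the model live in one Pipeline, each cross‑validation fold fits the preprocessing on its own training portion only, which prevents leakage. GridSearchCV scans 135 hyper‑parameter pairs (15 α × 9 l1_ratio) with 5‑fold cross‑validation.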
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
param_grid = {
    'enet__alpha': np.logspace(-2, 2, 15),       # overall penalty strength
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)   # mix: 0.1≈Ridge‑heavy, 0.9≈Lasso‑heavy
}
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error',
                      n_jobs=-1, verbose=1)
search.fit(X_train, y_train)
print("Best α:", search.best_params_['enet__alpha'])
print("Best l1_ratio:", search.best_params_['enet__l1_ratio'])
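To get a feel for the cost of the search, the grid size can be counted before fitting anything. This is a small sketch using scikit‑learn's ParameterGrid; the variable names `grid`, `n_candidates`, `n_folds`, and `n_fits` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid = {
    "enet__alpha": np.logspace(-2, 2, 15),      # 0.01 ... 100 on a log scale
    "enet__l1_ratio": np.linspace(0.1, 0.9, 9), # 0.1, 0.2, ..., 0.9
}

n_candidates = len(ParameterGrid(grid))  # every alpha paired with every l1_ratio
n_folds = 5
n_fits = n_candidates * n_folds          # cross-validation fits, before the final refit

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

With the default refit=True, GridSearchCV then fits the best candidate once more on the whole training set.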
7. Evaluate on the hold‑out set
RMSE is expressed in pounds sterling, so it reads directly as a typical prediction error per invoice; R² shows the share of variance explained.
y_pred = search.predict(X_test)
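# The squared=False argument was deprecated and later removed in newer scikit-learn releases,
# so take the square root explicitly; this works on every version.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))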
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: £{rmse:,.2f} | R²: {r2:.3f}")
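The two metrics are simple enough to compute by hand, which helps when reading the numbers. The three values below are made up for the example and are not results from the retail data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up invoice revenues (GBP) and predictions
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

residuals = y_true - y_hat                       # [-2, 2, -3]
rmse_manual = np.sqrt(np.mean(residuals ** 2))   # sqrt(17 / 3), about 2.38

ss_res = np.sum(residuals ** 2)                  # 17: unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 200: total variation around the mean
r2_manual = 1 - ss_res / ss_tot                  # 0.915

print(f"RMSE: {rmse_manual:.2f} | R2: {r2_manual:.3f}")
print("Matches scikit-learn:",
      np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat))),
      np.isclose(r2_manual, r2_score(y_true, y_hat)))
```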
8. Inspect top coefficients
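The coefficient bar‑plot reveals the biggest drivers (e.g., TotalQty, AvgUnitPrice, high‑value countries), guiding up‑sell tactics and pricing tweaks. Because the numeric features are standardised, each of their coefficients is the change in predicted revenue (£) per one standard deviation of that feature; a country coefficient is the shift in predicted revenue for that country's dummy.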
# Retrieve feature names after one‑hot encoding
ohe_names = search.best_estimator_.named_steps['prep'] \
.named_transformers_['cat'].get_feature_names_out(cat_cols)
feature_names = np.hstack([num_cols, ohe_names])
coefs = search.best_estimator_.named_steps['enet'].coef_
imp = pd.Series(coefs, index=feature_names).sort_values(key=abs, ascending=False)
plt.figure(figsize=(8,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net Coefficients – Retail Revenue Drivers')
plt.xlabel('Coefficient (Δ £ revenue)')
plt.show()
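Coefficients of standardised features can be converted back to original units by dividing by the scaler's standard deviation. The sketch below shows this on synthetic data where the true slope is 2 per unit; the data, the step names `scale` and `enet`, and the small `alpha` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic feature with standard deviation ~5 and a true slope of 2 per unit
x = rng.normal(loc=50.0, scale=5.0, size=500)
y_toy = 2.0 * x + rng.normal(size=500)

toy_model = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNet(alpha=0.001, l1_ratio=0.5)),
])
toy_model.fit(x.reshape(-1, 1), y_toy)

coef_per_sd = toy_model.named_steps["enet"].coef_[0]  # change in y per 1 SD of x
sd = toy_model.named_steps["scale"].scale_[0]         # SD learned by the scaler
coef_per_unit = coef_per_sd / sd                      # change in y per 1 unit of x

print(f"per SD: {coef_per_sd:.2f} | per unit: {coef_per_unit:.2f}")
```

The same division applies to the retail model: dividing the TotalQty coefficient by the matching entry of the fitted scaler's scale_ gives pounds per additional unit ordered.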
Summary
In just a few succinct steps, we produced an Elastic Net model that:
- Predicts invoice‑level revenue before checkout payment.
- Balances Ridge & Lasso, yielding a stable yet sparse model with easy‑to‑explain coefficients.
- Refreshes with one .fit() when new sales logs arrive, thanks to the end‑to‑end Pipeline.
Armed with these predictions, retail analysts can tailor promotions, trigger fraud checks on abnormally high projected orders, and optimise inventory for the most lucrative country–season combinations—all before the cash register rings.
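As a closing illustration of scoring a cart before payment, here is a self‑contained sketch that fits the same kind of pipeline on randomly generated stand‑in data and predicts one new order. Everything in it (the synthetic rows, the fixed `alpha` and `l1_ratio`, the names `toy_X`, `toy_pipe`, `new_cart`) is hypothetical; with the real data you would call predict on the fitted search object instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 300

# Randomly generated stand-in for the invoice-level table
toy_X = pd.DataFrame({
    "TotalQty": rng.integers(1, 200, size=n),
    "UniqueItems": rng.integers(1, 40, size=n),
    "AvgUnitPrice": rng.uniform(0.5, 10.0, size=n),
    "Country": rng.choice(["United Kingdom", "France", "Germany"], size=n),
    "Month": rng.integers(1, 13, size=n),
    "IsWeekend": rng.integers(0, 2, size=n),
})
toy_y = toy_X["TotalQty"] * toy_X["AvgUnitPrice"] + rng.normal(0, 5, size=n)

num_features = ["TotalQty", "UniqueItems", "AvgUnitPrice", "Month", "IsWeekend"]
toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
    ])),
    ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)),
])
toy_pipe.fit(toy_X, toy_y)

# A finalised cart from a country the model never saw during fitting
new_cart = pd.DataFrame([{
    "TotalQty": 24, "UniqueItems": 6, "AvgUnitPrice": 2.95,
    "Country": "Iceland", "Month": 11, "IsWeekend": 0,
}])
predicted_revenue = toy_pipe.predict(new_cart)

print(f"Predicted revenue: {predicted_revenue[0]:,.2f}")
```

The unseen country is encoded as all zeros thanks to handle_unknown='ignore', so the prediction falls back on the numeric features and the intercept.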