Retail Expansion Cost Prediction with Ridge & Lasso Mixed Regression in ML
When a retail chain opens stores in a new city, managers must budget for parcel-level outbound logistics: the immediate cost of shipping every order from the new warehouse to early-adopter customers. Historical sales data show that the first-year expansion shipping cost depends on product mix, order quantity, promotional discounts, and chosen carrier service. Because many of those variables move together (large orders ↔ bulk discounts ↔ higher shipping weight), ordinary least-squares coefficient estimates can become unstable, swinging widely with small changes in the training data. At the same time, a pure Lasso model may over-shrink genuinely useful features, since it tends to keep one variable from a correlated group and zero out the rest.
A mixed regression model (Elastic Net, a weighted blend of the Ridge ℓ₂ and Lasso ℓ₁ penalties) delivers a sparse and stable estimator of per-order expansion cost. We will predict Shipping Cost (USD) for future orders using only information available at checkout (segment, region, category, sub-category, quantity, discount, sales, ship mode).
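For reference, the blend can be written out explicitly. The objective below is the one scikit-learn documents for `ElasticNet`, with $w$ the coefficient vector, $n$ the number of training rows, $\alpha$ the `alpha` parameter, and $\rho$ the `l1_ratio` parameter (the symbols $w$, $n$, and $\rho$ are notation chosen here, not taken from the article):

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting $\rho = 1$ leaves only the Lasso term, while $\rho = 0$ leaves only the Ridge-style squared penalty; values in between trade sparsity against stability under collinearity.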
Libraries Required
| Purpose | Library |
| --- | --- |
| Data wrangling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| ML workflow | scikit-learn: ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Evaluation | scikit-learn metrics: mean_squared_error, r2_score |
Dataset Link
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
2. Download & load dataset
Dataset: the Superstore file records 10,000+ historical orders, including categorical descriptors (ship mode, segment, region, product hierarchy) and monetary figures (sales, discount, shipping cost).
# One‑time terminal command (requires Kaggle API token):
# kaggle datasets download -d vivek468/superstore-dataset-final -p data --unzip
df = pd.read_csv("data/SampleSuperstore.csv") # file name inside the zip
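Superstore exports circulate in more than one variant, so it may be worth confirming that the loaded file actually carries every column the later steps rely on before going further. The sketch below is one way to do that; the helper name `missing_columns` and the toy frame are hypothetical and only illustrate the check.

```python
import pandas as pd

# Columns the tutorial relies on (copied from the feature-selection step).
REQUIRED_COLS = ['Ship Mode', 'Segment', 'Region', 'Category', 'Sub-Category',
                 'Sales', 'Quantity', 'Discount', 'Shipping Cost']

def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLS) -> list:
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Toy frame standing in for the loaded CSV (illustrative values only).
toy = pd.DataFrame({'Ship Mode': ['Same Day'], 'Sales': [120.0], 'Quantity': [3]})
print(missing_columns(toy))
```

Calling `missing_columns(df)` on the real frame and stopping if the result is non-empty gives a clearer failure than the `KeyError` that column selection would otherwise raise.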
3. Feature selection & quick EDA
During an expansion, outbound freight is one of the largest controllable OPEX items. Predicting it accurately at checkout lets the chain fine‑tune free‑shipping thresholds or carrier mixes for the new market.
# Keep only columns required for modelling
use_cols = ['Ship Mode', 'Segment', 'Region', 'Category', 'Sub-Category',
            'Sales', 'Quantity', 'Discount', 'Shipping Cost']
data = df[use_cols].dropna()
# Target variable
y = data['Shipping Cost']
X = data.drop(columns='Shipping Cost')
# Visual sanity‑check
sns.histplot(y, kde=True)
plt.title('Shipping-Cost distribution')
plt.show()
4. Identify categorical vs numeric predictors
cat_cols = ['Ship Mode', 'Segment', 'Region', 'Category', 'Sub-Category']
num_cols = ['Sales', 'Quantity', 'Discount']
5. Pre‑processing & Elastic Net pipeline
ColumnTransformer one‑hot‑encodes categorical variables and z‑scales numeric ones inside each CV split, avoiding leakage and ensuring deploy‑time steps mirror training precisely.
preprocess = ColumnTransformer([
    # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
    ('num', StandardScaler(), num_cols)
])
pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=15000, random_state=42))
])
6. Train/test split & hyper‑parameter grid search
- alpha sets the overall penalty: higher α shrinks more.
- l1_ratio slides the mix between Ridge (robust to collinearity) and Lasso (feature selection).
- Searching 162 combinations (18 α × 9 ratios) with five‑fold CV finds the sweet spot that minimises RMSE.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),    # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)   # Ridge-heavy → Lasso-heavy
}
search = GridSearchCV(pipe,
                      param_grid,
                      cv=5,
                      scoring='neg_root_mean_squared_error',
                      n_jobs=-1, verbose=1)
search.fit(X_train, y_train)
print("Best α:", search.best_params_['enet__alpha'])
print("Best l1_ratio:", search.best_params_['enet__l1_ratio'])
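As a quick check on the search budget, the grid can be enumerated without fitting anything. This sketch uses scikit-learn's `ParameterGrid` on the same two ranges; the variable names `n_candidates` and `n_fits` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same ranges as the search grid: 18 alphas and 9 l1_ratio values.
grid = ParameterGrid({
    'enet__alpha': np.logspace(-3, 1, 18),
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9),
})

n_candidates = len(grid)      # one entry per (alpha, l1_ratio) pair
n_fits = n_candidates * 5     # each candidate is fitted once per CV fold

print(n_candidates, n_fits)   # 162 810
```

GridSearchCV then refits the best candidate once on the full training set (its default `refit=True`), so the total is one more fit than the cross-validation count.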
7. Evaluate on the hold‑out set
y_pred = search.predict(X_test)
# Take the square root explicitly; newer scikit-learn releases no longer accept squared=False
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.2f} per order | R²: {r2:.3f}")
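To see what the two reported numbers measure, the following sketch computes RMSE and R² by hand on three invented orders and compares them with the scikit-learn helpers. The values are made up purely for the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Three invented orders: actual vs predicted shipping cost in USD.
y_true = np.array([10.0, 12.0, 8.0])
y_hat = np.array([11.0, 10.0, 8.0])

residuals = y_true - y_hat                       # -1, 2, 0
rmse_manual = np.sqrt(np.mean(residuals ** 2))   # sqrt(5/3)
ss_res = np.sum(residuals ** 2)                  # 5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 8
r2_manual = 1.0 - ss_res / ss_tot                # 1 - 5/8

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print(round(rmse_manual, 3), r2_manual)  # 1.291 0.375
```

RMSE is in the same units as the target (USD per order), while R² is the share of variance explained relative to always predicting the mean.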
8. Interpret top coefficients
The coefficient chart highlights that same‑day shipping can add $5 – $7 per parcel, bulk‑order quantities slightly drop per‑item cost, and high‑discount flash sales spike shipping fees due to heavier order bundles.
# Recover full feature names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
# Reverse scaling for numeric coefficients
scales = search.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
# Copy first: coef_ is the fitted model's own array, and editing it in place would change later predictions
coefs = search.best_estimator_.named_steps['enet'].coef_.copy()
coefs[-len(num_cols):] = coefs[-len(num_cols):] / scales # USD per original unit of each numeric feature
imp = (pd.Series(coefs, index=feature_names)
       .sort_values(key=abs, ascending=False))
plt.figure(figsize=(9,5))
imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net Coefficients – Drivers of Expansion Shipping Cost')
plt.xlabel('Δ Shipping Cost (USD)')
plt.tight_layout()
plt.show()
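The division by the scaler's `scale_` in the step above can be checked on a tiny example. The sketch below swaps in plain `LinearRegression` instead of Elastic Net so that there is no shrinkage and the arithmetic comes out exact; the data and the 1.5 USD-per-unit slope are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Invented, noise-free relationship: cost rises 1.5 USD per extra unit.
quantity = np.arange(1, 11, dtype=float).reshape(-1, 1)
cost = 4.0 + 1.5 * quantity.ravel()

# Fit on the z-scaled feature, as the pipeline does for numeric columns.
scaler = StandardScaler().fit(quantity)
model_scaled = LinearRegression().fit(scaler.transform(quantity), cost)

# The fitted coefficient is USD per standard deviation of quantity;
# dividing by the standard deviation converts it to USD per unit.
per_std = model_scaled.coef_[0]
per_unit = per_std / scaler.scale_[0]

print(round(per_unit, 6))  # 1.5
```

The same conversion applies to the Elastic Net coefficients, with the caveat that those are shrunk estimates, so the rescaled values describe the penalised fit and not an unbiased per-unit effect.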
Summary
Using under 150 lines of well‑commented Python, we built a mixed (Elastic Net) regression pipeline that:
- Forecasts per-order shipping cost, a proxy for first-year expansion expenses, and reports hold-out RMSE and R² so the fit can be judged on unseen orders.
- Automatically prunes unimportant dummies while retaining correlated but valuable predictors, thanks to the Ridge‑Lasso blend.
- Explains itself via clear dollar-impact coefficients, giving ops teams actionable levers (carrier, ship mode, product mix) to tweak before launch.
Because preprocessing, hyper‑tuning, and inference live inside a single Pipeline, refreshing the model with the latest order data is a one‑liner: search.fit(new_X, new_y). Your next city‑launch budget just got a lot harder to blow.
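The refit-and-predict loop described above can be sketched end to end on synthetic data. Everything here is an assumption made for illustration: the cost rule, the column subset, the reduced grid, and the helper name `build_search` are invented, and the numbers it produces say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 300
synthetic = pd.DataFrame({
    'Ship Mode': rng.choice(['Standard Class', 'Second Class', 'Same Day'], size=n),
    'Quantity': rng.integers(1, 10, size=n),
    'Sales': rng.uniform(10, 500, size=n),
})
# Invented cost rule, used only so the sketch has something to learn.
synthetic['Shipping Cost'] = (
    2.0 + 0.02 * synthetic['Sales'] + 0.5 * synthetic['Quantity']
    + np.where(synthetic['Ship Mode'] == 'Same Day', 6.0, 0.0)
    + rng.normal(0, 0.5, size=n)
)

def build_search():
    """Hypothetical helper: a small version of the tutorial's pipeline and grid."""
    prep = ColumnTransformer(
        [('cat', OneHotEncoder(drop='first'), ['Ship Mode']),
         ('num', StandardScaler(), ['Quantity', 'Sales'])],
        sparse_threshold=0,  # always hand a dense array to the model
    )
    pipe = Pipeline([('prep', prep), ('enet', ElasticNet(max_iter=15000))])
    grid = {'enet__alpha': [0.01, 0.1, 1.0], 'enet__l1_ratio': [0.2, 0.8]}
    return GridSearchCV(pipe, grid, cv=3, scoring='neg_root_mean_squared_error')

X_syn = synthetic.drop(columns='Shipping Cost')
y_syn = synthetic['Shipping Cost']
search_syn = build_search().fit(X_syn, y_syn)   # the "one-liner" refresh

# Score a single order exactly as it would look at checkout.
new_order = pd.DataFrame([{'Ship Mode': 'Same Day', 'Quantity': 3, 'Sales': 150.0}])
predicted = float(search_syn.predict(new_order)[0])
print(f"Predicted shipping cost: ${predicted:.2f}")
```

Because the encoder and scaler live inside the pipeline, the new order is passed in raw form and transformed with the statistics learned during fitting.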