Storage Cost Prediction using Linear Regression in ML
Third-party logistics (3PL) operators charge customers a daily storage fee that depends on the pallet count, product class (ambient vs. chilled), cubic volume, and location. Being able to predict the fee upfront helps account teams quote accurately and allows warehouse managers to compare the margin of different contracts.
In this hands-on guide, we build a linear-regression baseline that predicts a pallet’s daily storage cost (USD) from routinely logged attributes: the number of pallets, product temperature class, cubic metres, weight, dwell-time category, and warehouse region. The model’s coefficients reveal which factors drive cost the most and provide a benchmark before experimenting with more sophisticated models.
Libraries Required
- pandas # tidy data handling
- numpy # numeric helpers
- matplotlib.pyplot # quick sanity plots (optional)
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained pipeline
Dataset Link
Step-by-Step Code Implementation
Why linear regression? Warehouse pricing is typically a sum of pallet-space charges, a temperature surcharge, and long-stay penalties, which makes it an almost linear formula. A linear model captures those effects and yields easily interpretable coefficients.
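Written out, the additive structure described above corresponds roughly to the model below. The symbols (β for numeric slopes, γ, δ, and η for the category offsets) are notation introduced here for illustration and do not come from the dataset or from any 3PL tariff.

```latex
\[
\widehat{\mathrm{cost}}
  = \beta_0
  + \beta_1\,\mathrm{pallets}
  + \beta_2\,\mathrm{cubic\_m3}
  + \beta_3\,\mathrm{weight\_kg}
  + \beta_4\,\mathrm{dwell\_days}
  + \gamma_{\mathrm{product\ class}}
  + \delta_{\mathrm{region}}
  + \eta_{\mathrm{dwell\ bucket}}
\]
```

Each term adds or subtracts a fixed dollar amount per day, which is why the fitted coefficients can be read directly as cost drivers.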
1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
Download warehouse_storage_costs.csv from the Kaggle dataset link and adjust the path:
df = pd.read_csv("warehouse_storage_costs.csv")
print(df.head())
Expected columns
| column | sample values |
| --- | --- |
| daily_cost_usd | 4.75 |
| pallets | 12 |
| product_class | Ambient / Chilled / Frozen |
| cubic_m3 | 10.4 |
| weight_kg | 3 250 |
| dwell_days | 34 |
| warehouse_region | Northeast / Midwest / South … |
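Before modelling, it can help to confirm that the loaded frame actually has these columns with usable types. The sketch below is one way to do that; `check_schema` is a hypothetical helper, the numeric/categorical split is an assumption based on the sample values, and the rows are hand-made for illustration. It also shows how a weight typed with a thousands separator, such as "3 250", ends up as text.

```python
import pandas as pd

# Column names come from the table above; the numeric/categorical split is an assumption.
NUMERIC_COLS = ["daily_cost_usd", "pallets", "cubic_m3", "weight_kg", "dwell_days"]
CATEGORICAL_COLS = ["product_class", "warehouse_region"]


def check_schema(frame):
    """Return a list of problems; an empty list means the frame looks usable."""
    problems = []
    for col in NUMERIC_COLS + CATEGORICAL_COLS:
        if col not in frame.columns:
            problems.append(f"missing column: {col}")
    for col in NUMERIC_COLS:
        if col in frame.columns and not pd.api.types.is_numeric_dtype(frame[col]):
            problems.append(f"non-numeric column: {col}")
    return problems


# Hand-made rows for illustration only (not taken from the real dataset).
sample = pd.DataFrame({
    "daily_cost_usd": [4.75, 6.10],
    "pallets": [12, 20],
    "product_class": ["Ambient", "Chilled"],
    "cubic_m3": [10.4, 18.0],
    "weight_kg": ["3 250", "5 100"],  # the space keeps these values as text
    "dwell_days": [34, 7],
    "warehouse_region": ["Northeast", "South"],
})

problems = check_schema(sample)
print(problems)

# One possible repair: strip the spaces, then convert to numbers.
sample["weight_kg"] = pd.to_numeric(
    sample["weight_kg"].str.replace(" ", "", regex=False)
)
print(check_schema(sample))
```

Running the same check on the real `df` right after `pd.read_csv` would surface missing or mistyped columns before they cause confusing errors further down the pipeline.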
3. Minimal cleaning
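Drop rows that are missing any core field, then bucket dwell time into stay-length categories. Standard scaling, applied later inside the pipeline, places pallets, volume, and weight on a comparable footing, so their coefficients are expressed as $/day per 1 σ change, a unit that is easy to explain to operations managers.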
core = ['daily_cost_usd', 'pallets', 'product_class', 'cubic_m3',
        'weight_kg', 'dwell_days', 'warehouse_region']
df = df.dropna(subset=core).copy()
# bucket dwell time into short / medium / long stays (optional)
# the open-ended last bin and include_lowest=True keep 0-day and >365-day stays from becoming NaN
df['dwell_bucket'] = pd.cut(df['dwell_days'],
                            bins=[0, 15, 60, np.inf],
                            labels=['short', 'medium', 'long'],
                            include_lowest=True)
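Bin edges matter at the boundaries, because `pd.cut` uses right-closed intervals by default and returns NaN for values outside the outermost edges. The sketch below, run on made-up dwell times, compares a last edge fixed at 365 with an open-ended last bin.

```python
import numpy as np
import pandas as pd

# Made-up dwell times chosen to sit on or outside the bin edges.
dwell = pd.Series([0, 15, 16, 60, 365, 400], name="dwell_days")
labels = ["short", "medium", "long"]

# Intervals are (0, 15], (15, 60], (60, 365]: 0 and 400 fall outside.
closed = pd.cut(dwell, bins=[0, 15, 60, 365], labels=labels)

# Intervals are [0, 15], (15, 60], (60, inf): every non-negative value is covered.
open_ended = pd.cut(dwell, bins=[0, 15, 60, np.inf], labels=labels,
                    include_lowest=True)

print(pd.DataFrame({"dwell_days": dwell,
                    "closed_bins": closed,
                    "open_ended": open_ended}))
```

Rows that end up as NaN in a bucket column would either be dropped or be treated as their own category downstream, so it is worth checking `isna().sum()` on the new column.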
4. Define predictors & label
num_cols = ['pallets', 'cubic_m3', 'weight_kg', 'dwell_days']
cat_cols = ['product_class', 'warehouse_region', 'dwell_bucket']
target = 'daily_cost_usd'
X = df[num_cols + cat_cols]
y = df[target]
5. Pre‑processing & model pipeline
One-hot encoding gives each warehouse region and product class its own additive cost offset without assuming any ordering among the categories.
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline([
    ('prep', preprocess),
    ('model', linreg)
])
6. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
7. Evaluation
Performance metrics: R² is the proportion of cost variance the model explains on the held-out data; MAE, expressed in dollars, tells account reps their typical quoting error.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):.2f} per pallet‑day")
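For readers who want to see what the two metrics compute, the sketch below reproduces them by hand with NumPy on four made-up values and checks the results against scikit-learn. The numbers are illustrative only and are not results from the warehouse dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted daily costs (USD), for illustration only.
y_true = np.array([4.0, 5.0, 6.0, 9.0])
y_hat = np.array([4.5, 5.0, 5.0, 8.5])

# MAE: average absolute miss, in the same unit as the target.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual={mae_manual:.3f} sklearn={mean_absolute_error(y_true, y_hat):.3f}")
print(f"R²  manual={r2_manual:.3f} sklearn={r2_score(y_true, y_hat):.3f}")
```

Since R² compares the model against always predicting the mean, a value near zero means the features add little beyond a flat average price.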
8. Inspect cost drivers
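The coefficient table points to actionable levers: a large positive coefficient for product_class_Frozen estimates the freezer surcharge, and a negative coefficient for the South region might reflect cheaper land, which can inform expansion strategy. Because every category level gets its own column alongside the intercept, read these coefficients relative to one another (for example, Frozen minus Ambient) rather than as absolute amounts.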
# pull feature names after one‑hot encoding
ohe_feats = (pipe.named_steps['prep']
                 .named_transformers_['cat']
                 .get_feature_names_out(cat_cols))
# order must match the ColumnTransformer output: one-hot columns first, then numeric
all_feats = list(ohe_feats) + num_cols
coefs = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
           .sort_values())
print("\nCost‑reducing factors (negative coefficients):")
print(coefs.head(6))
print("\nCost‑increasing factors (positive coefficients):")
print(coefs.tail(6))
Because numeric inputs are z-scored, each numeric coefficient reads as the $/day change for a one-σ shift in that predictor; each one-hot coefficient is the $/day offset for that category.
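If a per-unit figure is easier to communicate than a per-σ one, a standardized coefficient can be converted by dividing it by the standard deviation the scaler learned. The sketch below demonstrates this on noise-free toy data with a hypothetical rate of $0.35 per pallet, so the recovered value can be checked against a known answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: cost rises by a hypothetical $0.35 per extra pallet, with no noise.
rng = np.random.default_rng(0)
pallets = rng.integers(1, 40, size=200).astype(float)
cost = 2.0 + 0.35 * pallets

toy = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
toy.fit(pallets.reshape(-1, 1), cost)

coef_per_sigma = toy.named_steps["model"].coef_[0]  # $/day per 1 sigma of pallets
sigma = toy.named_steps["scale"].scale_[0]          # std dev learned by the scaler
coef_per_unit = coef_per_sigma / sigma              # $/day per single pallet

print(f"per sigma: {coef_per_sigma:.3f}  sigma: {sigma:.2f}  per pallet: {coef_per_unit:.3f}")
```

In the tutorial's pipeline the fitted scaler sits at `pipe.named_steps['prep'].named_transformers_['num']`, and its `scale_` array follows the order of `num_cols`.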
9. Persist the trained pipeline
Saving the whole pipeline with joblib keeps the preprocessing and the regression weights together, so a quoting tool can later load a single .pkl file and return a price as soon as a customer enters their pallet, volume, and temperature requirements.
joblib.dump(pipe, "storage_cost_linreg.pkl")
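The loading side could look like the sketch below. To stay runnable on its own, it fits a reduced two-feature pipeline on made-up rows, saves it to a temporary folder, reloads it, and prices one request; the file name `toy_linreg.pkl` and all values are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows, only to have something to fit and save.
toy = pd.DataFrame({
    "pallets": [2, 5, 8, 12, 20, 30],
    "product_class": ["Ambient", "Chilled", "Frozen", "Ambient", "Chilled", "Frozen"],
    "daily_cost_usd": [1.0, 3.5, 6.0, 4.8, 9.5, 14.0],
})
features = ["pallets", "product_class"]

toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["product_class"]),
        ("num", StandardScaler(), ["pallets"]),
    ])),
    ("model", LinearRegression()),
])
toy_pipe.fit(toy[features], toy["daily_cost_usd"])

# Save to a temporary folder, then load it back as a quoting tool would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_linreg.pkl")
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

# A request must be a DataFrame with the same column names used in training.
request = pd.DataFrame([{"pallets": 10, "product_class": "Chilled"}])
quote_before = toy_pipe.predict(request)
quote_after = restored.predict(request)
print(f"Quoted daily cost: ${quote_after[0]:.2f}")
```

For the full pipeline, a request needs every predictor column, including `dwell_bucket` built with the same `pd.cut` call used in training. Pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that wrote them.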
Summary
In under 100 lines of Python, we transformed raw warehouse logs into an explainable storage-cost estimator. The linear model:
- Delivers instant, justifiable quotes to sales teams and customers alike.
- Reveals transparent cost levers, showing how pallets, volume, weight, and temperature class push daily storage fees up or down.
Use this interpretable baseline as your yardstick: every boosted tree or optimisation engine you roll out next must beat its mean‑absolute‑error while still telling a story the logistics team can trust.
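A fair comparison scores the baseline and any challenger on the same held-out rows. The sketch below shows that pattern on synthetic data generated from a linear formula plus noise; `GradientBoostingRegressor` stands in for "a boosted tree", and the data, coefficients, and the helper `holdout_mae` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic features and target: a linear signal plus noise (illustrative only).
rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 10, size=(300, 3))
y_syn = 1.5 + X_syn @ np.array([0.4, 1.2, -0.3]) + rng.normal(0, 0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42)


def holdout_mae(model):
    """Fit on the shared training rows and score on the shared test rows."""
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_te, model.predict(X_te))


baseline_mae = holdout_mae(LinearRegression())
challenger_mae = holdout_mae(GradientBoostingRegressor(random_state=42))

# Adopt the challenger only if it actually lowers the error.
keep_challenger = bool(challenger_mae < baseline_mae)
print(f"baseline MAE={baseline_mae:.3f}  challenger MAE={challenger_mae:.3f}  "
      f"keep challenger: {keep_challenger}")
```

On real warehouse data the outcome can go either way; the point is only that both models see identical splits, so the MAE difference reflects the model and not the sample.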