Storage Cost Prediction using Linear Regression in ML

Third-party logistics (3PL) operators charge customers a daily storage fee that depends on the pallet count, product class (ambient vs. chilled), cubic volume, and location. Being able to predict the fee upfront helps account teams quote accurately and allows warehouse managers to compare the margin of different contracts.

In this hands-on guide, we build a linear-regression baseline that predicts a pallet’s daily storage cost (USD) from routinely logged attributes: the number of pallets, product temperature class, cubic metres, weight, dwell-time category, and warehouse region. The model’s coefficients reveal which factors drive cost the most and provide a benchmark before experimenting with more sophisticated models.

Libraries Required

pandas # tidy data handling
numpy # numeric helpers
matplotlib.pyplot # quick sanity plots (optional)
scikit‑learn # preprocessing, model, metrics
joblib # persist the trained pipeline

Dataset Link

Warehouse Dataset

Step-by-Step Code Implementation

Why linear regression? Warehouse pricing is typically the sum of a pallet‑space charge, a temperature surcharge, and long‑stay penalties, which makes it close to a linear formula. A linear model captures those additive effects and yields easily interpretable coefficients.
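To make the additive argument concrete, here is a minimal sketch on synthetic data. The tariff (rates, surcharge, and penalty) is invented purely for illustration and is not taken from the warehouse dataset; the point is only that a linear model recovers the rates of an additive price formula.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical tariff, invented for this sketch:
# $0.40 per pallet + $2.50 chilled surcharge + $1.00 long-stay penalty
pallets = rng.integers(1, 40, size=n)
is_chilled = rng.integers(0, 2, size=n)
is_long_stay = rng.integers(0, 2, size=n)
cost = 0.40 * pallets + 2.50 * is_chilled + 1.00 * is_long_stay

X_demo = np.column_stack([pallets, is_chilled, is_long_stay])
demo_model = LinearRegression().fit(X_demo, cost)

# Fitted coefficients should match the three invented rates
print(np.round(demo_model.coef_, 2))
```

Real tariffs include noise and interactions, so the fit on actual data will be looser than in this noise-free example.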

1. Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load the data

Download warehouse_storage_costs.csv from the Kaggle dataset link and adjust the path:

df = pd.read_csv("warehouse_storage_costs.csv")
print(df.head())

Expected columns

column	sample values
daily_cost_usd	4.75
pallets	12
product_class	Ambient / Chilled / Frozen
cubic_m3	10.4
weight_kg	3 250
dwell_days	34
warehouse_region	Northeast / Midwest / South …
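Before going further, it can help to confirm that the file actually contains these columns. The sketch below uses a hypothetical helper, missing_columns, and a one-row sample built from the values in the table above (reading "3 250" as 3250 is an assumption).

```python
import pandas as pd

EXPECTED_COLUMNS = ['daily_cost_usd', 'pallets', 'product_class', 'cubic_m3',
                    'weight_kg', 'dwell_days', 'warehouse_region']

def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from the frame."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]

# One illustrative row using the sample values from the table
sample = pd.DataFrame([{
    'daily_cost_usd': 4.75, 'pallets': 12, 'product_class': 'Ambient',
    'cubic_m3': 10.4, 'weight_kg': 3250, 'dwell_days': 34,
    'warehouse_region': 'Northeast',
}])

print(missing_columns(sample))                               # nothing missing
print(missing_columns(sample.drop(columns=['weight_kg'])))   # weight_kg missing
```

Calling missing_columns(df) right after read_csv gives an early, readable failure instead of a KeyError several steps later.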

3. Minimal cleaning

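This step drops rows that are missing any core field and, optionally, buckets dwell time into short, medium, and long stays. Standard scaling, applied later inside the pipeline, places pallets, volume, and weight on a comparable footing, so each numeric coefficient is expressed as $/day per 1 σ change, which lets operations managers compare drivers directly.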

core = ['daily_cost_usd', 'pallets', 'product_class', 'cubic_m3',
        'weight_kg', 'dwell_days', 'warehouse_region']
df = df.dropna(subset=core).copy()

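# bucket dwell time (short / medium / long stay) – optional
# include_lowest keeps 0-day stays; np.inf keeps stays longer than a year
df['dwell_bucket'] = pd.cut(df['dwell_days'],
                            bins=[0, 15, 60, np.inf],
                            labels=['short', 'medium', 'long'],
                            include_lowest=True)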
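pd.cut labels only values that fall inside the outermost bin edges, and by default each bin excludes its left edge. A quick check on invented dwell values, assuming finite edges of 0 and 365 days, shows which rows would end up without a bucket:

```python
import pandas as pd

# Invented dwell times, chosen to sit on and around the bin edges
days = pd.Series([0, 10, 15, 16, 60, 365, 400], name='dwell_days')

buckets = pd.cut(days,
                 bins=[0, 15, 60, 365],
                 labels=['short', 'medium', 'long'])

print(pd.DataFrame({'dwell_days': days, 'bucket': buckets}))

# Rows outside (0, 365] receive NaN instead of a label
n_unbucketed = int(buckets.isna().sum())
print(f"Rows without a bucket: {n_unbucketed}")
```

Here the 0-day and 400-day stays receive no label. Counting NaN buckets on the real data before encoding shows whether the chosen edges cover every row.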

4. Define predictors & label

num_cols = ['pallets', 'cubic_m3', 'weight_kg', 'dwell_days']
cat_cols = ['product_class', 'warehouse_region', 'dwell_bucket']
target   = 'daily_cost_usd'

X = df[num_cols + cat_cols]
y = df[target]

5. Pre‑processing & model pipeline

One-hot encoding gives each warehouse region, product class, and dwell bucket its own additive shift in the predicted cost, without assuming any ordering among the categories.

preprocess = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline([
        ('prep',  preprocess),
        ('model', linreg)
])

6. Train‑test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)

7. Evaluation

Performance metrics: R² indicates the proportion of cost variance the model explains on the held-out data; MAE, expressed in dollars, tells account reps the typical size of a quoting error.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):.2f} per pallet‑day")
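For readers who want to see what the two metrics compute, here is a small worked example on four invented cost values, checked against scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Invented actual and predicted daily costs (USD)
y_true = np.array([4.0, 6.0, 8.0, 10.0])
y_hat  = np.array([4.5, 5.5, 8.5, 9.0])

# MAE: average absolute quoting error
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual {mae_manual:.4f} | sklearn {mean_absolute_error(y_true, y_hat):.4f}")
print(f"R²  manual {r2_manual:.4f} | sklearn {r2_score(y_true, y_hat):.4f}")
```

The absolute errors are 0.5, 0.5, 0.5, and 1.0, so the MAE is 0.625; the squared errors sum to 1.75 against a total of 20, giving R² = 0.9125.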

8. Inspect cost drivers

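The coefficient table highlights actionable levers: a substantial positive weight for product_class_Frozen points to a freezer surcharge, and a negative coefficient for the South region might reflect cheaper land, informing expansion strategy. Because every category gets its own dummy column alongside the model's intercept, read categorical coefficients relative to one another: the gap between product_class_Frozen and product_class_Ambient, rather than either value alone, estimates the surcharge.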

# pull feature names after one‑hot encoding
ohe_feats = pipe.named_steps['prep']\
                .named_transformers_['cat']\
                .get_feature_names_out(cat_cols)

all_feats = list(ohe_feats) + num_cols
coefs     = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
             .sort_values())

print("\nCost‑reducing factors (negative coefficients):")
print(coefs.head(6))
print("\nCost‑increasing factors (positive coefficients):")
print(coefs.tail(6))

Because numeric inputs are z‑scored, each numeric coefficient reads as the $/day change for a one σ shift in that predictor; each one‑hot coefficient is the $/day shift associated with that category.
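If a per-unit figure is easier to communicate than a per-σ one, a standardised coefficient can be divided by the scaler's standard deviation. The sketch below shows the conversion on synthetic data with an invented rate of $0.40 per pallet; the step names 'scale' and 'model' belong to this toy pipeline only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
pallets = rng.integers(1, 40, size=200).astype(float)
cost = 0.40 * pallets + 3.0          # invented tariff, no noise

demo = Pipeline([
    ('scale', StandardScaler()),
    ('model', LinearRegression()),
])
demo.fit(pallets.reshape(-1, 1), cost)

coef_per_sigma = demo.named_steps['model'].coef_[0]   # $/day per 1 σ of pallets
sigma = demo.named_steps['scale'].scale_[0]           # σ of pallets in the training data
coef_per_pallet = coef_per_sigma / sigma              # $/day per extra pallet

print(f"per σ: {coef_per_sigma:.3f}  |  per pallet: {coef_per_pallet:.3f}")
```

In the main pipeline the equivalent scaler would be reached through the 'prep' step's 'num' transformer, with one scale_ entry per numeric column.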

9. Persist the trained pipeline

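Saving the pipeline with joblib bundles the preprocessing steps and the regression weights in a single file, so a quoting tool can later load the .pkl file and return a price as soon as a customer enters their pallet, volume, and temperature requirements. The input passed to the loaded pipeline must carry the same columns that were used for training.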

joblib.dump(pipe, "storage_cost_linreg.pkl")
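The round trip can be sketched end to end on a toy pipeline. The training rows, the file name toy_storage_cost.pkl, and the two-column request are all invented for illustration; a real request would need every column the full pipeline was trained on.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny invented training set standing in for the warehouse data
train = pd.DataFrame({
    'pallets':        [2, 4, 6, 8, 10, 12],
    'product_class':  ['Ambient', 'Chilled', 'Frozen',
                       'Ambient', 'Chilled', 'Frozen'],
    'daily_cost_usd': [1.0, 3.5, 6.0, 3.4, 5.9, 8.4],
})

toy_pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['product_class']),
        ('num', StandardScaler(),                      ['pallets']),
    ])),
    ('model', LinearRegression()),
])
toy_pipe.fit(train[['pallets', 'product_class']], train['daily_cost_usd'])

# Save to a temporary folder, then load it back as a quoting tool would
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_storage_cost.pkl')
    joblib.dump(toy_pipe, path)
    reloaded = joblib.load(path)

# A new request must use the same column names as the training frame
request = pd.DataFrame([{'pallets': 5, 'product_class': 'Chilled'}])
quote_before = toy_pipe.predict(request)
quote_after = reloaded.predict(request)
print(f"Quoted daily cost: ${quote_after[0]:.2f}")
```

Pickled pipelines are tied to the library versions that produced them, so it is safer to load the file with the same scikit-learn version used for training, and only from a trusted source.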

Summary

In under 100 lines of Python, we transformed raw warehouse logs into an explainable storage-cost estimator. The linear model:

Delivers instant, justifiable quotes to sales teams and customers alike.
Reveals transparent cost levers—showing exactly how pallets, volume, weight, and temperature class tug daily storage fees up or down.

Use this interpretable baseline as your yardstick: every boosted tree or optimisation engine you roll out next must beat its mean‑absolute‑error while still telling a story the logistics team can trust.
