Shipping Rate Prediction using Linear Regression in ML
Freight forwarders and logistics start‑ups must quote door‑to‑door shipping rates in seconds, yet each quote depends on weight, distance band, transport mode, service speed, and fuel surcharges. Overpricing scares customers away, while underpricing erodes margins.
In this walkthrough, we create a linear‑regression baseline that predicts a shipment’s all‑in rate (USD) from routinely captured booking information: package weight, volumetric weight, origin‑to‑destination distance band, service tier (standard / express), transport mode (air / sea / truck), shipment type (parcel / pallet / container), declared value, and fuel‑price month.
A transparent linear model reveals first‑order cost drivers and serves as the yardstick before you move on to gradient‑boosted trees or network‑pricing engines.
Libraries Required
- pandas # data wrangling
- numpy # numerical helpers
- matplotlib.pyplot # optional quick plots
- scikit‑learn # preprocessing, model, metrics
- joblib # save the trained pipeline
Dataset Link
Supply‑Chain Shipment Pricing Data
Step-by-Step Implementation
Why linear regression? Freight tariffs are typically a base fee plus additional surcharges for weight, distance, service speed, and special handling. OLS captures this additive logic and outputs coefficients that pricing managers can sanity‑check.
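The additive logic can be illustrated with a small synthetic sketch. The tariff below (base fee, per-kilogram charge, express surcharge) and all of its numbers are made up for illustration and do not come from the dataset; the point is only that OLS recovers such additive components when the data really follow that structure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical tariff components (illustrative values, not from the dataset)
BASE_FEE, PER_KG, EXPRESS_FEE = 50.0, 1.2, 80.0

rng = np.random.default_rng(0)
weight = rng.uniform(1, 1000, size=500)      # shipment weight in kg
express = rng.integers(0, 2, size=500)       # 1 = express, 0 = standard
noise = rng.normal(0, 5, size=500)           # small unexplained variation

# Additive cost: base fee + weight charge + service surcharge
cost = BASE_FEE + PER_KG * weight + EXPRESS_FEE * express + noise

X = np.column_stack([weight, express])
ols = LinearRegression().fit(X, cost)

per_kg_hat, express_hat = ols.coef_
base_hat = ols.intercept_
print(f"base ≈ {base_hat:.1f}, per kg ≈ {per_kg_hat:.2f}, express ≈ {express_hat:.1f}")
```

Real freight data are rarely this clean, so the recovered coefficients should be treated as first-order approximations rather than the true tariff.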
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the dataset
df = pd.read_csv("SCMS_Delivery_History_Dataset.csv") # path after un‑zip
print(df.head())
Typical columns in Supply‑Chain Shipment Pricing Data
| Column | Example |
| --- | --- |
| FreightCostUSD | 2 145 |
| Weight_kgs | 780 |
| Volume_m3 | 5.3 |
| DistanceGroup | 150‑500 km |
| TransportMode | Air / Road / Sea |
| ServiceLevel | Express / Standard |
| ShipmentType | Parcel / Pallet / Container |
| DeclaredValueUSD | 54 000 |
| FuelPriceIndex | 0.87 |
3. Minimal cleaning & feature lists
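Standard scaling (applied in the pipeline in step 4) puts kilograms, cubic metres, declared value, and fuel index on a comparable scale with zero mean and unit variance, so each numeric coefficient reads as dollars per one‑σ (one standard deviation) shift in that input.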
core = ['FreightCostUSD', 'Weight_kgs', 'Volume_m3', 'DistanceGroup',
        'TransportMode', 'ServiceLevel', 'ShipmentType',
        'DeclaredValueUSD', 'FuelPriceIndex']
df = df.dropna(subset=core).copy()
num_cols = ['Weight_kgs','Volume_m3','DeclaredValueUSD','FuelPriceIndex']
cat_cols = ['DistanceGroup','TransportMode','ServiceLevel','ShipmentType']
target = 'FreightCostUSD'
X = df[num_cols + cat_cols]
y = df[target]
4. Pre‑processing & model pipeline
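One-hot encoding avoids imposing a fake numeric order on service tiers or distance buckets and gives each category its own dollar offset. With handle_unknown='ignore', a category that was never seen during training is encoded as all zeros instead of raising an error.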
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline([
    ('prep', preproc),
    ('model', linreg)
])
5. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
6. Evaluation
Performance metrics – R² indicates the proportion of cost variance that the simple formula captures; MAE (e.g., ±$180) informs sales teams about their typical quoting error band.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f} per shipment")
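For readers who want to see what the two metrics compute, the sketch below reproduces them by hand on four made-up shipments (the dollar values are illustrative only) and checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Made-up actual and predicted freight costs in USD
y_true = np.array([1200.0, 800.0, 2500.0, 400.0])
y_hat = np.array([1100.0, 900.0, 2300.0, 600.0])

# MAE: average absolute quoting error
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual = {mae_manual:.1f}, sklearn = {mean_absolute_error(y_true, y_hat):.1f}")
print(f"R²  manual = {r2_manual:.4f}, sklearn = {r2_score(y_true, y_hat):.4f}")
```

Note that MAE is an average, not a bound: individual quotes can miss by considerably more than the reported figure.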
7. Interpret rate drivers
The coefficient table surfaces high‑impact levers: for example, TransportMode_Air might add $420 while DistanceGroup_0‑150 km subtracts $115, which is direct input for discount matrices.
ohe_feats = (pipe.named_steps['prep']
                 .named_transformers_['cat']
                 .get_feature_names_out(cat_cols))
# Order matches the ColumnTransformer: categorical block first, then numeric
all_feats = list(ohe_feats) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
          .sort_values())
print("\nCost‑reducing factors (negative coefficients):")
print(coef.head(8))
print("\nCost‑increasing factors (positive coefficients):")
print(coef.tail(8))
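Numeric coefficients are expressed as $/shipment for a one‑σ change. One‑hot coefficients are dollar offsets per category; because the encoder above drops no category, there is no single reference level, so compare the offsets of categories within the same feature rather than reading any one of them in isolation.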
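A one‑σ coefficient can be converted back to natural units by dividing it by the standard deviation the scaler learned. The sketch below assumes a toy single-feature pipeline on synthetic data (the $2.00 per kg slope is invented for illustration); the same lookup through named_steps and scale_ should carry over to the full pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: cost = 100 + 2.0 USD per kg + noise (illustrative only)
rng = np.random.default_rng(1)
demo = pd.DataFrame({"Weight_kgs": rng.uniform(10, 1000, 300)})
demo["FreightCostUSD"] = 100 + 2.0 * demo["Weight_kgs"] + rng.normal(0, 10, 300)

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([("num", StandardScaler(), ["Weight_kgs"])])),
    ("model", LinearRegression()),
])
demo_pipe.fit(demo[["Weight_kgs"]], demo["FreightCostUSD"])

# Coefficient on the standardized feature: USD per one-sigma change in weight
scaler = demo_pipe.named_steps["prep"].named_transformers_["num"]
usd_per_sigma = demo_pipe.named_steps["model"].coef_[0]

# Divide by the fitted standard deviation to get USD per kilogram
usd_per_kg = usd_per_sigma / scaler.scale_[0]
print(f"{usd_per_sigma:.0f} USD per sigma  ->  {usd_per_kg:.2f} USD per kg")
```

Pricing teams usually think in per-unit rates, so reporting both views alongside each other tends to make the coefficient table easier to sanity-check.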
8. Persist the trained pipeline
Persisting the pipeline with joblib saves preprocessing and coefficients together; a quoting API can later call joblib.load("shipping_rate_linreg.pkl"), turn the booking JSON into a one‑row DataFrame with the same column names, and return a price, typically within milliseconds.
joblib.dump(pipe, "shipping_rate_linreg.pkl")
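The load-and-quote step might look like the sketch below. The quote helper, the two-feature toy pipeline, and the six training rows are all hypothetical stand-ins so the example runs on its own; a real service would load the full pipeline once at start-up rather than on every request.

```python
import json
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up training frame standing in for the real shipment log
train = pd.DataFrame({
    "Weight_kgs": [100.0, 500.0, 900.0, 300.0, 700.0, 200.0],
    "TransportMode": ["Air", "Sea", "Road", "Air", "Sea", "Road"],
    "FreightCostUSD": [900.0, 1200.0, 1500.0, 1300.0, 1500.0, 600.0],
})
demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["TransportMode"]),
        ("num", StandardScaler(), ["Weight_kgs"]),
    ])),
    ("model", LinearRegression()),
])
demo_pipe.fit(train[["Weight_kgs", "TransportMode"]], train["FreightCostUSD"])

# Persist to a temporary directory so the sketch leaves no files behind
path = os.path.join(tempfile.mkdtemp(), "shipping_rate_linreg.pkl")
joblib.dump(demo_pipe, path)


def quote(booking_json: str, model_path: str) -> float:
    """Hypothetical helper: parse a booking JSON and return a predicted rate."""
    model = joblib.load(model_path)
    row = pd.DataFrame([json.loads(booking_json)])  # one-row frame, same column names
    return float(model.predict(row)[0])


booking = json.dumps({"Weight_kgs": 400.0, "TransportMode": "Air"})
price = quote(booking, path)
print(f"Quoted rate: ${price:,.2f}")
```

Only load pickle files from sources you trust, and pin the scikit-learn version used for training, since pickled pipelines are not guaranteed to load across versions.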
Summary
With a few dozen lines of Python, we’ve turned raw shipment logs into an explainable shipping‑rate engine:
- Instant, data‑backed quotes for sales reps and self‑serve booking portals.
- Crystal‑clear surcharges & discounts that reveal precisely how weight, distance, mode, and service tier tug cost up or down.
This transparent linear baseline is your benchmark—every boosted tree, neural tariff model, or optimisation engine you test next must beat its MAE while still telling a story that the pricing team can trust.