Customer Spending Prediction using Linear Regression in ML
E‑commerce companies log everything—from a visitor’s session length to the number of minutes they spend browsing the mobile app. Converting those behavioural breadcrumbs into a dollar‑value forecast of annual spend lets marketers target discounts wisely, finance teams budget revenue more realistically, and product managers decide whether to invest in mobile or web features.
In this project, we build a linear regression baseline that predicts a customer’s Yearly Amount Spent (USD) from four readily available engagement metrics: average session length, time on app, time on website, and length of membership.
Libraries Required
- pandas # data wrangling
- numpy # numerical helpers
- matplotlib.pyplot # sanity‑check visuals
- seaborn # quick pair‑plots / heatmaps (optional)
- scikit‑learn # model building & evaluation
- joblib # save the trained model
Dataset Link
Step-by-Step Code Implementation
Why linear regression? For marketing teams, a transparent line with four coefficients beats a black‑box forest when explaining why the model recommends a mobile‑channel push.
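Written out, the "transparent line with four coefficients" has the form below. The symbols are our own shorthand rather than notation from the dataset: \hat{y} is the predicted Yearly Amount Spent, \beta_0 is the intercept, and each x is one of the four engagement metrics.

```latex
\hat{y} = \beta_0
        + \beta_1 \, x_{\text{session}}
        + \beta_2 \, x_{\text{app}}
        + \beta_3 \, x_{\text{web}}
        + \beta_4 \, x_{\text{membership}}
```

Ordinary least squares, which scikit-learn's LinearRegression uses, picks the \beta values that minimise the sum of squared differences between \hat{y} and the observed spend on the training rows.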
1. Import core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load the data
Download the CSV file from Kaggle and point to the local path:
df = pd.read_csv("Ecommerce Customers.csv")
print(df.head())
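Before going further, it can help to confirm that the columns used in the later steps are present and to count missing values. The following is a minimal sketch: `check_columns` is a hypothetical helper of ours, and the three-row frame is made-up stand-in data, so swap in the `df` loaded above when running it for real.

```python
import pandas as pd

# Column names as they are used later in this walkthrough.
EXPECTED_COLUMNS = [
    "Avg. Session Length",
    "Time on App",
    "Time on Website",
    "Length of Membership",
    "Yearly Amount Spent",
]


def check_columns(df, expected=EXPECTED_COLUMNS):
    """Raise if an expected column is absent; return missing-value counts."""
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in the data: {missing}")
    return df[expected].isna().sum()


# Hypothetical stand-in rows (not taken from the real CSV).
demo = pd.DataFrame({
    "Avg. Session Length": [34.5, 31.9, None],
    "Time on App": [12.7, 11.1, 11.3],
    "Time on Website": [39.6, 37.3, 37.1],
    "Length of Membership": [4.1, 2.7, 4.1],
    "Yearly Amount Spent": [587.9, 392.2, 487.5],
})

na_counts = check_columns(demo)
print(na_counts)
```

If any count is non-zero, decide how to handle those rows (drop or impute) before fitting, because LinearRegression does not accept missing values.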
3. Pair‑plot first
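A quick scatter‑matrix highlights which features track Yearly Amount Spent roughly linearly. Because the target is listed last, read along the bottom row and check which metric shows the clearest trend rather than assuming it in advance.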
# Visual feel for relationships
sns.pairplot(df[['Avg. Session Length', 'Time on App',
                 'Time on Website', 'Length of Membership',
                 'Yearly Amount Spent']])
plt.show()
print(df.describe())
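The visual impression from the pair‑plot can be backed up with numbers by ranking each feature's Pearson correlation with the target. This is a sketch under assumptions: `rank_by_correlation` is a hypothetical helper, and the demo frame is synthetic, with one column built to drive the target strongly and one weakly.

```python
import numpy as np
import pandas as pd


def rank_by_correlation(df, target):
    """Correlation of each numeric column with `target`, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in data (hypothetical column names).
rng = np.random.default_rng(0)
strong = rng.normal(size=200)
weak = rng.normal(size=200)
demo = pd.DataFrame({
    "strong_driver": strong,
    "weak_driver": weak,
    "target": 3.0 * strong + 0.1 * weak + rng.normal(scale=0.1, size=200),
})

ranking = rank_by_correlation(demo, "target")
print(ranking)
```

On the real data the call would be `rank_by_correlation(df, 'Yearly Amount Spent')`. Correlation only captures pairwise linear association, so treat it as a companion to the plot, not a replacement for the fitted coefficients.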
4. Feature matrix & target vector
features = ['Avg. Session Length',
            'Time on App',
            'Time on Website',
            'Length of Membership']
X = df[features]
y = df['Yearly Amount Spent']
5. Train–test split
A train–test split lets us judge performance on customers the model has never seen, which is critical when generalising to next quarter’s cohort.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
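To make the two arguments concrete, the sketch below runs the same call on a synthetic 500-row frame (hypothetical data and column names, not the real CSV). It shows that `test_size=0.2` holds out a fifth of the rows, that train and test never share a row, and that fixing `random_state` reproduces the same split on every run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the feature matrix and target: 500 rows.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)),
                      columns=["f1", "f2", "f3", "f4"])
y_demo = pd.Series(rng.normal(size=500), name="target")

# Same arguments as in the step above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Repeat the call to confirm the split is reproducible.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

no_overlap = set(X_tr.index).isdisjoint(X_te.index)
same_split = X_te.index.equals(X_te2.index)

print(len(X_tr), len(X_te))   # 400 100
print(no_overlap, same_split)
```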
6. Model training
linreg = LinearRegression()
linreg.fit(X_train, y_train)
7. Metrics chosen
- R² (goodness of fit) tells us what fraction of the variance our four metrics explain.
- MAE (mean absolute error, in dollars) gives a wallet‑level feel: if MAE ≈ $300, forecasts miss actual spend by about $300 on average.
y_pred = linreg.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.2f}")
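To see what the two scores measure, here is a small worked example that computes them by hand with NumPy. The four actual and predicted values are made up for illustration, and `mae` and `r2` are our own helper names; in the project itself, `r2_score` and `mean_absolute_error` do this work.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in target units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical yearly spend (USD) for four customers.
actual = [500, 600, 400, 550]
predicted = [480, 630, 410, 540]

# Absolute misses are 20, 30, 10, 10, so MAE = 70 / 4 = 17.5.
print(f"MAE : ${mae(actual, predicted):,.2f}")
# Residual SS = 1500, total SS = 21875, so R^2 = 1 - 1500/21875.
print(f"R²  : {r2(actual, predicted):.3f}")
```

R² is unitless and compares the model against always predicting the mean, while MAE stays in dollars, which is why the two are reported together.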
8. Interpret coefficients
The coefficient table shows how much predicted yearly spend changes per unit of each feature. If “Time on App” has a much larger coefficient than “Time on Website”, the product team can justify investing in mobile UX.
coef_df = pd.DataFrame({
    'feature': features,
    'coefficient': linreg.coef_
}).sort_values('coefficient', ascending=False)
print(coef_df)
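A positive coefficient means that each extra unit of that feature raises predicted spending by the coefficient’s value in dollars, holding the other features fixed; a negative coefficient implies the opposite. Because the features may be measured in different units, compare raw coefficients across features with care.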
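One common way to compare features on an equal footing is to multiply each coefficient by its feature's standard deviation, giving the change in predicted spend per typical swing in that feature. This rescaling is our addition, not part of the original workflow. The sketch uses synthetic data with hypothetical column names, built so that the raw and rescaled rankings disagree.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features on very different scales (hypothetical names).
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "narrow_range": rng.normal(10.0, 0.1, n),  # barely varies
    "wide_range": rng.normal(10.0, 5.0, n),    # varies a lot
})
y_demo = (50.0 * X_demo["narrow_range"]
          + 10.0 * X_demo["wide_range"]
          + rng.normal(0.0, 1.0, n))

model = LinearRegression().fit(X_demo, y_demo)

effects = pd.DataFrame({
    "feature": X_demo.columns,
    "coefficient": model.coef_,
    # Change in prediction for a one-standard-deviation move.
    "per_std": model.coef_ * X_demo.std().to_numpy(),
}).set_index("feature")

print(effects)
```

Here `narrow_range` has the larger raw coefficient, but `wide_range` moves the prediction far more in practice because it varies more. With the real model, the same table can be built from `linreg.coef_` and `X_train.std()`.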
9. Model persistence
Model persistence with joblib lets you drop the .pkl file into a Flask API or nightly batch job without retraining.
joblib.dump(linreg, "customer_spending_linreg.pkl")
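The other half of persistence is loading the file back and scoring new customers. The sketch below is self-contained, so it fits a throwaway model on synthetic, noise-free data and writes to a temporary directory; the weights, intercept and new-customer values are all hypothetical. In the project you would simply call `joblib.load("customer_spending_linreg.pkl")`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["Avg. Session Length", "Time on App",
            "Time on Website", "Length of Membership"]

# Synthetic training data with made-up weights, so the sketch runs
# without the CSV.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(1, 40, size=(50, 4)), columns=features)
weights = np.array([20.0, 40.0, 1.0, 60.0])
y_demo = X_demo.to_numpy() @ weights - 1000.0

model = LinearRegression().fit(X_demo, y_demo)

# Save and reload, as a Flask API or batch job would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "customer_spending_linreg.pkl")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# New rows must use the same column names and order as in training.
new_customers = pd.DataFrame([[33.0, 12.0, 37.0, 3.5]], columns=features)
forecast = loaded.predict(new_customers)
print(f"Forecast yearly spend: ${forecast[0]:,.2f}")
```

Pickle-based files should only be loaded from sources you trust, and it is safest to load them with the same scikit-learn version that saved them.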
Summary
In under fifty lines of Python, we transformed raw engagement logs into an actionable spending‑prediction tool. The linear regression delivers two wins: (1) a numeric forecast every time marketing uploads fresh behaviour metrics, and (2) easy‑to‑explain coefficients that spotlight high‑ROI levers. Keep this interpretable baseline as your benchmark; when you later explore regularised regressors or gradient boosting, you’ll know exactly how much extra accuracy the added complexity buys.