Data Plan Usage Prediction using Linear Regression in ML
Mobile network operators want to anticipate how many gigabytes a subscriber will consume next month so they can:
- Recommend the right data plan,
- Avoid frustrating heavy users with unexpected throttling, and
- Plan backbone‑capacity upgrades.
Using an open customer‑usage dataset that logs subscribers’ demographics and the last three months of traffic volumes, you’ll build a linear‑regression baseline that predicts NextMonthGB (gigabytes) from features such as:
- last‑month data, two‑months‑ago data, three‑months‑ago data,
- billing plan (pre‑paid / post‑paid),
- device category (feature phone / 4G smartphone / 5G smartphone/tablet),
- customer tenure, and
- roaming flag.
A transparent linear model exposes the first-order drivers of usage growth and gives marketing and network‑engineering teams a reference baseline to beat before they test tree-based or time-series models.
Libraries Required
- pandas # data wrangling
- numpy # numerical helpers
- matplotlib.pyplot # quick sanity plots (optional)
- scikit‑learn # preprocessing, model, metrics
- joblib # persist trained models
Dataset Link
Step-by-Step Code Implementation
1. Import Libraries
- pandas, numpy: for tabular data handling and numerical operations.
- matplotlib.pyplot: (optional) for quick plots & diagnostics
- Pipeline: to chain preprocessing and model training into a single object (avoids data leakage).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load & Peek at the Data
- Read the CSV file into a DataFrame named ‘df’.
- df.head() prints the top rows so you can verify you have the correct columns and no obvious load errors.
df = pd.read_csv("mobile_data_usage.csv")
print(df.head())
Typical columns
| Column | Example value |
| NextMonthGB | 12.4 (label) |
| GB_MonthMinus1 | 11.0 |
| GB_MonthMinus2 | 9.3 |
| GB_MonthMinus3 | 8.7 |
| PlanType | Prepaid / Postpaid |
| DeviceClass | 5G Smartphone / 4G Phone / Tablet |
| TenureMonths | 24 |
| RoamingFlag | 0 / 1 |
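If you do not have the CSV at hand, a stand-in frame with the same column names lets you run the remaining steps end to end. The sketch below is only an assumption about what such data might look like: the helper name make_synthetic_usage, the distributions, and the relationship between the columns are all invented for illustration and say nothing about real subscribers.

```python
import numpy as np
import pandas as pd

def make_synthetic_usage(n_rows=500, seed=0):
    """Build a made-up frame with the tutorial's column names (not real data)."""
    rng = np.random.default_rng(seed)
    m3 = rng.gamma(shape=2.0, scale=4.0, size=n_rows)            # usage three months ago
    m2 = np.clip(m3 + rng.normal(0.3, 1.0, n_rows), 0, None)     # slow upward drift
    m1 = np.clip(m2 + rng.normal(0.3, 1.0, n_rows), 0, None)
    device = rng.choice(["4G Phone", "5G Smartphone", "Tablet"], size=n_rows)
    # Invented relationship: mostly last month's usage, a 5G bump, and noise
    nxt = 0.8 * m1 + 0.1 * m2 + 2.0 * (device == "5G Smartphone") + rng.normal(0, 1.0, n_rows)
    return pd.DataFrame({
        "NextMonthGB": np.clip(nxt, 0, None).round(1),
        "GB_MonthMinus1": m1.round(1),
        "GB_MonthMinus2": m2.round(1),
        "GB_MonthMinus3": m3.round(1),
        "PlanType": rng.choice(["Prepaid", "Postpaid"], size=n_rows),
        "DeviceClass": device,
        "TenureMonths": rng.integers(1, 73, size=n_rows),   # 1 to 72 months
        "RoamingFlag": rng.integers(0, 2, size=n_rows),     # 0 or 1
    })

demo_df = make_synthetic_usage()
print(demo_df.head())
```

Assigning the result to df instead of calling pd.read_csv should let the later steps run unchanged, although the scores you get will reflect the invented relationship rather than real usage behaviour.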
3. Minimal Cleaning & Feature Set
- LinearRegression: the core model, fitted in the next steps, which learns a linear mapping from the features chosen here to NextMonthGB.
- dropna(subset=core): removes any row missing one of the key fields, so the model is never trained on incomplete records.
- num_cols vs. cat_cols: the numerical columns (the past three months’ usage and customer tenure) are continuous and will be scaled; the categorical columns (plan type, device class, roaming flag) are discrete and will be one‑hot encoded.
- X, y: the feature matrix and the target. X is used to predict y = NextMonthGB.
core = ['NextMonthGB', 'GB_MonthMinus1', 'GB_MonthMinus2', 'GB_MonthMinus3',
        'PlanType', 'DeviceClass', 'TenureMonths', 'RoamingFlag']
df = df.dropna(subset=core).copy()
num_cols = ['GB_MonthMinus1', 'GB_MonthMinus2', 'GB_MonthMinus3',
            'TenureMonths']
cat_cols = ['PlanType','DeviceClass','RoamingFlag']
target = 'NextMonthGB'
X = df[num_cols + cat_cols]
y = df[target]
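To see what dropna(subset=...) keeps and discards, here is a small sketch on a three-row toy frame. The values and the extra Notes column are hypothetical; the point is that a gap in a column outside the subset does not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NextMonthGB":    [12.4, np.nan, 7.1],
    "GB_MonthMinus1": [11.0, 9.5, np.nan],
    "Notes":          [None, "vip", None],   # hypothetical column, not part of the core list
})
core_toy = ["NextMonthGB", "GB_MonthMinus1"]

# Count missing values per core column before cleaning
print(toy[core_toy].isna().sum())

# Only rows with a gap in a core column are removed
cleaned = toy.dropna(subset=core_toy).copy()
print(f"kept {len(cleaned)} of {len(toy)} rows")
```

Checking the missing-value counts first is worthwhile on real data: if one column accounts for most of the dropped rows, imputing it may be a better choice than discarding those subscribers.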
4. Pre‑processing & Linear‑Regression Pipeline
- OneHotEncoder, StandardScaler: preprocess the categorical and numerical features, respectively.
- ColumnTransformer: applies a different transformation to each column group in one step, OneHotEncoder to cat_cols and StandardScaler to num_cols. handle_unknown='ignore' encodes a category unseen during training as all zeros instead of raising an error.
- Pipeline: bundles preprocessing and the LinearRegression model so that a single pipe.fit() call runs both steps in the proper order, with no leakage.
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline(steps=[
    ('prep', preproc),
    ('model', linreg)
])
5. Train‑test Split & Training
- train_test_split: holds out 20% of the rows as a test set for unbiased evaluation. random_state=42 makes the split reproducible, and shuffle=True (the default) shuffles rows before splitting so the file's row order does not bias the split.
- pipe.fit: fits the encoder and scaler on the training set (X_train) only, transforms it, and fits the linear model to y_train.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
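One way to convince yourself that the pipeline does not leak test data is to compare the statistics the scaler learned with those of the training rows. The following is a minimal sketch on invented single-feature data; X_demo, y_demo and demo_pipe are hypothetical names, and the 0.9 slope is made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"GB_MonthMinus1": rng.gamma(2.0, 4.0, size=200)})
y_demo = 0.9 * X_demo["GB_MonthMinus1"] + rng.normal(0, 1.0, size=200)  # invented relationship

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, shuffle=True)

demo_pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
demo_pipe.fit(Xtr, ytr)

# The scaler's mean should match the training rows, not the full dataset
learned_mean = demo_pipe.named_steps["scale"].mean_[0]
print(f"scaler mean    : {learned_mean:.3f}")
print(f"train-set mean : {Xtr['GB_MonthMinus1'].mean():.3f}")
print(f"full-data mean : {X_demo['GB_MonthMinus1'].mean():.3f}")
```

Had the scaler been fitted on the full frame before splitting, the first and third numbers would coincide instead, which is the leakage the pipeline is meant to prevent.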
6. Evaluation
- r2_score, mean_absolute_error: metrics to quantify how well predictions match actual usage.
- y_pred: model’s estimate of next‑month GB for each test sample.
- R²: proportion of variance in actual usage explained by the model (1.0 is perfect).
- MAE: average absolute error in GB, which is easy to interpret (e.g. “on average we’re off by 1.2 GB”).
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} GB")
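Both metrics are simple enough to compute by hand, which makes their meaning concrete. The sketch below uses four made-up actual and predicted values and checks the manual formulas against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted usage, in GB
y_true = np.array([10.0, 12.0, 8.0, 15.0])
y_hat = np.array([11.0, 11.5, 9.0, 13.0])

# MAE: mean of the absolute errors
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual={mae_manual:.3f}  sklearn={mean_absolute_error(y_true, y_hat):.3f}")
print(f"R²  manual={r2_manual:.3f}  sklearn={r2_score(y_true, y_hat):.3f}")
```

Because R² compares the model against always predicting the mean of the test labels, it can be negative when the model does worse than that constant guess.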
7. Interpret Usage Drivers
- get_feature_names_out: retrieves the new column names created by one‑hot encoding.
- We then build a Series of coefficients indexed by feature name, sort it, and examine:
- Negative coefficients → features that reduce predicted usage.
- Positive coefficients → features that increase predicted usage.
- Since numeric features were z‑scored, “1 unit” in the coefficient means a one σ change in the raw feature.
ohe_feats = (pipe.named_steps['prep']
             .named_transformers_['cat']
             .get_feature_names_out(cat_cols))
all_feats = list(ohe_feats) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
        .sort_values())
print("\nUsage‑reducing factors (negative coefficients):")
print(coef.head(6))
print("\nUsage‑boosting factors (positive coefficients):")
print(coef.tail(6))
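Because numeric inputs are z‑scored, each numeric coefficient reads as the GB change for a one σ shift in that feature. The one‑hot coefficients are only meaningful relative to each other: since the encoder drops no category, the gap between two levels of the same feature (say, two DeviceClass values) is the predicted GB difference between them. A single reference level exists only if the encoder is configured to drop one category, for example with OneHotEncoder(drop='first').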
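If you would rather quote a numeric effect per raw unit (GB per GB, or GB per month of tenure) than per σ, dividing each scaled coefficient by the standard deviation stored in the fitted StandardScaler should do it. The sketch below demonstrates this on invented data with two numeric features; the 0.85 and 0.02 slopes are made up so the recovered values can be compared against them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "GB_MonthMinus1": rng.gamma(2.0, 4.0, size=300),
    "TenureMonths": rng.integers(1, 73, size=300).astype(float),
})
# Invented ground truth: 0.85 GB per GB last month, 0.02 GB per month of tenure
y_demo = (0.85 * X_demo["GB_MonthMinus1"]
          + 0.02 * X_demo["TenureMonths"]
          + rng.normal(0, 0.5, size=300))

demo_pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

# Coefficients as fitted: GB change per one-sigma shift of each feature
per_sigma = pd.Series(demo_pipe.named_steps["model"].coef_, index=X_demo.columns)

# Divide by each feature's standard deviation to get GB change per raw unit
per_unit = per_sigma / demo_pipe.named_steps["scale"].scale_

print(pd.DataFrame({"per_sigma": per_sigma, "per_raw_unit": per_unit}).round(3))
```

In the full pipeline the scaler sits inside the ColumnTransformer, so it would be reached through pipe.named_steps['prep'].named_transformers_['num'] instead.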
8. Persist the Trained Model
- joblib.dump: serialises (“pickles”) the entire trained pipeline, preprocessing plus model, to disk, so it can be reloaded later with joblib.load.
joblib.dump(pipe, "data_plan_usage_linreg.pkl")
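Reloading the file and scoring a new subscriber might look like the sketch below. To stay self-contained it trains a tiny stand-in pipeline on six made-up rows and writes it to a temporary directory; the file name demo_linreg.pkl, the training values, and the new subscriber are all hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in training frame (made-up values)
train = pd.DataFrame({
    "GB_MonthMinus1": [11.0, 4.2, 20.5, 7.7, 15.1, 2.9],
    "PlanType": ["Postpaid", "Prepaid", "Postpaid", "Prepaid", "Postpaid", "Prepaid"],
    "NextMonthGB": [12.4, 4.0, 22.1, 8.3, 16.0, 3.1],
})
features = ["GB_MonthMinus1", "PlanType"]
demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["PlanType"]),
        ("num", StandardScaler(), ["GB_MonthMinus1"]),
    ])),
    ("model", LinearRegression()),
]).fit(train[features], train["NextMonthGB"])

# Round-trip the fitted pipeline through disk
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_linreg.pkl")
    joblib.dump(demo_pipe, path)
    reloaded = joblib.load(path)

# Raw, untransformed columns go in; the pipeline handles encoding and scaling
new_subscriber = pd.DataFrame([{"GB_MonthMinus1": 9.0, "PlanType": "Prepaid"}])
pred = reloaded.predict(new_subscriber)[0]
print(f"Predicted next-month usage: {pred:.1f} GB")
```

Since joblib files are pickles, only load ones you created or trust, and reload them with a scikit-learn version compatible with the one used for training.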
Summary
With fewer than 100 lines of Python, you now have an explainable data‑plan usage forecaster:
- Instant GB predictions for right‑sizing plans and targeting upsell.
- Clear levers – e.g., “5G device adds ~3.8 GB next month,” or, once a z‑scored coefficient is divided by that feature's standard deviation, “every extra GB last month predicts +0.85 GB next month.”
Keep the .pkl handy; when you graduate to time‑series or gradient‑boosted models, compare their MAE against this simple, fully transparent baseline.