Data Plan Usage Prediction using Linear Regression in ML

Mobile network operators want to anticipate how many gigabytes a subscriber will consume next month so they can:

Recommend the right data plan,
Avoid throttling heavy users and leaving them unhappy, and
Plan backbone‑capacity upgrades.

Using an open customer‑usage dataset that logs subscribers’ demographics and the last three months of traffic volumes, you’ll build a linear‑regression baseline that predicts NextMonthGB (gigabytes) from features such as:

last‑month data, two‑months‑ago data, three‑months‑ago data,
billing plan (pre‑paid / post‑paid),
device category (feature phone / 4G smartphone / 5G smartphone / tablet),
customer tenure, and
roaming flag.

A transparent linear model exposes the first-order drivers of usage growth and gives marketing and network‑engineering teams a reference baseline to beat before they test tree-based or time-series models.

Libraries Required

pandas # data wrangling
numpy # numerical helpers
matplotlib.pyplot # quick sanity plots (optional)
scikit‑learn # preprocessing, model, metrics
joblib # persist trained models

Dataset Link

Customer Cellular Data

Step-by-Step Code Implementation

1. Import Libraries

pandas, numpy: for tabular data handling and numerical operations.
matplotlib.pyplot: (optional) for quick plots & diagnostics
Pipeline: to chain preprocessing and model training into a single object (avoids data leakage).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load & Peek at the Data

Read the CSV file into a DataFrame named ‘df’.
df.head() prints the top rows so you can verify you have the correct columns and no obvious load errors.

df = pd.read_csv("mobile_data_usage.csv")
print(df.head())

Typical columns

Column	Example value
NextMonthGB	12.4 (target label)
GB_MonthMinus1	11.0
GB_MonthMinus2	9.3
GB_MonthMinus3	8.7
PlanType	Prepaid / Postpaid
DeviceClass	5G Smartphone / 4G Phone / Tablet
TenureMonths	24
RoamingFlag	0 / 1
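
Before cleaning, it can help to confirm that the loaded file really contains the columns listed above and to see where values are missing. The following is a minimal sketch: the helper name check_columns and the two-row sample frame are illustrative assumptions standing in for the real CSV.

```python
import pandas as pd

# Column names taken from the table above.
EXPECTED_COLUMNS = [
    "NextMonthGB", "GB_MonthMinus1", "GB_MonthMinus2", "GB_MonthMinus3",
    "PlanType", "DeviceClass", "TenureMonths", "RoamingFlag",
]

def check_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from the frame (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]

# Made-up rows for illustration only; in practice pass the DataFrame read from the CSV.
sample = pd.DataFrame({
    "NextMonthGB": [12.4, 3.1],
    "GB_MonthMinus1": [11.0, 2.9],
    "GB_MonthMinus2": [9.3, None],   # one deliberately missing value
    "GB_MonthMinus3": [8.7, 3.3],
    "PlanType": ["Postpaid", "Prepaid"],
    "DeviceClass": ["5G Smartphone", "4G Phone"],
    "TenureMonths": [24, 6],
    "RoamingFlag": [0, 1],
})

missing = check_columns(sample)
null_counts = sample.isna().sum()   # number of missing values per column

print("Missing columns:", missing)
print(null_counts)
```

An empty list of missing columns means the names used in the later steps will resolve; the per-column null counts show how many rows the cleaning step is likely to drop.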

3. Minimal Cleaning & Feature Set

LinearRegression: the core model, fitted in a later step, which learns a linear mapping from the features to NextMonthGB.
dropna(subset=core): removes any rows missing one of the key fields, so we don't accidentally train on incomplete records.
num_cols vs. cat_cols: numerical features (the past three months' usage and customer tenure) are continuous and will be scaled; categorical features (plan type, device class, roaming flag) are discrete and will be one‑hot encoded.
X, y: split features and target. We'll use X to predict y = NextMonthGB.

core = ['NextMonthGB','GB_MonthMinus1','GB_MonthMinus2','GB_MonthMinus3',
        'PlanType','DeviceClass','TenureMonths','RoamingFlag']
df   = df.dropna(subset=core).copy()

num_cols = ['GB_MonthMinus1','GB_MonthMinus2','GB_MonthMinus3',
            'TenureMonths']
cat_cols = ['PlanType','DeviceClass','RoamingFlag']
target   = 'NextMonthGB'

X = df[num_cols + cat_cols]
y = df[target]

4. Pre‑processing & Linear‑Regression Pipeline

OneHotEncoder, StandardScaler: preprocess categorical and numerical features, respectively. With handle_unknown='ignore', a category that was not seen during training is encoded as all zeros instead of raising an error.
ColumnTransformer: applies one set of transformations to cat_cols (OneHotEncoder) and another to num_cols (StandardScaler) in a single step.
Pipeline: bundles preprocessing and the LinearRegression model, so a single pipe.fit() call runs both steps in the proper order with no leakage.

preproc = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline(steps=[
        ('prep',  preproc),
        ('model', linreg)
])

5. Train‑test Split & Training

train_test_split: holds out a test set for unbiased evaluation. random_state=42 makes the split reproducible; shuffle=True shuffles the rows before splitting so that any ordering in the file (for example, by signup date or region) does not skew the two sets.
pipe.fit: internally fits the scaler and encoder on the training set (X_train) only, transforms it, and fits the linear model to y_train.

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
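
The "no leakage" point can be checked directly: after fitting, the scaler inside the pipeline should hold statistics of the training rows only. This is a sketch on synthetic data; the generated usage values and the names demo_pipe, usage, and next_month are assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one usage feature and the target.
rng = np.random.default_rng(0)
usage = pd.DataFrame({"GB_MonthMinus1": rng.gamma(shape=2.0, scale=4.0, size=200)})
next_month = 0.9 * usage["GB_MonthMinus1"] + rng.normal(0.0, 1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    usage, next_month, test_size=0.2, random_state=42, shuffle=True)

demo_pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
demo_pipe.fit(X_tr, y_tr)

# The scaler was fitted inside the pipeline, so it only saw the training rows.
learned_mean = demo_pipe.named_steps["scale"].mean_[0]
train_mean = X_tr["GB_MonthMinus1"].mean()
full_mean = usage["GB_MonthMinus1"].mean()

print(f"mean learned by scaler : {learned_mean:.4f}")
print(f"mean of training rows  : {train_mean:.4f}")
print(f"mean of all rows       : {full_mean:.4f}")
```

The first two numbers match, while the mean over all rows generally differs slightly. Scaling the full table before splitting would have let test-set information into the preprocessing.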

6. Evaluation

r2_score, mean_absolute_error: metrics to quantify how well predictions match actual usage.
y_pred: model’s estimate of next‑month GB for each test sample.
R²: proportion of variance in actual usage explained by the model (1.0 is perfect).
MAE: average absolute error in GB, which is easy to interpret (e.g. “on average we’re off by 1.2 GB”).

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} GB")
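
An MAE figure is easier to judge next to a naive reference. One common choice, used here as an assumption rather than something the article prescribes, is a persistence forecast that predicts next month's usage as equal to last month's. The sketch below uses synthetic data, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic usage history (illustrative only).
rng = np.random.default_rng(1)
n = 300
month2 = rng.gamma(shape=2.0, scale=4.0, size=n)
month1 = 0.8 * month2 + 2.0 + rng.normal(0.0, 1.0, size=n)
target = 0.7 * month1 + 0.2 * month2 + 1.5 + rng.normal(0.0, 1.0, size=n)

demo = pd.DataFrame({
    "GB_MonthMinus1": month1,
    "GB_MonthMinus2": month2,
    "NextMonthGB": target,
})

X_demo = demo[["GB_MonthMinus1", "GB_MonthMinus2"]]
y_demo = demo["NextMonthGB"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)

# Error of the fitted model versus "next month equals last month".
mae_model = mean_absolute_error(y_te, model.predict(X_te))
mae_naive = mean_absolute_error(y_te, X_te["GB_MonthMinus1"])

print(f"MAE linear regression : {mae_model:.2f} GB")
print(f"MAE persistence       : {mae_naive:.2f} GB")
```

If the regression does not clearly beat the persistence forecast on the held-out rows, the extra features are adding little beyond last month's usage.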

7. Interpret Usage Drivers

get_feature_names_out: retrieves the new column names created by one‑hot encoding.
We then build a Series of coefficients indexed by feature name, sort it, and examine:
- Negative coefficients → features that reduce predicted usage.
- Positive coefficients → features that increase predicted usage.
Since numeric features were z‑scored, each numeric coefficient is the change in predicted GB for a one-standard-deviation (1 σ) increase in the raw feature.

ohe_feats = pipe.named_steps['prep']\
                .named_transformers_['cat']\
                .get_feature_names_out(cat_cols)

all_feats = list(ohe_feats) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
        .sort_values())

print("\nUsage‑reducing factors (negative coefficients):")
print(coef.head(6))

print("\nUsage‑boosting factors (positive coefficients):")
print(coef.tail(6))

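Because numeric inputs are z‑scored, each numeric coefficient reads as the GB change for a one σ shift. The encoder here keeps every level of each categorical feature (no category is dropped), so there is no single reference level: read one‑hot coefficients by comparing levels of the same feature, because only the differences between them are identified.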
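
To express a standardized coefficient per raw unit (for example GB per GB of last month's usage), divide it by the standard deviation the scaler learned, which scikit-learn stores in scale_. The sketch below checks this on synthetic data against a model fitted without scaling; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features: last month's GB and tenure in months (illustrative only).
rng = np.random.default_rng(2)
X_raw = np.column_stack([
    rng.gamma(shape=2.0, scale=4.0, size=500),
    rng.integers(1, 60, size=500),
])
y_raw = 0.8 * X_raw[:, 0] + 0.05 * X_raw[:, 1] + rng.normal(0.0, 1.0, size=500)

scaled_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X_raw, y_raw)

coef_per_sigma = scaled_pipe.named_steps["model"].coef_   # GB per 1 sigma
sigma = scaled_pipe.named_steps["scale"].scale_           # per-feature std dev
coef_per_unit = coef_per_sigma / sigma                    # GB per raw unit

# Reference: the same model fitted on unscaled features.
unscaled = LinearRegression().fit(X_raw, y_raw)

print("per sigma :", coef_per_sigma)
print("per unit  :", coef_per_unit)
print("unscaled  :", unscaled.coef_)
```

The per-unit values agree with the unscaled fit, which confirms that scaling changes only the units in which a linear model's coefficients are reported, not its predictions.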

8. Persist the Trained Model

joblib: saves (“pickles”) the trained pipeline to disk.
joblib.dump serialises the entire pipeline (preprocessing + model), so the reloaded object accepts raw feature columns directly.

joblib.dump(pipe, "data_plan_usage_linreg.pkl")
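
Reloading works the same way in reverse with joblib.load. The sketch below trains a tiny stand-in pipeline on made-up rows, saves it to a temporary directory, and scores one hypothetical new subscriber; the file name demo_linreg.pkl and all values are assumptions for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows using a subset of the article's columns.
train = pd.DataFrame({
    "GB_MonthMinus1": [11.0, 2.9, 6.5, 20.1, 4.4, 9.8],
    "TenureMonths": [24, 6, 12, 40, 3, 18],
    "PlanType": ["Postpaid", "Prepaid", "Prepaid", "Postpaid", "Prepaid", "Postpaid"],
    "NextMonthGB": [12.4, 3.1, 6.9, 21.5, 4.0, 10.6],
})
features = ["GB_MonthMinus1", "TenureMonths", "PlanType"]

small_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["PlanType"]),
        ("num", StandardScaler(), ["GB_MonthMinus1", "TenureMonths"]),
    ])),
    ("model", LinearRegression()),
]).fit(train[features], train["NextMonthGB"])

# Save to a temporary directory, then load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_linreg.pkl")
    joblib.dump(small_pipe, path)
    restored = joblib.load(path)

# A hypothetical new subscriber, given as raw (unscaled, unencoded) values.
new_subscriber = pd.DataFrame([
    {"GB_MonthMinus1": 8.0, "TenureMonths": 10, "PlanType": "Prepaid"},
])
pred_original = small_pipe.predict(new_subscriber)
pred_restored = restored.predict(new_subscriber)

print(f"Predicted next-month usage: {pred_restored[0]:.2f} GB")
```

Because the preprocessing travels with the model, the caller passes raw column values and the restored pipeline applies the same encoding and scaling that were learned at training time.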

Summary

With fewer than 100 lines of Python, you now have an explainable data‑plan usage forecaster:

Instant GB predictions for right‑sizing plans and targeting upsell.
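Clear levers, e.g. statements such as “5G device adds ~3.8 GB next month” or, once a standardized coefficient is divided by the feature's standard deviation, “every extra GB last month predicts +0.85 GB next month” (figures are illustrative).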

Keep the .pkl handy; when you graduate to time‑series or gradient‑boosted models, compare their MAE against this simple, fully transparent baseline.
