Data Plan Usage Prediction using Linear Regression in ML

Mobile network operators want to anticipate how many gigabytes a subscriber will consume next month so they can:

  • Recommend the right data plan,
  • Avoid throttling heavy users and leaving them unhappy, and
  • Plan backbone‑capacity upgrades.

Using an open customer‑usage dataset that logs subscribers’ demographics and the last three months of traffic volumes, you’ll build a linear‑regression baseline that predicts NextMonthGB (gigabytes) from features such as:

  • last‑month data, two‑months‑ago data, three‑months‑ago data,
  • billing plan (pre‑paid / post‑paid),
  • device category (feature phone / 4G smartphone / 5G smartphone / tablet),
  • customer tenure, and
  • roaming flag.

A transparent linear model exposes the first-order drivers of usage growth and gives marketing and network‑engineering teams a dependable baseline to beat before they test tree-based or time-series models.

Libraries Required

  • pandas # data wrangling
  • numpy # numerical helpers
  • matplotlib.pyplot # quick sanity plots (optional)
  • scikit‑learn # preprocessing, model, metrics
  • joblib # persist trained models

Dataset Link

Customer Cellular Data

Step-by-Step Code Implementation

1. Import Libraries

  • pandas, numpy: for tabular data handling and numerical operations.
  • matplotlib.pyplot: (optional) for quick plots & diagnostics
  • Pipeline: to chain preprocessing and model training into a single object (avoids data leakage).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load & Peek at the Data

  • Read the CSV file into a DataFrame named ‘df’.
  • df.head() prints the top rows so you can verify you have the correct columns and no obvious load errors.
df = pd.read_csv("mobile_data_usage.csv")
print(df.head())

Typical columns

Column            Example value
NextMonthGB       12.4 (label)
GB_MonthMinus1    11.0
GB_MonthMinus2    9.3
GB_MonthMinus3    8.7
PlanType          Prepaid / Postpaid
DeviceClass       5G Smartphone / 4G Phone / Tablet
TenureMonths      24
RoamingFlag       0 / 1
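
Before going further it can help to confirm that the loaded frame really has these columns. The following is a minimal sketch, not part of the original walkthrough: check_schema is a hypothetical helper, and the demo rows are made up so the block runs without the CSV.

```python
import pandas as pd

# Column names taken from the "Typical columns" table above.
EXPECTED_COLS = ['NextMonthGB', 'GB_MonthMinus1', 'GB_MonthMinus2',
                 'GB_MonthMinus3', 'PlanType', 'DeviceClass',
                 'TenureMonths', 'RoamingFlag']

def check_schema(frame, expected=EXPECTED_COLS):
    """Return (missing column names, null counts for the columns present)."""
    missing = [c for c in expected if c not in frame.columns]
    present = [c for c in expected if c in frame.columns]
    null_counts = frame[present].isna().sum()
    return missing, null_counts

# Made-up rows standing in for the real CSV (illustrative values only).
# RoamingFlag is left out on purpose to show what a failed check looks like.
demo = pd.DataFrame({
    'NextMonthGB':    [12.4, 6.1, None],
    'GB_MonthMinus1': [11.0, 5.8, 20.3],
    'GB_MonthMinus2': [9.3, 6.0, 18.9],
    'GB_MonthMinus3': [8.7, 5.5, 17.2],
    'PlanType':       ['Postpaid', 'Prepaid', 'Postpaid'],
    'DeviceClass':    ['5G Smartphone', '4G Phone', 'Tablet'],
    'TenureMonths':   [24, 3, 60],
})

missing, null_counts = check_schema(demo)
print("Missing columns:", missing)
print("Columns with nulls:")
print(null_counts[null_counts > 0])
```

On the real data you would call check_schema(df) right after pd.read_csv and stop if the missing list is not empty.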

3. Minimal Cleaning & Feature Set

  • LinearRegression: the core model learning a linear mapping from features to NextMonthGB.
  • dropna(subset=core): removes any rows missing one of the key fields. This ensures we don’t accidentally train on incomplete records.
  • num_cols vs. cat_cols: the numerical features are the past three months’ usage and customer tenure (continuous, to be scaled); the categorical features are plan type, device class, and roaming flag (discrete, to be one‑hot encoded).
  • X, y: split features and target. We’ll use X to predict y = NextMonthGB.
core = ['NextMonthGB','GB_MonthMinus1','GB_MonthMinus2','GB_MonthMinus3',
        'PlanType','DeviceClass','TenureMonths','RoamingFlag']
df   = df.dropna(subset=core).copy()

num_cols = ['GB_MonthMinus1','GB_MonthMinus2','GB_MonthMinus3',
            'TenureMonths']
cat_cols = ['PlanType','DeviceClass','RoamingFlag']
target   = 'NextMonthGB'

X = df[num_cols + cat_cols]
y = df[target]
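
To see what dropna(subset=...) does and does not remove, here is a small sketch on made-up rows. The Notes column and the shortened core_demo list are hypothetical and only serve the illustration: rows are dropped when a listed column is missing, while gaps in other columns are ignored.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'Notes' is a hypothetical extra column outside the core list.
demo = pd.DataFrame({
    'NextMonthGB':    [12.4, 6.1, np.nan, 9.0],
    'GB_MonthMinus1': [11.0, np.nan, 20.3, 8.5],
    'Notes':          [None, 'ok', 'ok', None],
})
core_demo = ['NextMonthGB', 'GB_MonthMinus1']

# Only missing values in the listed columns cause a row to be dropped.
clean = demo.dropna(subset=core_demo).copy()

print(f"Kept {len(clean)} of {len(demo)} rows")
print(clean)
```

Printing the kept/dropped counts on the real df is a cheap way to notice when cleaning silently discards a large share of subscribers.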

4. Pre‑processing & Linear‑Regression Pipeline

  • OneHotEncoder, StandardScaler: to preprocess categorical and numerical features, respectively.
  • ColumnTransformer: applies different preprocessing to different column groups in one step: OneHotEncoder on cat_cols and StandardScaler on num_cols.
  • Pipeline: bundles preprocessing and the LinearRegression model into one object, so a single call to pipe.fit() runs both steps in the proper order with no leakage.
preproc = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline(steps=[
        ('prep',  preproc),
        ('model', linreg)
])

5. Train‑test Split & Training

  • train_test_split: holds out 20% of the rows as a test set for unbiased evaluation.
  • random_state=42 makes the split reproducible; shuffle=True (the default) shuffles the rows before splitting so that any ordering in the file does not bias the split.
  • pipe.fit: internally fits the scaler and encoder on the training set only (X_train), transforms it, and fits the linear model to y_train.
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
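
A single split can be lucky or unlucky. One optional check, sketched here on synthetic data rather than the real CSV, is k-fold cross-validation: because the scaler and encoder live inside the pipeline, they are refit on each training fold, so no statistics leak from the validation fold. The data-generating numbers and the reduced feature set are assumptions for the demo.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data (assumed relationship, not real).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'GB_MonthMinus1': rng.gamma(4.0, 2.5, n),
    'TenureMonths':   rng.integers(1, 72, n),
    'PlanType':       rng.choice(['Prepaid', 'Postpaid'], n),
})
y_demo = 0.9 * X_demo['GB_MonthMinus1'] + 1.0 + rng.normal(0, 1.0, n)

pipe_demo = Pipeline([
    ('prep', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['PlanType']),
        ('num', StandardScaler(), ['GB_MonthMinus1', 'TenureMonths']),
    ])),
    ('model', LinearRegression()),
])

# scikit-learn maximises scores, so MAE is reported as a negative number.
scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=5,
                         scoring='neg_mean_absolute_error')
mae_per_fold = -scores
print(f"CV MAE: {mae_per_fold.mean():.2f} ± {mae_per_fold.std():.2f} GB")
```

With the tutorial's objects, the equivalent call would be cross_val_score(pipe, X_train, y_train, ...).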

6. Evaluation

  • r2_score, mean_absolute_error: metrics to quantify how well predictions match actual usage.
  • y_pred: model’s estimate of next‑month GB for each test sample.
  • R²: proportion of variance in actual usage explained by the model (1.0 is perfect).
  • MAE: average absolute error in GB—easy to interpret (e.g. “on average we’re off by 1.2 GB”).
y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} GB")
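
For readers who want to see what the two metrics compute, here is a small worked example on four made-up values, checked against scikit-learn. The numbers are chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Four made-up subscribers: actual vs. predicted next-month GB.
y_true = np.array([10.0, 12.0, 8.0, 14.0])
y_hat = np.array([11.0, 11.0, 9.0, 13.0])

# MAE: mean of the absolute errors (every error here is exactly 1 GB).
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)            # 1 + 1 + 1 + 1 = 4
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 1 + 1 + 9 + 9 = 20
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, r2_manual)  # 1.0 0.8
print(mean_absolute_error(y_true, y_hat), r2_score(y_true, y_hat))
```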

7. Interpret Usage Drivers

  • get_feature_names_out: retrieves the new column names created by one‑hot encoding.
  • We then build a Series of coefficients indexed by feature name, sort it, and examine:
    • Negative coefficients → features that reduce predicted usage.
    • Positive coefficients → features that increase predicted usage.
  • Since numeric features were z‑scored, each numeric coefficient is the change in predicted GB for a one‑standard‑deviation (1 σ) increase in the raw feature.
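# Names of the dummy columns created by the fitted one-hot encoder
ohe_feats = (pipe.named_steps['prep']
                 .named_transformers_['cat']
                 .get_feature_names_out(cat_cols))

# Order must match the ColumnTransformer: categorical block first, then numeric
all_feats = list(ohe_feats) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
        .sort_values())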

print("\nUsage‑reducing factors (negative coefficients):")
print(coef.head(6))

print("\nUsage‑boosting factors (positive coefficients):")
print(coef.tail(6))

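Because numeric inputs are z‑scored, each numeric coefficient reads as the GB change for a one σ shift. The encoder above keeps every category level (no level is dropped), so one‑hot coefficients have no fixed reference level: read the difference between two levels of the same feature, e.g. the 5G‑smartphone coefficient minus the 4G‑phone coefficient, as the GB gap between them.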

8. Persist the Trained Model

  • joblib: for saving (“pickling”) the trained pipeline to disk.
  • joblib.dump serialises the entire pipeline (preprocessing + model) to disk.
joblib.dump(pipe, "data_plan_usage_linreg.pkl")
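
Loading the file back and scoring a new subscriber could look like the sketch below. It trains a tiny stand-in pipeline on made-up rows and writes it to a temporary directory so the block runs on its own; the file name demo_linreg.pkl and the two-feature setup are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in for the trained pipeline (made-up rows, two features only).
train = pd.DataFrame({
    'GB_MonthMinus1': [11.0, 5.8, 20.3, 2.1, 7.7, 15.2],
    'PlanType':       ['Postpaid', 'Prepaid', 'Postpaid',
                       'Prepaid', 'Prepaid', 'Postpaid'],
    'NextMonthGB':    [12.4, 6.1, 21.0, 2.5, 8.0, 16.1],
})
features = ['GB_MonthMinus1', 'PlanType']
pipe_demo = Pipeline([
    ('prep', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['PlanType']),
        ('num', StandardScaler(), ['GB_MonthMinus1']),
    ])),
    ('model', LinearRegression()),
]).fit(train[features], train['NextMonthGB'])

# Round-trip through disk: dump, then load as a fresh object.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_linreg.pkl")
    joblib.dump(pipe_demo, path)
    loaded = joblib.load(path)

# The loaded pipeline expects raw columns; it scales and encodes internally.
new_subscriber = pd.DataFrame([{'GB_MonthMinus1': 9.5, 'PlanType': 'Prepaid'}])
pred = loaded.predict(new_subscriber)[0]
print(f"Predicted next-month usage: {pred:.1f} GB")
```

Pickle files can execute code when loaded, so only load files you created or trust, and load them with the same scikit-learn version used for training.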

Summary

With fewer than 100 lines of Python, you now have an explainable data‑plan usage forecaster:

  • Instant GB predictions for right‑sizing plans and targeting upsell.
  • Clear levers, e.g. “a 5G device adds ~3.8 GB next month compared with a 4G phone,” or, once the z‑scored coefficient is converted back to raw units, “every extra GB last month predicts +0.85 GB next month.”

Keep the .pkl handy; when you graduate to time‑series or gradient‑boosted models, compare their MAE against this simple, fully transparent baseline.
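
Such a comparison could be sketched as follows, with both models scored on the same held-out rows. The data here is synthetic and purely linear by construction, so it says nothing about which model would win on real usage data; GradientBoostingRegressor is just one possible challenger.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic three-month usage history (assumed relationship, not real data).
rng = np.random.default_rng(42)
n = 400
X_demo = rng.gamma(4.0, 2.5, size=(n, 3))
y_demo = (0.6 * X_demo[:, 0] + 0.25 * X_demo[:, 1] + 0.1 * X_demo[:, 2]
          + rng.normal(0, 1.0, n))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Fit each candidate on the same training rows, score on the same test rows.
results = {}
for name, model in [
    ('linear baseline', LinearRegression()),
    ('gradient boosting', GradientBoostingRegressor(random_state=42)),
]:
    model.fit(X_tr, y_tr)
    results[name] = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name:18s} MAE = {results[name]:.2f} GB")
```

A challenger is only worth its extra complexity if it beats the baseline's MAE by a margin that matters for plan recommendations.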

ProjectGurukul Team
