Data Plan Usage Prediction using Linear Regression in ML

Mobile network operators want to anticipate how many gigabytes a subscriber will consume next month so they can:

  • Recommend the right data plan,
  • Avoid throttling heavy users, which leaves them unhappy, and
  • Plan backbone‑capacity upgrades.

Using an open customer‑usage dataset that logs subscribers’ demographics and the last three months of traffic volumes, you’ll build a linear‑regression baseline that predicts NextMonthGB (gigabytes) from features such as:

  • last‑month data, two‑months‑ago data, three‑months‑ago data,
  • billing plan (pre‑paid / post‑paid),
  • device category (feature phone / 4G smartphone / 5G smartphone / tablet),
  • customer tenure, and
  • roaming flag.

A transparent linear model exposes the first-order drivers of usage growth and gives marketing and network-engineering teams a simple baseline to beat before they test tree-based or time-series models.

Libraries Required

  • pandas # data wrangling
  • numpy # numerical helpers
  • matplotlib.pyplot # quick sanity plots (optional)
  • scikit‑learn # preprocessing, model, metrics
  • joblib # persist trained models

Dataset Link

Customer Cellular Data

Step-by-Step Code Implementation

1. Import Libraries

  • pandas, numpy: for tabular data handling and numerical operations.
  • matplotlib.pyplot: (optional) for quick plots & diagnostics
  • Pipeline: to chain preprocessing and model training into a single object (avoids data leakage).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load & Peek at the Data

  • Read the CSV file into a DataFrame named ‘df’.
  • df.head() prints the top rows so you can verify you have the correct columns and no obvious load errors.
df = pd.read_csv("mobile_data_usage.csv")
print(df.head())

Typical columns

Column            Example value
NextMonthGB       12.4 (label / target)
GB_MonthMinus1    11.0
GB_MonthMinus2    9.3
GB_MonthMinus3    8.7
PlanType          Prepaid / Postpaid
DeviceClass       5G Smartphone / 4G Phone / Tablet
TenureMonths      24
RoamingFlag       0 / 1
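
If the CSV is not at hand, a synthetic stand-in with the same column names is enough to run the remaining steps end to end. The sketch below is only that: the function name make_synthetic_usage, the distributions, and the built-in "5G bump" are all assumptions made up for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

def make_synthetic_usage(n_rows=500, seed=0):
    """Build a fake usage table with the tutorial's column names (hypothetical data)."""
    rng = np.random.default_rng(seed)
    device = rng.choice(["5G Smartphone", "4G Phone", "Tablet"], size=n_rows)
    plan = rng.choice(["Prepaid", "Postpaid"], size=n_rows)

    # Usage drifts upward month over month, never below zero
    m3 = rng.gamma(shape=4.0, scale=2.0, size=n_rows)
    m2 = np.clip(m3 + rng.normal(0.3, 1.0, n_rows), 0, None)
    m1 = np.clip(m2 + rng.normal(0.3, 1.0, n_rows), 0, None)

    # Made-up relationship: mostly last month's usage plus a 5G bump and noise
    bump = np.where(device == "5G Smartphone", 2.0, 0.0)
    nxt = np.clip(0.8 * m1 + 0.1 * m2 + bump + rng.normal(0, 1.0, n_rows), 0, None)

    return pd.DataFrame({
        "NextMonthGB": nxt.round(1),
        "GB_MonthMinus1": m1.round(1),
        "GB_MonthMinus2": m2.round(1),
        "GB_MonthMinus3": m3.round(1),
        "PlanType": plan,
        "DeviceClass": device,
        "TenureMonths": rng.integers(1, 73, size=n_rows),
        "RoamingFlag": rng.integers(0, 2, size=n_rows),
    })

demo_df = make_synthetic_usage()
print(demo_df.head())
```

Assigning the result to df instead of reading the CSV lets the later steps run unchanged, although any metrics then describe the invented data only.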

3. Minimal Cleaning & Feature Set

  • LinearRegression: the core model, which learns a linear mapping from the features selected here to NextMonthGB.
  • dropna(subset=core): removes any row that is missing one of the key fields, so the model is never trained on incomplete records.
  • num_cols vs. cat_cols:
    • Numerical: the past three months' usage and customer tenure (continuous, to be scaled).
    • Categorical: plan type, device class, and roaming flag (discrete, to be one-hot encoded).
  • X, y: the features and the target. X is used to predict y = NextMonthGB.
core = ['NextMonthGB','GB_MonthMinus1','GB_MonthMinus2','GB_MonthMinus3',
        'PlanType','DeviceClass','TenureMonths','RoamingFlag']
df   = df.dropna(subset=core).copy()

num_cols = ['GB_MonthMinus1','GB_MonthMinus2','GB_MonthMinus3',
            'TenureMonths']
cat_cols = ['PlanType','DeviceClass','RoamingFlag']
target   = 'NextMonthGB'

X = df[num_cols + cat_cols]
y = df[target]
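
Before dropping rows it can be worth checking how much data the dropna call will discard. A minimal sketch on a tiny hand-made frame follows; the values and the names toy and core_toy are invented for illustration.

```python
import numpy as np
import pandas as pd

# Four made-up subscribers, two of them with gaps
toy = pd.DataFrame({
    "NextMonthGB":    [12.4, 7.9, np.nan, 15.2],
    "GB_MonthMinus1": [11.0, 8.1, 5.5, np.nan],
    "PlanType":       ["Prepaid", "Postpaid", None, "Postpaid"],
    "RoamingFlag":    [0, 1, 0, 1],
})
core_toy = list(toy.columns)

# Count missing values per column before removing anything
missing_per_column = toy[core_toy].isna().sum()

rows_before = len(toy)
clean = toy.dropna(subset=core_toy).copy()
rows_dropped = rows_before - len(clean)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {rows_before} rows")  # Dropped 2 of 4 rows
```

If a large share of rows disappears, imputing the missing values may be a better choice than dropping them; that is a judgment call this sketch does not make for you.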

4. Pre‑processing & Linear‑Regression Pipeline

  • OneHotEncoder, StandardScaler: preprocess the categorical and numerical features, respectively.
  • ColumnTransformer: applies a different transformation to each column group in one step, here OneHotEncoder to cat_cols and StandardScaler to num_cols.
  • Pipeline: bundles the preprocessing and the LinearRegression model, so a single pipe.fit() call runs both steps in the proper order with no leakage.
preproc = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline(steps=[
        ('prep',  preproc),
        ('model', linreg)
])

5. Train‑test Split & Training

  • train_test_split: holds out a test set for unbiased evaluation. random_state=42 makes the split reproducible, and shuffle=True (the default) randomizes row order before splitting so the test set is not skewed by how the file happens to be sorted.
  • pipe.fit: internally fits the scaler and encoder on the training set (X_train), transforms it, and fits the linear model to y_train.
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
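
A single 80/20 split can be lucky or unlucky. Because the preprocessing lives inside the pipeline, cross-validation refits the scaler and encoder within each fold, which keeps the no-leakage property. The sketch below is one possible way to do this, shown on invented data with a reduced feature set; X_demo, y_demo, and demo_pipe are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented data: next month's usage is roughly 0.9 x last month's plus noise
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "GB_MonthMinus1": rng.gamma(4.0, 2.0, n),
    "PlanType": rng.choice(["Prepaid", "Postpaid"], n),
})
y_demo = 0.9 * X_demo["GB_MonthMinus1"] + rng.normal(0, 1.0, n)

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["PlanType"]),
        ("num", StandardScaler(), ["GB_MonthMinus1"]),
    ])),
    ("model", LinearRegression()),
])

# scikit-learn reports errors as negative scores, so flip the sign
cv = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mae = cross_val_score(demo_pipe, X_demo, y_demo, cv=cv,
                          scoring="neg_mean_absolute_error")
mae_per_fold = -neg_mae

print("MAE per fold:", mae_per_fold.round(2))
print(f"Mean MAE: {mae_per_fold.mean():.2f} GB")
```

A wide spread across folds suggests the single-split score should be read with caution.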

6. Evaluation

  • r2_score, mean_absolute_error: metrics to quantify how well predictions match actual usage.
  • y_pred: model’s estimate of next‑month GB for each test sample.
  • R²: proportion of variance in actual usage explained by the model (1.0 is perfect).
  • MAE: average absolute error in GB—easy to interpret (e.g. “on average we’re off by 1.2 GB”).
y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.2f} GB")
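
To make the two metrics concrete, here is the arithmetic by hand on four made-up test samples, together with a naive "next month equals last month" comparison. All numbers are invented for illustration.

```python
import numpy as np

y_true     = np.array([10.0, 12.0, 8.0, 14.0])   # actual next-month GB (made up)
y_hat      = np.array([11.0, 11.0, 9.0, 13.0])   # model predictions (made up)
last_month = np.array([8.0, 12.0, 9.0, 11.0])    # naive forecast: repeat last month

# MAE: mean of the absolute errors
mae = np.mean(np.abs(y_true - y_hat))

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

naive_mae = np.mean(np.abs(y_true - last_month))

print(f"MAE       : {mae:.2f} GB")        # 1.00 GB
print(f"R^2       : {r2:.3f}")            # 0.800
print(f"Naive MAE : {naive_mae:.2f} GB")  # 1.50 GB
```

Comparing against the naive forecast is a useful habit: a regression that cannot beat "same as last month" is not adding much.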

7. Interpret Usage Drivers

  • get_feature_names_out: retrieves the new column names created by one‑hot encoding.
  • We then build a Series of coefficients indexed by feature name, sort it, and examine:
    • Negative coefficients → features that reduce predicted usage.
    • Positive coefficients → features that increase predicted usage.
  • Since numeric features were z‑scored, “1 unit” in the coefficient means a one σ change in the raw feature.
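# Names of the one-hot columns produced by the fitted encoder
ohe_feats = (pipe.named_steps['prep']
                 .named_transformers_['cat']
                 .get_feature_names_out(cat_cols))

# Order must match the ColumnTransformer: 'cat' columns first, then 'num'
all_feats = list(ohe_feats) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
        .sort_values())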

print("\nUsage‑reducing factors (negative coefficients):")
print(coef.head(6))

print("\nUsage‑boosting factors (positive coefficients):")
print(coef.tail(6))

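Each numeric coefficient therefore reads as the GB change for a one-σ shift in that feature. The one-hot coefficients need more care: because the encoder keeps every level (no drop='first'), there is no reference level, so compare levels of the same feature with each other. For example, the gap between the 5G Smartphone and 4G Phone coefficients is the predicted GB difference between those two device classes.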
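
If a "GB per extra GB" reading is preferred over "GB per σ", a standardized coefficient can be divided by the scaler's fitted standard deviation. The sketch below checks this on noise-free toy data where the true slope is known to be 0.5; the data and the names toy_pipe, coef_per_sigma, and coef_per_gb are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Noise-free toy relationship: next month = 0.5 * last month + 2
toy = pd.DataFrame({"GB_MonthMinus1": [2.0, 4.0, 6.0, 8.0, 10.0]})
toy_y = 0.5 * toy["GB_MonthMinus1"] + 2.0

toy_pipe = Pipeline([
    ("prep", ColumnTransformer([("num", StandardScaler(), ["GB_MonthMinus1"])])),
    ("model", LinearRegression()),
]).fit(toy, toy_y)

scaler = toy_pipe.named_steps["prep"].named_transformers_["num"]

coef_per_sigma = toy_pipe.named_steps["model"].coef_  # GB per one-sigma change
coef_per_gb = coef_per_sigma / scaler.scale_          # GB per one raw GB

print(f"sigma of feature : {scaler.scale_[0]:.3f}")
print(f"coef per sigma   : {coef_per_sigma[0]:.3f}")
print(f"coef per raw GB  : {coef_per_gb[0]:.3f}")     # 0.500
```

In the tutorial's pipeline the numeric coefficients are the last len(num_cols) entries of coef_, since the 'num' block comes after 'cat', so the same division would apply to that slice.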

8. Persist the Trained Model

  • joblib: for saving (“pickling”) the trained pipeline to disk.
  • joblib.dump serialises the entire pipeline (preprocessing + model) to disk.
joblib.dump(pipe, "data_plan_usage_linreg.pkl")
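
Reloading works the same way in reverse. The sketch below round-trips a small pipeline through a temporary file and scores one new subscriber; the training values, the file name demo_linreg.pkl, and the subscriber are invented. Note that a pickle should generally be loaded with the same scikit-learn version that wrote it, and only from sources you trust.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up training set with two of the tutorial's columns
train = pd.DataFrame({
    "GB_MonthMinus1": [5.0, 9.0, 12.0, 3.0, 7.5, 10.0],
    "PlanType": ["Prepaid", "Postpaid", "Postpaid", "Prepaid", "Prepaid", "Postpaid"],
})
target = pd.Series([5.5, 9.8, 13.1, 3.2, 8.0, 10.9])

small_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["PlanType"]),
        ("num", StandardScaler(), ["GB_MonthMinus1"]),
    ])),
    ("model", LinearRegression()),
]).fit(train, target)

# Save and reload through a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_linreg.pkl")
    joblib.dump(small_pipe, path)
    restored = joblib.load(path)

# New rows must carry the same column names the pipeline was fitted on
new_subscriber = pd.DataFrame({"GB_MonthMinus1": [8.0], "PlanType": ["Postpaid"]})
pred_original = small_pipe.predict(new_subscriber)
pred_restored = restored.predict(new_subscriber)

print(f"Predicted next-month usage: {pred_restored[0]:.2f} GB")
print("Round trip matches:", np.allclose(pred_original, pred_restored))
```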

Summary

With fewer than 100 lines of Python, you now have an explainable data-plan usage forecaster:

  • Instant GB predictions for right-sizing plans and targeting upsell.
  • Clear levers, for example (figures are illustrative): “a 5G device adds ~3.8 GB next month compared with another device class,” or “a one-σ rise in last month's usage predicts +0.85 GB next month.”

Keep the .pkl handy; when you graduate to time‑series or gradient‑boosted models, compare their MAE against this simple, fully transparent baseline.


ProjectGurukul Team
