Support Ticket Resolution Time Prediction using Linear Regression in ML

A customer support centre wants to predict how long an open ticket will take to reach complete resolution as soon as the ticket is logged. Knowing this ahead of time allows managers to set realistic service-level expectations, auto-prioritise urgent issues, and staff shifts efficiently.

Using the public Customer Support Ticket Dataset on Kaggle, which contains over 8,000 tickets annotated with fields such as priority, category, creation channel, customer type, agent experience, and the ground-truth resolution time, we will train a linear regression baseline that estimates ResolutionTimeMinutes from the other ticket attributes. A transparent linear model reveals the first-order drivers (e.g., priority, agent seniority, issue category) that influence resolution time, providing a benchmark for any future tree-based or time-series approach.

Libraries Required

pandas # data wrangling
numpy # numerical helpers
matplotlib.pyplot # sanity‑check plots (optional)
scikit‑learn # preprocessing, model, metrics
joblib # persist the trained model

Dataset Link

Customer Support Ticket Dataset

Step-by-Step Code Implementation

Why linear regression first? Ticket durations are often roughly additive: a base handling time plus extra minutes for high priority, complex categories, attachments, and so on. A linear model captures that structure directly, so support managers can read off the minutes attributable to each factor.
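To make the additive idea concrete, here is a minimal sketch with made-up numbers. The base time, the priority shifts, and the per-attachment penalty are hypothetical placeholders, not coefficients fitted from the dataset.

```python
# Hypothetical additive model: every number here is an illustrative
# placeholder, not a coefficient learned from the Kaggle data.
BASE_MINUTES = 120.0
PRIORITY_SHIFT = {"Low": 0.0, "Normal": 15.0, "High": 40.0, "Urgent": 60.0}
MINUTES_PER_ATTACHMENT = 10.0


def additive_eta(priority: str, attachments: int) -> float:
    """Sum a base handling time and one independent term per factor."""
    return (BASE_MINUTES
            + PRIORITY_SHIFT[priority]
            + MINUTES_PER_ATTACHMENT * attachments)


eta = additive_eta("High", attachments=2)
print(f"Estimated resolution time: {eta:.0f} minutes")
```

Linear regression learns numbers of this kind from the data instead of having them set by hand.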

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load & glimpse the data

Download and unzip customer_support_data.csv from the Kaggle page above, then:

df = pd.read_csv("customer_support_data.csv")
print(df.head())
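Beyond head(), a per-column summary of types, missing values, and distinct values is often useful before modelling. The sketch below is one possible helper; the function name glimpse and the tiny demo frame are assumptions for illustration, and in practice you would pass the df loaded above.

```python
import pandas as pd


def glimpse(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),
    })


# Tiny hypothetical stand-in for the real CSV, for illustration only.
demo = pd.DataFrame({
    "Priority": ["Low", "High", None, "High"],
    "Attachments": [0, 2, 1, 2],
})
summary = glimpse(demo)
print(summary)
```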

Typical columns

Column	Example values
ResolutionTimeMinutes	target label, roughly 5 to 3,200 minutes
Priority	Low / Normal / High / Urgent
IssueCategory	Billing, Login, Shipping …
Channel	Email / Chat / Phone
CustomerType	New / Returning / VIP
AgentExperienceYears	0 – 10
Attachments	0 – 5 files
InitialResponseMin	minutes until first reply

3. Basic cleaning

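We keep only the columns the model needs and drop any row with a missing value in one of them. The numeric inputs are standardised later, inside the pipeline of step 5: standard scaling places them on an equal footing, so their coefficients read as “minutes per 1 σ change,” which aids interpretability.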

core = ['ResolutionTimeMinutes', 'Priority', 'IssueCategory',
        'Channel', 'CustomerType', 'AgentExperienceYears',
        'Attachments', 'InitialResponseMin']

df = df.dropna(subset=core).copy()
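Before dropping rows, it can help to see how much data the filter removes and which columns are responsible. The following is a minimal sketch on a hypothetical four-row frame; with the real data you would use df and core in place of raw and required.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the real df comes from read_csv above.
raw = pd.DataFrame({
    "ResolutionTimeMinutes": [45.0, np.nan, 300.0, 90.0],
    "Priority": ["Low", "High", None, "Urgent"],
})
required = ["ResolutionTimeMinutes", "Priority"]

# Count missing values per required column before dropping anything.
missing_per_col = raw[required].isna().sum()

cleaned = raw.dropna(subset=required).copy()
dropped = len(raw) - len(cleaned)
print(missing_per_col)
print(f"Dropped {dropped} of {len(raw)} rows ({dropped / len(raw):.0%})")
```

If a large share of rows disappears, imputation may be a better choice than dropping, but that is outside the scope of this baseline.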

4. Define features & label

num_cols = ['AgentExperienceYears', 'Attachments', 'InitialResponseMin']
cat_cols = ['Priority', 'IssueCategory', 'Channel', 'CustomerType']
target   = 'ResolutionTimeMinutes'

X = df[num_cols + cat_cols]
y = df[target]

5. Pre‑processing & model pipeline

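One-hot encoding gives each priority, channel, or customer segment its own intercept shift, so the model makes no false ordinal assumptions (for example, that Urgent sits a fixed number of steps above Low). Setting handle_unknown='ignore' means a category never seen during training is encoded as all zeros instead of raising an error.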

preproc = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()

pipe = Pipeline(steps=[
        ('prep',  preproc),
        ('model', linreg)
])

6. Train‑test split & training

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
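A single 80/20 split gives one estimate of the error, and that estimate can move with the random seed. As an optional check, k-fold cross-validation averages over several splits. The sketch below runs on synthetic stand-in data (not the Kaggle dataset) so that it works on its own; with the real data you would pass pipe, X, and y to cross_val_score.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data, generated only so the example is runnable.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "Priority": rng.choice(["Low", "Normal", "High"], size=n),
    "Attachments": rng.integers(0, 6, size=n),
})
y_demo = 60 + 20 * X_demo["Attachments"] + rng.normal(0, 10, size=n)

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Priority"]),
        ("num", StandardScaler(), ["Attachments"]),
    ])),
    ("model", LinearRegression()),
])

# scikit-learn reports MAE as a negative score, so flip the sign.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mae = cross_val_score(demo_pipe, X_demo, y_demo, cv=cv,
                          scoring="neg_mean_absolute_error")
mae_per_fold = -neg_mae
print(f"MAE per fold: {np.round(mae_per_fold, 1)}")
print(f"Mean MAE    : {mae_per_fold.mean():.1f} minutes")
```

Because the preprocessing lives inside the pipeline, the scaler and encoder are refitted on each training fold, which avoids leaking information from the held-out fold.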

7. Evaluation

R² and MAE together show what fraction of the variance in resolution time the model explains and the average absolute error, in minutes, that the support team can expect.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.1f} minutes")
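For readers who want to see what the two metrics compute, the sketch below reproduces them by hand on four hypothetical tickets and checks the result against scikit-learn. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true and predicted resolution times, in minutes.
y_true = np.array([30.0, 60.0, 90.0, 120.0])
y_hat = np.array([40.0, 50.0, 100.0, 110.0])

# MAE: mean of the absolute residuals.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual: {mae_manual:.1f}  sklearn: {mean_absolute_error(y_true, y_hat):.1f}")
print(f"R²  manual: {r2_manual:.3f}  sklearn: {r2_score(y_true, y_hat):.3f}")
```

Every prediction here is off by 10 minutes, so the MAE is 10; R² compares the squared errors (400 in total) with the spread of the true values around their mean (4,500 in total).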

8. Interpret influential features

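The coefficient table highlights candidate levers: if InitialResponseMin carries a large positive weight, slower first replies are associated with longer total resolution times, which makes first-response speed a natural metric to target. Read the weights as associations in the data, not as proven causal effects.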

# recover encoded feature names
ohe_feats = pipe.named_steps['prep']\
                .named_transformers_['cat']\
                .get_feature_names_out(cat_cols)

all_feats = list(ohe_feats) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
        .sort_values())

print("\nFast‑resolution factors (negative coefficients):")
print(coef.head(8))
print("\nDelay‑driving factors (positive coefficients):")
print(coef.tail(8))
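Because the numeric columns pass through StandardScaler, their coefficients are in minutes per standard deviation. Dividing by the scaler's scale_ attribute converts them back to minutes per raw unit. The sketch below shows the conversion on synthetic data built so that the true slope is exactly 3 minutes per minute of first-response delay; the data are an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic example: resolution time = 50 + 3 * first-response minutes.
first_response = np.array([[5.0], [10.0], [20.0], [40.0], [80.0]])
resolution = 50.0 + 3.0 * first_response.ravel()

scaler = StandardScaler().fit(first_response)
model = LinearRegression().fit(scaler.transform(first_response), resolution)

coef_per_sigma = model.coef_[0]                    # minutes per 1 standard deviation
coef_per_unit = coef_per_sigma / scaler.scale_[0]  # minutes per raw minute

print(f"Per σ    : {coef_per_sigma:.2f} minutes")
print(f"Per unit : {coef_per_unit:.2f} minutes")
```

In the fitted pipeline, the corresponding scaler should be reachable as pipe.named_steps['prep'].named_transformers_['num'], and its scale_ array follows the order of num_cols.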

9. Persist the trained pipeline

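Pipeline persistence freezes preprocessing rules and coefficients together; tomorrow’s CRM system can call joblib.load and generate live ETAs for brand-new tickets with zero retraining, provided every input column (including InitialResponseMin, which is only known after the first reply) is available at prediction time.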

joblib.dump(pipe, "ticket_resolution_linreg.pkl")
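The sketch below shows the round trip: save a fitted pipeline, load it back, and score a single new ticket. It trains on a tiny hypothetical frame and writes to a temporary directory so that it runs on its own; the file name demo_linreg.pkl and the data are assumptions. With the real model you would load "ticket_resolution_linreg.pkl" and pass a one-row DataFrame containing all seven feature columns.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical training set, for illustration only.
train = pd.DataFrame({
    "Priority": ["Low", "High", "Low", "High"],
    "Attachments": [0, 1, 2, 3],
})
minutes = pd.Series([30.0, 80.0, 70.0, 120.0])

demo_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Priority"]),
        ("num", StandardScaler(), ["Attachments"]),
    ])),
    ("model", LinearRegression()),
]).fit(train, minutes)

# Save and reload; the restored object carries the encoder, scaler, and model.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_linreg.pkl")
    joblib.dump(demo_pipe, path)
    restored = joblib.load(path)

# A new ticket must use the same column names as the training frame.
new_ticket = pd.DataFrame([{"Priority": "High", "Attachments": 2}])
eta_original = demo_pipe.predict(new_ticket)[0]
eta_restored = restored.predict(new_ticket)[0]
print(f"ETA from restored pipeline: {eta_restored:.1f} minutes")
```

Pickled scikit-learn objects are generally only safe to load with the same library version that saved them, so it is worth recording the scikit-learn version alongside the file.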

Summary

In fewer than 100 lines of Python, we converted raw help‑desk logs into an explainable ticket‑resolution‑time predictor. The linear model:

Delivers instant, transparent ETAs that agents can share with customers.
Quantifies bottlenecks—clarifying exactly how priority, channel, agent tenure, or response delay impacts the service timeline.

Use this baseline to benchmark future models: any tree-based or time-aware algorithm must achieve a lower mean absolute error while still providing insights that the support team can act upon.
