Support Ticket Resolution Time Prediction using Linear Regression in ML
A customer support centre wants to predict how long an open ticket will take to reach complete resolution as soon as the ticket is logged. Knowing this ahead of time allows managers to set realistic service-level expectations, auto-prioritise urgent issues, and staff shifts efficiently.
Using the public Customer Support Ticket Dataset on Kaggle, which contains over 8,000 tickets annotated with fields such as priority, category, creation channel, customer type, agent experience, and the ground-truth resolution time, we will train a linear regression baseline that estimates ResolutionTimeMinutes from the other ticket attributes. A transparent linear model reveals the first-order drivers (e.g., priority, agent seniority, issue category) that influence resolution time, providing a benchmark for any future tree-based or time-series approach.
Libraries Required
- pandas # data wrangling
- numpy # numerical helpers
- matplotlib.pyplot # sanity‑check plots (optional)
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained model
Dataset Link
Customer Support Ticket Dataset
Step-by-Step Code Implementation
Why linear regression first? Ticket durations are often roughly additive: a base handling time plus extra minutes for high priority, complex categories, attachments, and so on. A linear model captures this structure directly, so support managers can read off the estimated minutes attributable to each factor.
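To make the additive assumption concrete, the sketch below sums a base handling time and a few per-factor adjustments. The numbers and the helper name `additive_eta` are invented purely for illustration; they are not estimates from the dataset.

```python
# Hypothetical per-factor contributions in minutes (illustrative values only).
base_minutes = 120.0
contributions = {
    "Priority=High": 40.0,
    "IssueCategory=Billing": 30.0,
    "Attachments (2 files x 8 min each)": 2 * 8.0,
}


def additive_eta(base, parts):
    """Return the base handling time plus the sum of all adjustments."""
    return base + sum(parts.values())


eta = additive_eta(base_minutes, contributions)
print(f"Estimated resolution time: {eta:.0f} minutes")
```

A fitted linear regression produces predictions of exactly this form, with the base time played by the intercept and each adjustment by a coefficient times its feature value.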
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load & glimpse the data
Download and unzip customer_support_data.csv from the Kaggle page above, then:
df = pd.read_csv("customer_support_data.csv")
print(df.head())
Typical columns
| Column | Example values |
| --- | --- |
| ResolutionTimeMinutes | label, 5 to 3,200 minutes |
| Priority | Low / Normal / High / Urgent |
| IssueCategory | Billing, Login, Shipping … |
| Channel | Email / Chat / Phone |
| CustomerType | New / Returning / VIP |
| AgentExperienceYears | 0 – 10 |
| Attachments | 0 – 5 files |
| InitialResponseMin | minutes until first reply |
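Before modelling, it can help to confirm that the loaded file actually contains these columns. The sketch below assumes the column names from the table above (the real Kaggle file may name them differently) and uses a tiny stand-in frame instead of the CSV; `missing_columns` is a hypothetical helper.

```python
import pandas as pd

# Column names copied from the table in this article; treat them as an
# assumption about the file, not a guarantee.
EXPECTED = [
    "ResolutionTimeMinutes", "Priority", "IssueCategory", "Channel",
    "CustomerType", "AgentExperienceYears", "Attachments", "InitialResponseMin",
]


def missing_columns(frame, expected=EXPECTED):
    """List the expected columns that are absent from the frame."""
    return [col for col in expected if col not in frame.columns]


# Stand-in for pd.read_csv(...): a frame with only two of the columns.
toy = pd.DataFrame({
    "ResolutionTimeMinutes": [30, 240],
    "Priority": ["Low", "Urgent"],
})
absent = missing_columns(toy)
print("Missing columns:", absent)
```

Running the same check on the real `df` and getting an empty list means the later steps can select columns by name safely.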
3. Basic cleaning
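Drop any ticket that is missing a value in one of the core columns, since LinearRegression cannot be fitted on missing values. Standard scaling, applied later inside the pipeline, places numeric inputs on an equal footing, so their coefficients read as "minutes per 1 σ change," which aids interpretability.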
core = ['ResolutionTimeMinutes', 'Priority', 'IssueCategory',
        'Channel', 'CustomerType', 'AgentExperienceYears',
        'Attachments', 'InitialResponseMin']
df = df.dropna(subset=core).copy()
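Dropping rows silently can discard more data than expected, so it may be worth counting missing values first. The following is a minimal sketch on a made-up three-column frame; the values are invented and only the `dropna(subset=...)` call mirrors the step above.

```python
import numpy as np
import pandas as pd

# Toy frame with two incomplete rows (values are invented).
toy = pd.DataFrame({
    "ResolutionTimeMinutes": [45.0, np.nan, 310.0, 95.0],
    "Priority": ["Low", "High", None, "Urgent"],
    "Attachments": [0, 1, 2, 0],
})
core = ["ResolutionTimeMinutes", "Priority", "Attachments"]

# How many values are missing in each core column?
missing_per_column = toy[core].isna().sum()

# Same cleaning call as in the article, applied to the toy frame.
cleaned = toy.dropna(subset=core).copy()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

If a large share of rows would be dropped, imputation may be a better choice than deletion, but that is outside the scope of this baseline.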
4. Define features & label
num_cols = ['AgentExperienceYears', 'Attachments', 'InitialResponseMin']
cat_cols = ['Priority', 'IssueCategory', 'Channel', 'CustomerType']
target = 'ResolutionTimeMinutes'
X = df[num_cols + cat_cols]
y = df[target]
5. Pre‑processing & model pipeline
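One-hot encoding gives each priority, channel, or customer segment its own intercept shift, so the model makes no false ordinal assumptions. Because every level gets its own column alongside the model's intercept, the coefficients within one category are best read relative to each other rather than as absolute minutes.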
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline(steps=[
    ('prep', preproc),
    ('model', linreg)
])
6. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
7. Evaluation
R² reports the fraction of the variance in resolution time that the model explains, and MAE reports the average absolute error, in minutes, that a support manager can expect on unseen tickets.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.1f} minutes")
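To show what the two metrics measure, the sketch below computes them by hand on four invented tickets and checks the results against scikit-learn's `mean_absolute_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Invented actual and predicted resolution times, in minutes.
y_true = np.array([30.0, 60.0, 120.0, 240.0])
y_hat = np.array([40.0, 50.0, 150.0, 200.0])

# MAE: the mean of the absolute errors.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MAE manual={mae_manual:.2f}  sklearn={mean_absolute_error(y_true, y_hat):.2f}")
print(f"R2  manual={r2_manual:.4f}  sklearn={r2_score(y_true, y_hat):.4f}")
```

Because R² is measured against a model that always predicts the mean, a value near zero signals that the features add little beyond the average resolution time.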
8. Interpret influential features
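The coefficient table highlights candidate levers: if InitialResponseMin carries a large positive weight, slower first replies are associated with longer total resolution times, which makes first-response speed a natural target for improvement. These coefficients describe associations in the historical data, not guaranteed causal effects. Note also that InitialResponseMin is only known after the first reply, so an ETA issued at the moment a ticket is logged would need a model trained without it.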
# recover encoded feature names
ohe_feats = pipe.named_steps['prep']\
                .named_transformers_['cat']\
                .get_feature_names_out(cat_cols)
all_feats = list(ohe_feats) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
          .sort_values())
print("\nFast‑resolution factors (negative coefficients):")
print(coef.head(8))
print("\nDelay‑driving factors (positive coefficients):")
print(coef.tail(8))
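Because the numeric columns are standardized, their coefficients are in minutes per standard deviation. Dividing by the scaler's `scale_` converts them back to minutes per raw unit. The sketch below demonstrates this on synthetic, noise-free data where each extra minute of first-response delay adds exactly 3 minutes; the data and the name `toy_pipe` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: resolution time = 50 + 3 * first-response delay (no noise).
rng = np.random.default_rng(0)
delay = rng.uniform(0, 60, size=200).reshape(-1, 1)
minutes = 50.0 + 3.0 * delay.ravel()

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
toy_pipe.fit(delay, minutes)

# Coefficient on the standardized feature: minutes per 1-sigma change.
per_sigma = toy_pipe.named_steps["model"].coef_[0]
# Standard deviation the scaler learned for this feature.
sigma = toy_pipe.named_steps["scale"].scale_[0]
# Back on the raw scale: minutes per extra minute of delay.
per_unit = per_sigma / sigma

print(f"per sigma: {per_sigma:.2f}, sigma: {sigma:.2f}, per minute: {per_unit:.2f}")
```

In the article's pipeline the equivalent scaler would be reached through `pipe.named_steps['prep'].named_transformers_['num']`, with one `scale_` entry per column in `num_cols`.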
9. Persist the trained pipeline
Persisting the whole pipeline stores the preprocessing rules and the coefficients together, so a CRM system can later call joblib.load and generate ETAs for brand-new tickets without retraining, provided it runs a compatible scikit-learn version.
joblib.dump(pipe, "ticket_resolution_linreg.pkl")
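The loading side might look like the sketch below. To stay self-contained it trains a tiny stand-in pipeline on invented data with only two features, saves it to a temporary directory, reloads it, and scores one new ticket; the file name and all values are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented training data with one categorical and one numeric feature.
train = pd.DataFrame({
    "Priority": ["Low", "High", "Urgent", "Low", "High", "Urgent"],
    "Attachments": [0, 1, 2, 1, 0, 3],
    "ResolutionTimeMinutes": [60.0, 180.0, 300.0, 90.0, 150.0, 360.0],
})
toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Priority"]),
        ("num", StandardScaler(), ["Attachments"]),
    ])),
    ("model", LinearRegression()),
])
toy_pipe.fit(train[["Priority", "Attachments"]], train["ResolutionTimeMinutes"])

# Save and reload, as a separate serving process would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_ticket_model.pkl")
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

# A new ticket must carry the same column names used during training.
new_ticket = pd.DataFrame([{"Priority": "High", "Attachments": 2}])
eta_minutes = float(restored.predict(new_ticket)[0])
print(f"Predicted resolution time: {eta_minutes:.0f} minutes")
```

Since joblib files are pickles, they should only be loaded from trusted sources.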
Summary
In fewer than 100 lines of Python, we converted raw help‑desk logs into an explainable ticket‑resolution‑time predictor. The linear model:
- Delivers instant, transparent ETAs that agents can share with customers.
- Quantifies bottlenecks—clarifying exactly how priority, channel, agent tenure, or response delay impacts the service timeline.
Use this baseline to benchmark future models: any tree-based or time-aware algorithm must achieve a lower mean absolute error while still providing insights that the support team can act upon.
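A minimal sketch of such a benchmark follows. It assumes synthetic data in place of the ticket features and uses a median-predicting `DummyRegressor` as the floor; the `benchmark` helper is hypothetical, and any other scikit-learn regressor could be added to the same dictionary.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ticket data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 120.0 + X @ np.array([30.0, -20.0, 10.0]) + rng.normal(scale=5.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "median baseline": DummyRegressor(strategy="median"),
    "linear regression": LinearRegression(),
}


def benchmark(models):
    """Fit each model on the shared training split and return test MAE."""
    scores = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores[name] = mean_absolute_error(y_te, model.predict(X_te))
    return scores


scores = benchmark(candidates)
for name, mae in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:>18}: MAE = {mae:.1f} minutes")
```

Scoring every candidate on the same held-out split keeps the comparison fair.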