Customer Support Cost Prediction with Lasso Regression in ML
Every extra minute an agent spends replying to a customer ticket increases operating expense. If support managers could estimate the handling cost before a new request reaches an agent—based on channel, sentiment, and customer metadata—they could route, automate, or upsell self‑service more effectively. This project trains a Lasso‑regularised linear model that:
- Predicts the handling cost (USD) for an incoming customer‑support conversation on Twitter.
- Highlights the handful of request attributes (e.g., negative sentiment, product line, off‑hours arrival) that most inflate cost, thanks to Lasso’s ℓ¹ penalty shrinking weak predictors to zero.
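For reference, the objective that scikit‑learn's Lasso minimises can be written as below. The notation is an assumption added here, not taken from the project: n conversations, feature matrix X, cost vector y, coefficient vector w, and penalty strength α.

```latex
\hat{w} \;=\; \arg\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\lVert w \rVert_1
```

Because the ℓ¹ term has a kink at zero, a large enough α sets weak coefficients to exactly 0 instead of merely making them small, which is what yields the short list of cost drivers.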
Libraries Required
| Purpose | Library |
| --- | --- |
| Data wrangling | pandas, numpy |
| Text & time parsing | nltk (tokenisation, sentiment), dateutil |
| Visualisation | matplotlib, seaborn |
| ML pipeline | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, Pipeline, Lasso, GridSearchCV |
| Metrics | mean_squared_error, r2_score |
Dataset Link
Step by Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dateutil import parser
import nltk, re, string
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
2. Download & load dataset
The dataset supplies roughly 3 million tweets (customer messages and brand replies) from dozens of companies.
# one‑time (needs Kaggle API):
# kaggle datasets download -d thoughtvector/customer-support-on-twitter -p data --unzip
df = pd.read_csv("data/twcs.csv") # 2.8 M rows
3. Feature & target engineering
- Handling‑cost proxy: minutes between the first and the last company reply in a conversation × $0.50 per agent‑minute. Replace the rate with your actual cost rate, or the timestamps with those from your ticket system.
- Features: sentiment score, message length (# words, # chars), and shift (business vs off‑hours). All are known the moment the first customer tweet arrives, which avoids target leakage.
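As a quick check of the cost proxy, here is a minimal sketch on made‑up data. The conversation IDs and timestamps are hypothetical; only the $0.50 rate comes from the text above.

```python
import pandas as pd

COST_PER_MIN = 0.50  # USD per agent-minute, the rate assumed in this article

# Hypothetical company replies for two conversations (made-up timestamps)
replies = pd.DataFrame({
    "conversation_id": [1, 1, 1, 2],
    "created_at": pd.to_datetime([
        "2017-10-31 10:00", "2017-10-31 10:12", "2017-10-31 10:30",
        "2017-10-31 14:05",
    ]),
})

# First and last company reply per conversation
span = replies.groupby("conversation_id")["created_at"].agg(["min", "max"])

# Minutes between first and last reply, then converted to dollars
duration_min = (span["max"] - span["min"]).dt.total_seconds() / 60
support_cost = duration_min * COST_PER_MIN
print(support_cost)
```

Note that a conversation with a single company reply has a duration of zero and therefore a proxy cost of $0, which is a limitation of this proxy worth keeping in mind when reading the model's errors.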
COST_PER_MIN = 0.50 # USD per support minute
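# Identify threads (conversation_id groups) and the company reply timestamps.
# NOTE: twcs.csv carries tweet_id / in_response_to_tweet_id links but no
# conversation_id column, so derive one before running the lines below.
df['created_at'] = pd.to_datetime(df['created_at'])
df = df.sort_values(['conversation_id', 'created_at'])
# Company replies are the rows with inbound == False (brand author_ids are
# handles such as "AppleSupport", so a 'Brand' prefix filter matches nothing).
company = df[~df['inbound']]
first_company = company.groupby('conversation_id')['created_at'].first()
last_company = company.groupby('conversation_id')['created_at'].last()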
handle_df = pd.DataFrame({
    'first_reply': first_company,
    'last_reply' : last_company
}).dropna()
handle_df['duration_min'] = (handle_df['last_reply'] - handle_df['first_reply']).dt.total_seconds() / 60
handle_df['support_cost'] = handle_df['duration_min'] * COST_PER_MIN
# Merge back minimal metadata for Lasso
meta = df.groupby('conversation_id').first()[['author_id','text']]
data = handle_df.merge(meta, left_index=True, right_index=True)
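The grouping above relies on a conversation_id column. If your copy of the data only has tweet_id and in_response_to_tweet_id, one possible way to build such a column is to follow each tweet's reply chain back to its root. The sketch below is an assumption, not part of the original notebook: the helper name assign_conversation_ids is hypothetical, the toy rows are made up, and it assumes reply chains contain no cycles.

```python
import pandas as pd

def assign_conversation_ids(tweets: pd.DataFrame) -> pd.Series:
    """Label every tweet with the tweet_id of the root of its reply chain."""
    # Map each tweet to its parent (None when it starts a thread)
    parent = {}
    for tid, p in zip(tweets["tweet_id"], tweets["in_response_to_tweet_id"]):
        parent[int(tid)] = None if pd.isna(p) else int(p)

    roots = {}  # cache: tweet_id -> root tweet_id

    def find_root(tid: int) -> int:
        path = []
        while tid not in roots:
            p = parent[tid]
            if p is None or p not in parent:  # thread start, or parent missing from data
                roots[tid] = tid
                break
            path.append(tid)
            tid = p
        root = roots[tid]
        for t in path:  # remember the answer for every tweet we walked through
            roots[t] = root
        return root

    return tweets["tweet_id"].map(lambda t: find_root(int(t)))

# Toy data: tweets 1-3 form one thread, tweets 4-5 another
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "in_response_to_tweet_id": [None, 1, 2, None, 4],
    "inbound": [True, False, True, True, False],
})
toy["conversation_id"] = assign_conversation_ids(toy)
print(toy)
```

On the full file a plain Python loop like this may take a while; it is meant to show the idea rather than to be the fastest option.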
Add sentiment & simple text features
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
def clean(txt):
    txt = txt.lower()
    txt = re.sub(r'https?://\S+', '', txt)
    txt = txt.translate(str.maketrans('', '', string.punctuation))
    return txt
data['clean_text'] = data['text'].astype(str).apply(clean)
data['sentiment'] = data['clean_text'].apply(lambda t: sia.polarity_scores(t)['compound'])
data['char_len'] = data['clean_text'].str.len()
data['word_len'] = data['clean_text'].str.split().str.len()
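# Extract shift (business hours vs off‑hours) from the customer's first tweet,
# not the company's first reply, so the feature is known on arrival.
arrival = df.groupby('conversation_id')['created_at'].min().reindex(data.index)
data['hour'] = arrival.dt.hour  # created_at carries a +0000 (UTC) offset
data['shift'] = np.where(data['hour'].between(9, 17), 'business_hours', 'off_hours')  # 09:00 to 17:59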
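To see what the text features look like for a single message, here is a small standalone sketch. The tweet is made up, and the sentiment step is left out so the snippet runs without downloading the VADER lexicon.

```python
import re
import string

def clean(txt: str) -> str:
    """Lower-case the text, then strip URLs and punctuation."""
    txt = txt.lower()
    txt = re.sub(r'https?://\S+', '', txt)
    return txt.translate(str.maketrans('', '', string.punctuation))

tweet = "My phone is BROKEN!!! See https://t.co/abc"  # made-up example
cleaned = clean(tweet)

features = {
    "char_len": len(cleaned),          # characters after cleaning
    "word_len": len(cleaned.split()),  # whitespace-separated tokens
}
print(repr(cleaned), features)
```

Removing the URL leaves a trailing space in the cleaned string, so char_len counts it; call .strip() inside clean() if you would rather exclude it.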
Define X & y
y = data['support_cost']
X = data[['sentiment','char_len','word_len','shift']]
4. Pre‑processing & Lasso pipeline
OneHotEncoder converts the categorical shift column into a 0/1 indicator; StandardScaler z‑scores the numeric columns so that Lasso’s penalty treats them on a common scale.
cat_cols = ['shift']
num_cols = ['sentiment','char_len','word_len']
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
pipe = Pipeline([
    ('prep', preprocess),
    ('lasso', Lasso(max_iter=10000, random_state=42))
])
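To make the effect of the preprocessing step concrete, the sketch below runs the same two transformers on four made‑up rows. The values are hypothetical, and sparse_threshold=0 is added only so the result is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Four hypothetical conversations
toy = pd.DataFrame({
    "sentiment": [-0.8, -0.2, 0.0, 0.6],
    "char_len": [240, 120, 60, 30],
    "word_len": [45, 22, 11, 6],
    "shift": ["off_hours", "business_hours", "off_hours", "business_hours"],
})

prep = ColumnTransformer(
    [
        ("cat", OneHotEncoder(drop="first"), ["shift"]),
        ("num", StandardScaler(), ["sentiment", "char_len", "word_len"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
Z = prep.fit_transform(toy)

print(Z.shape)                         # one indicator column + three scaled columns
print(Z[:, 0])                         # 1.0 where shift == 'off_hours'
print(Z[:, 1:].mean(axis=0).round(6))  # scaled columns are centred on zero
```

With drop='first' the alphabetically first category (business_hours) becomes the baseline, so the single indicator column measures the extra cost of an off‑hours arrival.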
5. Train/test split & hyper‑parameter search
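Wrapping the preprocessing and the model in one Pipeline prevents data leakage: the scaler and encoder are re‑fitted on the training folds only. A log‑spaced α search balances sparsity and fit, with five‑fold CV selecting the best value.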
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
param_grid = {'lasso__alpha': np.logspace(-3, 1, 25)} # 0.001→10
gs = GridSearchCV(pipe, param_grid, cv=5,
                  scoring='neg_root_mean_squared_error', n_jobs=-1)
gs.fit(X_train, y_train)
print("Best α:", gs.best_params_['lasso__alpha'])
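The sparsity side of that trade‑off can be illustrated on synthetic data. In this sketch (all data simulated, not from the tweet dataset) only the first two of five features influence the target, and the number of surviving coefficients is counted across a coarse version of the same α range.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# Only features 0 and 1 matter; features 2-4 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

alphas = np.logspace(-3, 1, 5)  # 0.001, 0.01, 0.1, 1, 10
nonzero = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    nonzero.append(int(np.sum(model.coef_ != 0)))
    print(f"alpha={alpha:g}  non-zero coefficients={nonzero[-1]}")
```

Small α keeps most features and fits closely; large α removes everything, including the real signal. Cross‑validation picks the value in between that predicts best on held‑out folds.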
6. Evaluate on the hold‑out set
RMSE gives the typical dollar error per conversation (large misses weigh more heavily than in a plain average); R² conveys the share of variance explained.
y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # the squared= argument is gone in recent scikit-learn
r2 = r2_score(y_test, y_pred)
print(f"RMSE: ${rmse:.2f} | R²: {r2:.3f}")
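For readers who want to see what the two metrics compute, here is a small sketch that derives them by hand on three made‑up costs and compares the result with scikit‑learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted costs in USD
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# RMSE: square root of the mean squared error
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE manual={rmse_manual:.3f}  sklearn={np.sqrt(mean_squared_error(y_true, y_hat)):.3f}")
print(f"R2   manual={r2_manual:.3f}  sklearn={r2_score(y_true, y_hat):.3f}")
```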
7. Interpret feature importance
The coefficient chart reveals actionable levers. Positive bars mark features that raise predicted handling cost, negative bars mark features that lower it, and anything Lasso eliminated sits at zero; off‑hours arrival and long, negative‑tone messages are the candidates to watch.
feature_names = np.hstack([
    gs.best_estimator_.named_steps['prep'].transformers_[0][1].get_feature_names_out(cat_cols),
    num_cols
])
coefs = gs.best_estimator_.named_steps['lasso'].coef_
imp = pd.Series(coefs, index=feature_names).sort_values(key=abs, ascending=False)
plt.figure(figsize=(7,4))
imp.plot(kind='barh'); plt.gca().invert_yaxis()
plt.title('Top Drivers of Support Cost'); plt.xlabel('Coefficient (Δ USD)'); plt.show()
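One caveat when reading the chart: the numeric features were z‑scored, so their coefficients are in USD per standard deviation, not per character or per word. A minimal sketch of converting back to original units, on simulated data with a hypothetical true effect of $0.02 per character, could look like this:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
char_len = rng.uniform(10, 280, size=1000)
# Simulated cost: $0.02 per character plus noise (made-up relationship)
cost = 0.02 * char_len + rng.normal(scale=0.5, size=1000)

toy_pipe = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(alpha=0.001))])
toy_pipe.fit(char_len.reshape(-1, 1), cost)

per_sd = toy_pipe.named_steps["lasso"].coef_[0]  # USD per one standard deviation
sd = toy_pipe.named_steps["scale"].scale_[0]     # standard deviation of char_len
per_char = per_sd / sd                           # USD per additional character

print(f"per SD: {per_sd:.3f} USD, per character: {per_char:.4f} USD")
```

The one‑hot shift indicator is not scaled, so its coefficient can be read directly as the dollar difference relative to the dropped baseline category.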
Summary
This notebook demonstrates how a Lasso‑based pipeline converts raw social‑support chats into a dollar forecast of service cost—all before an agent types a word. By exposing the most expensive request attributes, support leaders can:
- Prioritise automation or senior agents for high‑cost tickets.
- Schedule staff more efficiently around off‑hours spikes.
- Monitor cost trends every week—re‑training is a single fit() thanks to the encapsulated Pipeline.
The result: proactive, cost‑aware customer‑support operations driven by transparent ML, not guesswork.