During the summer of 2017, I was a fellow at Insight, where I was a consultant for a company called MaestroQA. They provide a web platform used by quality assurance teams to assess the quality of their customer interactions. The ultimate goal of the consulting project was to classify chat and email customer service interactions as good or bad. How do customer service interactions impact our lives? Well, a few years ago I spent 4 hours of my birthday on the phone with a cable company trying to downgrade my plan. You've probably seen those customer service videos. Yes, that was me on my birthday. But, coming back to the story. I could start this blog post by telling you all about how I came up with a machine learning pipeline all by myself. But the truth is that my project is the result of a great collaboration with the CTO of the company, Harrison Hunter. One of the things that I enjoyed the most during my time at Insight was that collaborative process with Harrison, as well as discussing ideas and getting input from the people at Insight. Now that I've told you a little bit about the process, I can go technical and tell you all about the project. Enjoy!
MaestroQA
MaestroQA is a web platform used by different companies to improve their customer service interactions. It has different features and can be used in different ways, but one way it can be used is to identify tickets that have low quality (bad tickets), e.g., a chat conversation where the agent was rude to the customer. The manager of the quality assurance team using MaestroQA would like a way to identify those low-quality (bad) interactions and then coach the agent into doing a better job. So, wouldn't it be great to have a tool in MaestroQA that would filter out the bad tickets? That is the goal of the project: to build a classification algorithm for MaestroQA that can identify the bad tickets.
What are bad tickets?
Tickets are the email and chat conversations between the agents and the customers. But, what are bad tickets? Well, this is a good question and one topic that I had to discuss with the CTO of MaestroQA. The data consisted of email and chat tickets from different companies that were already graded by managers with a quality assurance (QA) score, where a low QA score means low quality (bad). These graded tickets could then be used to train a classification algorithm.
A company can have a distribution of scores like the one below. After a discussion with Harrison, we decided that a score lower than 90 is considered bad, which in this example company gives approximately 20% of the total tickets as bad tickets.
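As a minimal sketch of this labeling step (assuming the graded tickets live in a pandas DataFrame with a hypothetical qa_score column), the binary label could be derived like this:

```python
import pandas as pd

# Hypothetical example: graded tickets, one QA score per ticket
tickets = pd.DataFrame({
    "ticket_id": [1, 2, 3, 4, 5],
    "qa_score": [95, 88, 100, 72, 91],
})

# A score lower than 90 is considered bad: label bad = 1, good = 0
QA_THRESHOLD = 90
tickets["is_bad"] = (tickets["qa_score"] < QA_THRESHOLD).astype(int)

print(tickets["is_bad"].mean())  # proportion of bad tickets in this sample
```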
What is the goal of the classification model and how could I validate it?
The next step was to build a classification model to classify the tickets. But then the question was: how can I validate it? An initial step is to establish how the user of MaestroQA, the manager of a quality assurance team, would use the tool. For example, if a manager wants to identify 10 bad tickets in a given day, then the purpose of the model is to identify or filter 10 bad tickets.
Recall and Precision
Recall and precision are metrics calculated on a test data set; they are used to validate a classification model after it is trained.
Recall = TP / (TP + FN)
Where TP, or true positives, is the number of bad tickets that were classified as bad, and FN, or false negatives, is the number of bad tickets that were classified as good. Recall is an indication of the proportion of bad tickets the model can identify. But we don't just want to catch bad tickets; we also want the tickets the model flags to actually be bad, so let's take a look at precision.
Precision = TP / (TP + FP)
Where FP, or false positives, is the number of good tickets that were classified as bad. In other words, precision is the proportion of tickets flagged as bad by the model that are actually bad. An ideal model would have 100% precision and would identify only the bad tickets (below: Perfect Model). This ideal is hard to achieve, so what is the next best thing? A model that outputs a pool of tickets with more bad tickets than random (below: TicketFilter Model). In other words, if 20% of tickets are bad at random, the model can output a pool of tickets where the percentage of bad tickets is greater than 20%. Using the previous example: without the model, a manager has to read 50 tickets to find 10 bad tickets, so we can consider the model a success if it reduces the number of tickets the manager has to read to find 10 bad ones. A way to optimize the model is to aim for the greatest precision above a recall threshold.
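To make the two metrics concrete, here is how they could be computed with scikit-learn on a held-out test set (the labels below are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score

# 1 = bad ticket, 0 = good ticket (illustrative labels only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels from the manager's grading
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # labels predicted by the model

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
```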
How do we choose our recall threshold?
Since there are thousands of tickets, we can choose a low recall threshold. If we choose a recall of 0.1, then to get 10 bad tickets we would have to run the TicketFilter model on 500 tickets:
Total = (10/recall)/random proportion
Total = (10/0.1)/0.2
Total = 500
Where the random proportion is the proportion of bad tickets in the population. Since the companies have thousands of emails, we can use this threshold as a start, and in the future the model can be optimized for recall. If a company has fewer tickets, the recall threshold has to increase.
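This back-of-the-envelope calculation can be wrapped in a small helper function (hypothetical, just to make the arithmetic reusable):

```python
def tickets_needed(n_bad_wanted, recall, bad_proportion):
    """Estimate how many tickets must go through the model to surface
    n_bad_wanted bad tickets, given the model's recall and the proportion
    of bad tickets in the population."""
    return (n_bad_wanted / recall) / bad_proportion

print(tickets_needed(10, 0.1, 0.2))  # 500.0, matching the example above
```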
One model for all companies or one model per company?
The goal of TicketFilter is to achieve a precision greater than random. If different companies have different proportions of bad tickets, optimizing for precision across all of them would be hard. In addition, each company may have a different number of tickets, and therefore different recall thresholds might be useful. To start, I chose to build one model per company. For the following examples, I will show results for the models of one company.
Feature Engineering
After conversations with Harrison, the CTO of the company, we came up with 4 feature categories and decided to do some feature engineering and natural language processing (NLP) to extract them (a sketch of the word-category counting follows the list below).
- Length – features related to the length of the conversation and the response of the agent
- Frequency of words of different categories:
  - Assistance (e.g., help, assist)
  - Gratefulness (e.g., thanks)
  - Apologetic (e.g., sorry, apologize)
- Sentiment analysis – sentiment of both the customer and the agent in the conversation
- CSAT Score – the customer satisfaction score, given by the customer
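Here is a minimal sketch of the word-category counting; the keyword lists are illustrative, not the exact ones used in the project:

```python
import re

# Illustrative keyword lists, one set per word category
CATEGORIES = {
    "assistance": {"help", "assist", "support"},
    "gratefulness": {"thanks", "thank", "grateful"},
    "apologetic": {"sorry", "apologize", "apologies"},
}

def word_category_counts(text):
    """Count how many words in a ticket's text fall into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    return {cat: sum(word in vocab for word in words)
            for cat, vocab in CATEGORIES.items()}

print(word_category_counts("Thanks for waiting, sorry for the delay, happy to help!"))
# {'assistance': 1, 'gratefulness': 1, 'apologetic': 1}
```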
Which classification model to choose?
We wanted a classification algorithm where we could assess feature importance. For this reason, I decided to test the Logistic Regression and Random Forest algorithms.
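As a sketch of the setup (using scikit-learn, with synthetic data standing in for the engineered ticket features and labels):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered ticket features, with ~20% bad tickets
X, y = make_classification(n_samples=2000, n_features=15,
                           weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
```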
Comparison of Random Forest and Logistic Regression
The figure below shows the precision-recall curve for the Logistic Regression and Random Forest models, where each point represents the precision and recall at a given probability threshold of the classifier. In general, the greater the area under the curve, the better the algorithm's performance. The dashed lines indicate the maximum precision found at a recall greater than 0.1. The black line represents the proportion of bad tickets in the population (random). Both Random Forest and Logistic Regression showed a precision greater than random, with Random Forest slightly better than Logistic Regression. The number of tickets a manager has to read to get 10 bad tickets is reduced by almost 50% with both models.
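Continuing the sketch above, the precision-recall curve and the maximum precision at recall ≥ 0.1 could be computed like this:

```python
from sklearn.metrics import precision_recall_curve

# Predicted probability of the "bad" class on the held-out test set
probs = forest.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)

# Best precision among points whose recall clears the 0.1 threshold
mask = recall >= 0.1
print("Max precision at recall >= 0.1:", precision[mask].max())
print("Random baseline (bad proportion):", y_test.mean())
```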
Adding More Features
The following features were added to determine if they increased precision.
- Length:
  - Time to respond – the time an agent took to respond to the request of the user
  - Time to resolve – the time an agent took to resolve the request of the user
- Frequency of words:
  - Frustration (e.g., frustrate)
  - Confusion (e.g., confuse, unclear)
- Spelling errors – spelling errors made by the agent
- Frequency of tags – tickets had tags, so the 10 most frequent tags were counted and each tag was used as a feature (see the sketch after this list)
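For the tag features, one possible sketch (assuming each ticket carries a list of tags) is to count the 10 most frequent tags and use each as a column:

```python
from collections import Counter
import pandas as pd

# Illustrative tickets, each carrying a list of tags
ticket_tags = [["billing", "refund"], ["shipping"], ["billing"], ["refund", "login"]]

# The 10 most frequent tags across all tickets
top_tags = [tag for tag, _ in
            Counter(t for tags in ticket_tags for t in tags).most_common(10)]

# One feature column per top tag: how often it appears on each ticket
tag_features = pd.DataFrame(
    [{tag: tags.count(tag) for tag in top_tags} for tags in ticket_tags])
print(tag_features)
```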
Adding more features slightly increased the precision and the area under the curve, as shown in the figure.
Which are the most important features?
As shown in the figure, the most important features for this company's model were the agent's sentiment and the number of words in the conversation.
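Building on the earlier training sketch, feature importances can be read off the trained models like this (the feature names are placeholders):

```python
import pandas as pd

# Placeholder names matching the columns of the synthetic feature matrix
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Random forest exposes an importance score per feature
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(5))

# For logistic regression, coefficient magnitude plays a similar role
# (most meaningful when the features are standardized beforehand)
coefs = pd.Series(log_reg.coef_[0], index=feature_names)
print(coefs.abs().sort_values(ascending=False).head(5))
```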
Summary and Suggestions
- Built a modularized pipeline that MaestroQA can use to train TicketFilter models to filter out bad customer service interactions.
- A TicketFilter model has the potential to reduce by almost 50% the number of tickets a manager has to evaluate in order to find the same number of bad tickets.
- Features can be added to test whether they increase precision:
  - Similarity of sentences
  - Topics from topic modeling
- If recall is low for a particular model, then more tickets have to go through the model.
- Certain types of bad tickets can be missed by the model. In the future, a multi-class classification model that distinguishes different types of bad tickets could be built.