Reduce wait times, increase CSAT scores. But how?
Customer satisfaction (CSAT) scores are an indicator of customer loyalty and confidence. It is reasonable to assume that CSAT scores play an important role in reducing customer churn, increasing recurring revenue, and increasing customer lifetime value.
We analyzed customer and agent chat interactions for factors that impact customer satisfaction (CSAT) scores. Negative CSAT scores are directly correlated with four main factors, three of them specifically around wait time:
- Customers are put in a queue to wait to speak to an agent.
- Customers wait for agents to respond.
- Customers need to be transferred to another agent after speaking with an initial agent.
- Customers using digital communications are timed-out from their chat due to inactivity and must start over.
The results show: Wait time significantly impacts CSAT scores
We reviewed CSAT scores against the customer experience for more than 17,000 interactions. Because CSAT is measured on a 5-point scale, we consolidated scores of one through three into the negative class and the top two scores into the positive class.
Negative CSAT rates (scores between 1 and 3) only occur 20% of the time, but when a customer is timed out, the negative CSAT rate jumps up to 80%.
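For readers who want to reproduce this kind of breakdown on their own data, here is a minimal pandas sketch. The column names and toy values are illustrative only; they are not our dataset.

```python
import pandas as pd

# Hypothetical interaction-level data; column names and values are illustrative.
df = pd.DataFrame({
    "csat_score": [5, 4, 2, 1, 3, 5, 4, 2],
    "timed_out":  [False, False, True, True, False, False, False, True],
})

# Consolidate the 5-point scale: scores 1-3 are "negative", 4-5 are "positive".
df["negative_csat"] = df["csat_score"] <= 3

overall_rate = df["negative_csat"].mean()
timed_out_rate = df.loc[df["timed_out"], "negative_csat"].mean()

print(f"Negative CSAT rate overall:        {overall_rate:.0%}")
print(f"Negative CSAT rate when timed out: {timed_out_rate:.0%}")
```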
How ASAPP provides opportunities for higher CSAT scores
Directed automation features such as “agent autosuggest” and automated conversation summary notes reduce agent response times, and AI-driven knowledge base article retrieval models help agents streamline the troubleshooting process. This not only reduces how long customers wait for an agent response, it also improves throughput, which shortens queue times as well.
It is important that customers reach an agent who can actually solve their problems. ASAPP intent classification sorts conversations into different types based on the customer’s initial utterances. This classification helps match each customer with an appropriate agent and reduces the need for multiple transfers.
A queue check-in feature verifies that a queued customer is still available before routing them to an agent. This eliminates the time agents spend trying to connect with customers who have already left.
As agents gain efficiency and communicate on asynchronous channels, they’re able to handle multiple issues at once, further reducing queue times. Small gains in efficiency at the individual conversation level add up to larger effects on throughput for each agent and for the whole CX team.
Addressing instabilities for few-sample BERT fine-tuning
The costs of BERT fine-tuning on small datasets
Fine-tuning BERT or its variants has become one of the most popular and effective methods to tackle natural language processing tasks, especially those with limited data. BERT models have been downloaded more than 5.6 million times from Hugging Face’s public model hub.
However, fine-tuning remains unstable, especially when using the large variant of BERT (BERTLarge) on small datasets, arguably the most impactful use of BERT-style models. Identical learning processes with different random seeds often result in significantly different and sometimes degenerate models after fine-tuning, even though only a few, seemingly insignificant aspects of the learning process are affected by the random seed (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020). In layman’s terms: every time you train BERT for your task, you get different results, so you need to train again and again to get a good system. This makes scientific comparison challenging (Dodge et al., 2020) and creates large, potentially unnecessary costs.
While the variance is triggered by random seeds, we hypothesize that the major cause of this instability lies in the optimization process.
Revisiting Few-sample BERT Fine-tuning
We conducted an extensive empirical analysis of BERT fine-tuning optimization behavior along three aspects to identify the root cause of the instability:
- The optimization algorithm: we found that omitting debiasing in the BERTAdam algorithm (Devlin et al., 2019) is the main cause of degenerate models.
- The initialization: we found that re-initializing the top few layers of BERT stabilizes the fine-tuning procedure.
- The number of training iterations: we found that the model still requires hundreds of updates to converge.
1. Optimization Algorithm
We observed that omitting debiasing in the BERTAdam algorithm (Devlin et al., 2019) is the leading cause of degenerate fine-tuning runs. The pseudo-code of the Adam algorithm (Kingma & Ba, 2014) includes two bias-correction steps for the first and second moment estimates (lines 9 and 10 of the algorithm listing in our paper); BERTAdam omits exactly these two steps.
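To make the difference concrete, here is a minimal NumPy sketch of a single Adam step with the bias correction marked. This is an illustration of the algorithm, not the exact optimizer code used in our experiments.

```python
import numpy as np

def adam_update(param, grad, state, lr=2e-5, beta1=0.9, beta2=0.999,
                eps=1e-6, bias_correction=True):
    """One Adam step for a parameter array (minimal sketch)."""
    state["step"] += 1
    t = state["step"]

    # Exponential moving averages of the gradient and its element-wise square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2

    m_hat, v_hat = state["m"], state["v"]
    if bias_correction:
        # These are the two steps BERTAdam omits. Without them, the early
        # moment estimates are biased toward zero, so the first few hundred
        # updates take effectively much larger steps than intended.
        m_hat = state["m"] / (1 - beta1 ** t)
        v_hat = state["v"] / (1 - beta2 ** t)

    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

# Usage: the optimizer state starts at zero.
state = {"step": 0, "m": np.zeros(3), "v": np.zeros(3)}
w = np.array([0.1, -0.2, 0.3])
w = adam_update(w, grad=np.array([0.01, 0.02, -0.01]), state=state)
```

In practice the fix is simply to use a debiased Adam implementation, for example `torch.optim.AdamW`, or the legacy `transformers.AdamW` with `correct_bias=True`.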
Fine-tuning BERT with the original Adam (with bias correction) eradicates almost all degenerate model training outcomes and reduces the variance across multiple randomized trials. Here, we show the test performance distribution of 20 random trials with or without bias correction on four small datasets.
Since the variance is significantly reduced, practitioners can easily get a decent model within only one to five trials instead of fine-tuning up to 20 models and picking the best one.
2. Initialization
We hypothesized that the top pre-trained layers of BERT are specific to the pre-training task and may not transfer to a dissimilar downstream task. We propose to re-initialize the top few layers of BERT to ease the fine-tuning procedure. We plot the training curves with and without re-initialization below, showing consistent improvement for models with re-initialized output layers.
The following figure shows the validation performance with different numbers of re-initialized layers. As we can see, re-initializing even a single layer is already beneficial, while the best number of layers to re-initialize depends on the downstream task.
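A minimal sketch of this re-initialization, assuming the Hugging Face transformers BERT layout. The helper below relies on the library’s private `_init_weights` method, and the choice of two layers is illustrative.

```python
from transformers import BertForSequenceClassification

def reinit_top_layers(model, num_layers=2):
    """Re-initialize the top transformer blocks and the pooler of a BERT classifier."""
    # Re-initialize the last `num_layers` encoder blocks with BERT's own init scheme.
    for layer in model.bert.encoder.layer[-num_layers:]:
        layer.apply(model._init_weights)
    # The pooler feeds the classification head, so re-initialize it as well.
    model.bert.pooler.apply(model._init_weights)
    return model

model = BertForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2
)
model = reinit_top_layers(model, num_layers=2)  # then fine-tune as usual
```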
3. Number of Training Iterations
We also studied the conventional 3-epoch fine-tuning setup of BERT. Through extensive experiments on various datasets, we observed that the widely adopted 3-epoch setup is insufficient for few-sample datasets: even with few training examples, the model still requires hundreds of updates to converge.
Revisiting Existing Methods for Few-sample BERT Fine-tuning
Instability in BERT fine-tuning, especially in few-sample settings, has been receiving significant attention recently. We revisited these methods given our analysis of the fine-tuning process, focusing on the impact of using the debiased Adam instead of BERTAdam.
To illustrate, the following figure shows the mean test performance and standard deviation on four datasets. “Int. Task” stands for transferring via an intermediate task (MNLI), “LLRD” stands for layerwise learning rate decay, and “WD” stands for weight decay. Numbers that are statistically significantly better than the standard setting (left column) are in blue and underlined.
We found that the standard fine-tuning procedure using bias-corrected Adam already has a fairly small variance, making these more complex techniques largely unnecessary. Moreover, re-initialization and training longer serve as simple yet hard-to-beat baselines that outperform the previous methods, with the exception of “Int. Task” on RTE. The reason is that RTE is very similar to MNLI (the intermediate task).
Why this work matters
This work carefully investigates the current, broadly adopted optimization practices in BERT fine-tuning. Our findings significantly stabilize BERT fine-tuning on small datasets. Stable training has multiple benefits. It reduces deployment costs and time, potentially making natural language processing applications more feasible and affordable for companies and individuals with limited computational resources.
Our findings are focused on few-sample training scenarios, which opens, or at least eases, the way for new applications at reduced data cost. The reduction in cost broadens the accessibility and reduces the energy footprint of BERT-based models. Applications that require frequent re-training are now easier and cheaper to deploy given the reduced training costs. This work also simplifies the scientific comparison between future fine-tuning methods by making training more stable, and therefore easier to reproduce.
Felix Wu, PhD
Read The Complete Paper:
This work has been accepted for publication at ICLR 2021. Visit our poster during the virtual conference (Poster Session 2: May 3, 2021, at 9 a.m. PDT and 11 a.m. PDT) to chat with the authors.
Filling in the missing pieces for automation
Natural language classification is widely adopted in many applications, such as user intent prediction, recommendation systems, and information retrieval. At ASAPP, natural language classification is a core technique behind our automation capabilities.
Conventional classification takes a single-step user query and returns a prediction. However, natural language input from consumers can be underspecified and ambiguous, for example when the consumer is not an expert in the domain.
Natural language input can be hard to classify, but it’s critical to get classification right for accurate automation. Our research goes beyond conventional methods, and builds systems that interact with users to give them the best outcome.
Yoav Artzi
For example, in an FAQ suggestion application, a user may issue the query “travel out of country”. The classifier will likely find multiple relevant FAQ candidates, as seen in the figure below. In such scenarios, a conventional classifier will just return one of the predictions, even if it is uncertain and the prediction may not be ideal.
We solve this challenge by collecting missing information from users to reduce ambiguity and improve the model prediction performance. My colleagues Lili Yu, Howard Chen, Sida Wang, Tao Lei and I described our approach in our ACL 2020 paper.
We take a low-overhead approach, and add limited interaction to intent classification. Our goal is two-fold:
- study the effect of interaction on the system performance, and
- avoid the cost and complexities of interactive data collection.
We add simple interactions on top of natural language intent classification, with minimal development overhead, through clarification questions that collect missing information. These questions can be binary or multiple-choice. For example, the question “Do you have an online account?” is binary, with “yes” or “no” as answers, and the question “What is your phone operating system?” is multiple-choice, with “iOS”, “Android”, or “Windows” as answers. Given a question, the user responds by selecting one answer from the set. At each turn, the system decides whether to ask an informative question or to return its best prediction to the consumer.
The illustration above shows a running example of interactive classification in the FAQ suggestion domain. The consumer interacts with the system to find an intent from a list of possibilities. The interaction starts with the consumer’s initial query, “travel out of country”. As our system finds multiple good possible responses, highlighted on the right, it decides to ask a clarification question, “Do you need to activate global roaming service?” When the user responds with ‘yes’ it helps the system narrow down the best response candidate. After two rounds of interaction, a single good response is identified. Our system concludes the interaction by suggesting the FAQ document to the user. This is one full interaction, with the consumer’s initial query, system questions, consumer responses, and the system’s final response.
We select clarification questions to maximize the interaction efficiency, using an information gain criterion. Intuitively, we select the question whose answer provides the most information about the intent label. After receiving the consumer’s answer, we update the beliefs over intent labels iteratively using Bayes’ rule. Moreover, we balance the potential increase in accuracy against the cost of asking additional questions with a learned policy controller that decides whether to ask another question or return the final prediction.
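Conceptually, the question-selection step looks like the following toy sketch. The probabilities are invented for illustration; the real system estimates P(answer | intent) with learned natural language encoders rather than a fixed table.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def expected_information_gain(belief, answer_probs):
    """Expected reduction in entropy over intent labels from asking one question.

    belief:       shape (num_intents,), current posterior over intents.
    answer_probs: shape (num_intents, num_answers), P(answer | intent) for this question.
    """
    p_answer = belief @ answer_probs                  # P(answer) under the current belief
    gain = entropy(belief)
    for a in range(answer_probs.shape[1]):
        # Bayes' rule: posterior over intents if the user picks answer a.
        posterior = belief * answer_probs[:, a]
        posterior /= posterior.sum()
        gain -= p_answer[a] * entropy(posterior)
    return gain

# Toy example: 3 candidate intents, one binary question ("yes"/"no").
belief = np.array([0.5, 0.3, 0.2])
answer_probs = np.array([[0.9, 0.1],   # intent 0 -> likely "yes"
                         [0.2, 0.8],   # intent 1 -> likely "no"
                         [0.5, 0.5]])  # intent 2 -> uninformative
print(expected_information_gain(belief, answer_probs))
```

The system would score every candidate question this way and ask the one with the highest expected gain, unless the policy controller decides the current belief is already confident enough to answer.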
We designed two non-interactive data collection tasks to train different model components. This allows us to crowdsource the data at large scale and build a robust system at low cost. Our modeling approach leverages natural language encoding, and enables us to handle unseen intents and unseen clarification questions, further alleviating the need for expensive annotations and improving the scalability of our model.
Our work demonstrates the power of adding user interaction in two tasks: FAQ suggestion and bird identification. The FAQ task provides a trouble-shooting FAQ suggestion given a user query in a virtual assistant application. The bird identification task helps identify bird species from a descriptive text query about bird appearance. When real users interact with our system, given at most five turns of interaction, our approach improves the accuracy of a no-interaction baseline by over 100% on both tasks for simulated evaluation and over 90% for human evaluation. Even a single clarification question provides significant accuracy improvements, 40% for FAQ suggestion and 65% for bird identification in our simulated analysis.
This work allows us to quickly build an interactive classification system that improves customer experience by offering significantly more accurate predictions. It highlights how research and product complement each other at ASAPP: challenging product problems inspire interesting research ideas, and original research solutions improve product performance. Together with other researchers, Lili Yu is organizing the first workshop on interactive learning for natural language processing at the upcoming ACL 2021 conference to further discuss the methods, evaluation, and scenarios of interactive learning.
Four AI-centered CX insights from Fortune 500 CXOs
In recent years, the rapid advance of data and mobile applications has raised the bar for customer expectations. ASAPP met with a group of forward-looking Chief Experience Officers (CXOs), representing companies with over $450 billion in market value, to discuss how machine learning (ML), speech recognition, and natural language processing (NLP) are generating higher agent productivity, efficiency, and cost reductions.
Four key AI-centered insights arose from the CXOs who see 2021 as an opportunity to realize the promise of artificial intelligence to radically improve customer experience (CX).
1—Automation creates opportunity for emotive, human-driven service
On the surface, “automation” and “human-driven” seem like two opposing forces. A legacy approach considers automation solely to enable customer self-service, taking human agents out of the customer journey. While self-service will persist in specific applications, automating contact center agents’ repetitive tasks allows a focus on what matters most: providing excellent customer service and representing a brand positively.
AI is opening new avenues to create personalized experiences for customers. With an AI platform, agents know what the customer is facing at a given time, their history with the company, and their communication preferences. Automating administrative tasks such as note-taking and multitasking frees agents to be stronger brand ambassadors, with more mental energy to provide an emotive, high-touch response to customer needs.
2—AI-driven real-time insights are the next big opportunity for supervisors and coaches
Previously, an in-person presence at call centers allowed managers to monitor and assist agents shoulder-to-shoulder. But in today’s digital workplace, managers have turned to less streamlined methods, such as webcams and Slack, to support agents. This approach has made it harder for managers to supervise and coach teams, and the introduction of new digital systems has added complexity for front-line agents.
CXOs are beginning to see the promise of ML, NLP, and automatic speech recognition technologies to power live voice transcription. These AI technologies enable managers to supervise and support agents in real time, guiding agents at the moment they need assistance. After each customer engagement, ML-generated reports and summaries allow managers to digest previous interactions, understand where agents are facing challenges, and improve agent performance. With AI-analyzed data, managers can adjust strategy and coaching in real time to nimbly respond to the business challenges they face.
In the near future, CXOs expect the confluence of ML, NLP, and automatic speech recognition technologies to provide insight for the next golden opportunity: determining caller intent to more rapidly detect what a caller needs, assess their emotional state, and have them automatically routed to the appropriate agent.
CXOs are excited by the opportunities AI presents. They expect this technology to help their organizations be much more productive and at the same time, differentiate themselves by providing exceptional customer experience.
Michael Lawder
3—Measure what matters for holistic data-driven decision making
Thanks to the advance of ML, businesses are able to scale pattern recognition and automation from their own data. In 2021, the businesses we speak to are going beyond “bean-counting” to unearth correlation-driven insights for strategic business decisions. Outliers and anecdotes are steadily coming together to illustrate, for example, that mobile device users are more willing to have synchronous conversations than desktop users, an insight that may affect routing processes. To detect these patterns, CX teams are looking to ensure that they have individuals with the knowledge to contextualize the data and to build systems to reliably measure it.
However, in the effort to become a digital-first business, building a comprehensive data lake remains a challenge. Businesses are still struggling to compile timely, quality data at a granularity that can be integrated with other data sets. The preservation and architecture of legacy systems has led to continued data silos that make it hard for decision-makers to see the big picture of the customer journey. CX leaders should demand more from their IT teams and service providers to streamline this data and successfully arm businesses and teams to make changes.
And it’s not just technical IT teams who have a responsibility in building this data treasury. All employees have a role in ensuring that the business is flagging data for data-driven decision making. The first step is a cultural mindset shift to view data as an important corporate asset.
4—Today’s AI and digital technology shouldn’t be used with yesterday’s paradigm
Many of the à la carte solutions found in today’s contact centers were built for a different time. In decades past, businesses relied on outsourcing to balance costs and scale service which often came at the cost of the customer experience. In the 2010s, IVRs and chatbots offered a way to triage workloads but rarely provided a stellar experience for customers. Today, many contact centers are left sustaining a costly myriad of legacy systems that were not designed for a cohesive customer experience. A real transformation to improve customer experiences requires a rethink of how the customer journey operates.
At ASAPP, we’re doing this by focusing on making people better with AI. That has meant rebuilding everything we create from the ground up for vertically integrated AI and human productivity. We’re changing how we measure ourselves and how we interact with customers. For example, IVRs and legacy systems may deliver cost savings, but they may actually exacerbate customer frustration. The analogy I like to use for this new CX paradigm is trying to make a train fly: instead of spending significant and inefficient resources to make trains fly, at ASAPP we’re building an airplane.
Chief Experience Officers are excited by a future driven by AI: making organizations highly productive and effective by augmenting human activity and automating the world’s workflows. I can’t wait to see what new insights we’ll unearth at our next meeting.
Want to see what makes people so excited to partner with us? Get in touch, or give me a tweet at @michael_lawder.
From network compression to DenseNets
The history of artificial neural networks started in 1961 with the invention of the “Multi-Layer Perceptron” (MLP) by Frank Rosenblatt at Cornell University. Sixty years later, neural networks are everywhere: from self-driving cars and internet search engines, to chatbots and automated speech recognition systems.
The DenseNet architecture connects each layer directly with all subsequent layers (of the same size).
Shallow Networks
When Rosenblatt introduced his MLP he was limited by the computational capabilities of his time. The architecture was fairly simple: The neural network had an input layer, followed by a single hidden layer, which fed into the output neuron. In 1989 Kurt Hornik and colleagues from Vienna University of Technology proved that this architecture is a universal approximator, which loosely means that it can learn any function that is sufficiently smooth—provided the hidden layer has enough neurons and the network is trained on enough data. To this day, Hornik’s result is an important milestone in the history of machine learning, but it had some unintended consequences. As multiple layers were computationally expensive to train, and Hornik’s theorem proved that one could learn everything with just a single hidden layer, the community was hesitant to explore deep neural networks.
Deep Networks
Everything changed as cheap GPUs started to proliferate in the market. Suddenly matrix multiplications became fast, shrinking the additional overhead of deeper layers. The community soon discovered that multiple hidden layers allow a neural network to learn complicated concepts with surprisingly little data. By feeding the first hidden layer’s output into the second, the neural network could “reuse” concepts it learned early on in different ways. One way to think about this is that the first layer learns to recognize low-level features (e.g., edges or round shapes in images), whereas the last layer learns high-level abstractions arising from combinations of these low-level features (e.g., “cat” or “dog”). Because the low-level concepts are shared across many examples, the networks can be far more data-efficient than a single-hidden-layer architecture.
Network Compression
One puzzling aspect of deep neural networks is the sheer number of parameters they learn. It is puzzling because one would expect an algorithm with so many parameters to simply overfit, essentially memorizing the training data without the ability to generalize well. In practice, however, this is not what one observes. In fact, quite the opposite: neural networks excel at generalization across many tasks. In 2015 my students and I started wondering why that was the case. One hypothesis was that neural networks had millions of parameters but did not utilize them efficiently. In other words, their effective number of parameters could be smaller than their enormous architecture may suggest. To test this hypothesis we came up with an interesting experiment: if it is true that a neural network does not use all those parameters, we should be able to compress it into a much smaller size. Multilayer perceptrons store their parameters in matrices, and so we came up with a way to compress these weight matrices into a small vector, using the “hashing trick.” In our 2015 ICML paper Compressing Neural Networks with the Hashing Trick we showed that neural networks can be compressed to a fraction of their size without any noticeable loss in accuracy. In a fascinating follow-up publication, Song Han et al. showed in 2016 that combining this practice with clever compression algorithms reduces the size of neural networks even further; that work won the ICLR 2016 best paper award and started a network compression craze in the community.
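Here is a minimal sketch of the idea behind the hashing trick: instead of storing a full weight matrix, every entry is mapped into a small shared parameter vector, so many virtual weights share one underlying parameter. The code is purely illustrative and uses a stored random index table as a stand-in for the on-the-fly hash function used in the paper.

```python
import numpy as np

class HashedLinear:
    """Toy hashed weight matrix: all entries live in a small shared vector."""

    def __init__(self, in_dim, out_dim, num_params, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.normal(scale=0.05, size=num_params)  # compressed storage
        # Each (i, j) entry of the virtual weight matrix is mapped to an index
        # into `weights`; many entries share the same underlying parameter.
        # (The real method computes these indices with a hash function on the
        # fly, so no index table needs to be stored.)
        self.idx = rng.integers(0, num_params, size=(out_dim, in_dim))
        self.sign = rng.choice([-1.0, 1.0], size=(out_dim, in_dim))  # sign hash reduces bias

    def forward(self, x):
        virtual_w = self.sign * self.weights[self.idx]  # materialize the virtual weight matrix
        return virtual_w @ x

layer = HashedLinear(in_dim=256, out_dim=128, num_params=1024)  # ~32x fewer parameters
y = layer.forward(np.random.randn(256))
```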
Kilian Weinberger, PhD
Stochastic Depth
Neural network compression has many intriguing applications, ranging from automatic speech recognition on mobile phones to embedded devices. However, the research community was still wondering about the phenomenon of parameter redundancy within neural networks. The success of network compression seemed to suggest that many parameters are redundant, so we were wondering if we could utilize this redundancy to our advantage. The hypothesis was that if redundancy is indeed beneficial to learning deep networks, maybe controlling it would allow us to learn even deeper neural networks. In our 2016 ECCV paper Deep Networks with Stochastic Depth, we came up with a mechanism to increase the redundancy in neural networks. In a nutshell, we forced the network to store similar concepts across neighboring layers by randomly dropping entire layers during the training process. With this method, we could show that by increasing the redundancy we were able to train networks with over 1000 layers and still improve generalization error.
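For readers who want the mechanics, here is a minimal PyTorch sketch of a residual block with stochastic depth. The layer sizes and survival probability are illustrative, not the configuration from the paper.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block that is randomly skipped during training (illustrative sketch)."""

    def __init__(self, channels, survival_prob=0.8):
        super().__init__()
        self.survival_prob = survival_prob
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        if self.training:
            # With probability (1 - survival_prob), drop the whole block: its
            # output becomes the identity, so neighboring layers are pushed to
            # learn redundant, interchangeable features.
            if torch.rand(1).item() > self.survival_prob:
                return x
            return x + self.body(x)
        # At test time, keep the block but scale its output by the survival
        # probability, mirroring its expected contribution during training.
        return x + self.survival_prob * self.body(x)
```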
DenseNets
The success of stochastic depth was scientifically intriguing, but as a method, it was a strange algorithm. In some sense, we created extremely deep neural networks (with over 1000 layers) and then made them so ineffective that the network as a whole didn’t overfit. Somehow this seemed like the wrong approach. We started wondering if we could create an architecture that had similarly strong generalization properties but wasn’t as inefficient.
One hypothesis for why the increase in redundancy helped so much was that by forcing layers throughout the network to extract similar features, the early low-level features remained available even to later layers; maybe they were still useful when higher-level features were extracted. We therefore started experimenting with additional skip connections that would connect any layer to every subsequent layer. The idea was that each layer would have access to all previously extracted features, which has three interesting advantages (see the sketch after this list):
- It allows all layers to use all previously extracted features.
- The gradient flows directly from the loss function to every layer in the network.
- We can substantially reduce the number of parameters in the network.
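Here is a minimal PyTorch sketch of that dense connectivity pattern. The growth rate and layer count are illustrative, and the published DenseNet additionally uses bottleneck and transition layers that are omitted here.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal dense block: each layer sees the concatenation of all earlier outputs."""

    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels = in_channels + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Every layer receives all previously extracted features...
            out = layer(torch.cat(features, dim=1))
            # ...and adds its own small contribution to the collective state.
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock(in_channels=16)         # illustrative sizes
y = block(torch.randn(1, 16, 32, 32))      # output has 16 + 4*12 = 64 channels
```

Because each layer only adds a small number of new feature maps (the growth rate) and reuses everything that came before, the network stays compact while every layer still sees both low-level and high-level features.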
Our initial results with this architecture were very exciting. We could create much smaller networks than the previous state-of-the-art, ResNets, and even outperform stochastic depth. We refer to this architecture as DenseNets, and the corresponding publication was honored with the 2017 CVPR best paper award.
A comparison of the DenseNet and ResNet architecture on CIFAR-10. The DenseNet is more accurate and parameter efficient.
If previous networks could be interpreted as extracting a “state” that is modified and passed on from one layer to the next, DenseNets changed this setup so that each layer has access to all the “knowledge” extracted from all previous layers and adds its own output to this collective state. Instead of copying features from one layer to the next, over and over, the network can use its limited capacity to learn new features. Consequently, DenseNets are far more parameter efficient than previous networks and result in significantly more accurate predictions. For example, on the popular CIFAR-10 benchmark dataset, they almost halved the error rate of ResNets. Most impressively, out-of-the-box, they achieved new record performance on the three most prominent image classification data sets of the time: CIFAR-10, CIFAR-100, and ImageNet.
There may be other benefits from the additional skip connections. In 2017, Li et al. examined the loss surface around the local minimum that neural networks converge to. They found that as networks became deeper, these surfaces became highly non-convex and chaotic, making it harder to find a local minimum that generalizes beyond the training data. Skip connections smooth out these surfaces, aiding the optimization process. The exact reasons are still the topic of open research.
Increasing agent concurrency without overwhelming agents
The platform we’ve built is centered around making agents highly effective and efficient while still empowering them to elevate the customer experience. All too often we see companies making painful tradeoffs between efficiency and quality. One of the most common ways this happens is with digital and messaging interactions: the number of conversations agents handle at a time (concurrency) gets increased, but the agents aren’t given tools to handle those additional conversations.
In an effort to increase agent output, a relatively ‘easy’ lever to pull is raising the agent’s max concurrency from 2-3 chats to 5+ concurrent chats. However, in practice, making such a drastic change without the right safeguards in place can be counterproductive. While agent productivity overall may be higher, it often comes at the expense of customer satisfaction and agent burnout, both of which can lead to churn over time.
This is largely explained by the volatility of handling concurrent customers. While there are moments when handling 5+ chats concurrently is manageable and even comfortable for the agent (e.g., because several customers are idle or slow to respond), at other moments all 5+ customers may demand attention for high-complexity concerns at exactly the same time. These spikes in demand overwhelm the agent and inevitably leave customers frustrated by slower responses and resolution.
The ASAPP approach to increasing concurrency addresses volatility in several ways.
Partial automation to minimize agent effort
The ASAPP Customer Experience Performance (CXP) platform blunts the burden of demand spikes that can occur at higher concurrencies by layering in partial automation. Agents can launch auto-pilot functionality at numerous points in the conversation, engaging the system to manage repetitive tasks—such as updating a customer’s billing address and scheduling a technician visit—for the agent.
With a growing number of partial automation opportunities, the system can balance the agent’s workload by ensuring that, at any given time, at least one or two of the agent’s assigned chats require little to no attention. In a recent case study, the introduction of a single partial automation use case made agents more than 20 seconds faster on concurrent chats.
Considering factors like agent experience, complexity and urgency of issues they’re already handling, and customer responsiveness, the CXP platform can dynamically set concurrency levels.
Cosima Travis
Real-time ranking to help focus the agent
Taking into account numerous factors, including customer wait time, sentiment, issue severity, and lifetime value, the platform can rank the urgency of each task on the agent’s plate. This alleviates the burden of deciding what to focus on next when agents are juggling a higher number of concurrent conversations.
Dynamic complexity calculator to balance agent workload
We reject the idea of a fixed ‘max slot’ number per agent. Instead, we’re building a more dynamic system that doesn’t treat all chats as equal occupancy. It constantly evaluates how much of an agent’s attention each chat requires and dynamically adjusts the concurrency level for that agent. That helps ensure that customers are well attended to while the agent is not overworked.
At certain points, five chats might feel overwhelming, while at others it can feel quite manageable. Many factors play a role, including the customer’s intent, the complexity of that intent, the agent’s experience, the customer’s sentiment, the types of tools required to resolve the issue, and how close the issue is to resolution. These all feed into a real-time occupancy model that dynamically manages the appropriate level of concurrency for each agent at any given time. This flexibility enables companies to drive efficiency in a way that keeps both customers and agents much happier.
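To make the idea concrete, here is a toy sketch of the kind of calculation such a model performs. This is purely illustrative; the factors, weights, and thresholds are invented for the example and are not ASAPP’s production occupancy model.

```python
from dataclasses import dataclass

@dataclass
class Chat:
    complexity: float      # 0-1, estimated difficulty of the customer's intent
    sentiment: float       # 0-1, where 1 = very negative and needs careful handling
    near_resolution: bool  # conversation is close to wrapping up

def chat_attention(chat: Chat) -> float:
    """Rough share of an agent's attention one chat demands (toy heuristic)."""
    score = 0.4 * chat.complexity + 0.3 * chat.sentiment + 0.3
    return score * (0.5 if chat.near_resolution else 1.0)

def can_accept_new_chat(active_chats, agent_capacity=1.0) -> bool:
    # Assign a new conversation only if the agent's estimated total attention
    # load leaves room for a typical incoming chat.
    load = sum(chat_attention(c) for c in active_chats)
    return load + 0.4 <= agent_capacity

chats = [Chat(0.2, 0.1, True), Chat(0.6, 0.4, False)]
print(can_accept_new_chat(chats))
```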
While our team takes an experimental, research-driven approach by testing new features frequently, we are uncompromising in our effort to preserve the highest quality interaction for the customer and agent. In our experience, the only way to maintain this quality while increasing agent throughput is with the help of AI-driven automation and adaptive UX features.