Filling in the missing pieces for automation
Natural language classification is widely adopted in many applications, such as user intent prediction, recommendation systems, and information retrieval. At ASAPP, natural language classification is a core technique behind our automation capabilities.
Conventional classification takes a single-step user query and returns a prediction. However, natural language input from consumers can be underspecified and ambiguous, for example when they are not experts in the domain.
Natural language input can be hard to classify, but it’s critical to get classification right for accurate automation. Our research goes beyond conventional methods, and builds systems that interact with users to give them the best outcome.
Yoav Artzi
For example, in an FAQ suggestion application, a user may issue the query “travel out of country”. The classifier will likely find multiple relevant FAQ candidates, as seen in the figure below. In such scenarios, a conventional classifier will just return one of the predictions, even if it is uncertain and the prediction may not be ideal.
We solve this challenge by collecting missing information from users to reduce ambiguity and improve the model prediction performance. My colleagues Lili Yu, Howard Chen, Sida Wang, Tao Lei and I described our approach in our ACL 2020 paper.
We take a low-overhead approach, and add limited interaction to intent classification. Our goal is two-fold:
- study the effect of interaction on the system performance, and
- avoid the cost and complexities of interactive data collection.
We add simple interactions on top of natural language intent classification, with minimal development overhead, through clarification questions that collect missing information. Those questions can be binary or multiple-choice. For example, the question “Do you have an online account?” is binary, with “yes” or “no” as answers, and the question “What is your phone operating system?” is multiple-choice, with “iOS”, “Android”, or “Windows” as answers. Given a question, the user responds by selecting one answer from the set. At each turn, the system determines whether to ask an informative question or to return the best prediction to the consumer.
The illustration above shows a running example of interactive classification in the FAQ suggestion domain. The consumer interacts with the system to find an intent from a list of possibilities. The interaction starts with the consumer’s initial query, “travel out of country”. As our system finds multiple good possible responses, highlighted on the right, it decides to ask a clarification question, “Do you need to activate global roaming service?” When the user responds “yes”, the system can narrow down the best response candidate. After two rounds of interaction, a single good response is identified. Our system concludes the interaction by suggesting the FAQ document to the user. This is one full interaction, with the consumer’s initial query, system questions, consumer responses, and the system’s final response.
We select clarification questions to maximize interaction efficiency, using an information gain criterion. Intuitively, we select the question whose answer provides the most information about the intent label. After receiving the consumer’s answer, we iteratively update our beliefs over intent labels using Bayes’ rule. Moreover, we balance the potential increase in accuracy against the cost of asking additional questions with a learned policy controller that decides whether to ask another question or return the final prediction.
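The question-selection and belief-update steps can be sketched in Python. This is a minimal illustration under stated assumptions, not the system from the paper: the answer model p(answer | intent, question) is a hand-written table (the post says the real system builds on natural language encoders), a fixed confidence threshold stands in for the learned policy controller, and all intents, questions, and probabilities are hypothetical.

```python
import math
from typing import Dict

Belief = Dict[str, float]
# answer_model[question][label][answer] = p(answer | label, question)
AnswerModel = Dict[str, Dict[str, Dict[str, float]]]


def normalize(dist: Belief) -> Belief:
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()}


def entropy(dist: Belief) -> float:
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def bayes_update(belief: Belief, question: str, answer: str, model: AnswerModel) -> Belief:
    # posterior(label) is proportional to prior(label) * p(answer | label, question)
    return normalize({label: p * model[question][label][answer]
                      for label, p in belief.items()})


def information_gain(belief: Belief, question: str, model: AnswerModel) -> float:
    # IG(question) = H(label) - E_answer[ H(label | answer) ]
    answers = next(iter(model[question].values())).keys()
    expected_entropy = 0.0
    for answer in answers:
        p_answer = sum(p * model[question][label][answer] for label, p in belief.items())
        if p_answer > 0:
            expected_entropy += p_answer * entropy(bayes_update(belief, question, answer, model))
    return entropy(belief) - expected_entropy


def next_action(belief: Belief, questions, model: AnswerModel, threshold: float = 0.8):
    """Assumed stand-in for the learned policy controller: stop once confident enough."""
    best_label = max(belief, key=belief.get)
    if belief[best_label] >= threshold or not questions:
        return ("predict", best_label)
    return ("ask", max(questions, key=lambda q: information_gain(belief, q, model)))


# Hypothetical intents, questions, and answer probabilities for illustration only.
ROAMING_Q = "Do you need to activate global roaming service?"
ACCOUNT_Q = "Do you have an online account?"
model: AnswerModel = {
    ROAMING_Q: {
        "activate_roaming":   {"yes": 0.9, "no": 0.1},
        "travel_notice":      {"yes": 0.2, "no": 0.8},
        "intl_calling_rates": {"yes": 0.5, "no": 0.5},
    },
    ACCOUNT_Q: {  # answers do not depend on the intent, so this question is uninformative
        "activate_roaming":   {"yes": 0.5, "no": 0.5},
        "travel_notice":      {"yes": 0.5, "no": 0.5},
        "intl_calling_rates": {"yes": 0.5, "no": 0.5},
    },
}
prior = normalize({"activate_roaming": 1.0, "travel_notice": 1.0, "intl_calling_rates": 1.0})

action = next_action(prior, [ROAMING_Q, ACCOUNT_Q], model)  # ("ask", ROAMING_Q)
posterior = bayes_update(prior, ROAMING_Q, "yes", model)    # activate_roaming ≈ 0.5625
```

The uninformative account question has zero information gain, so the roaming question is asked first; after a “yes”, the belief shifts toward the roaming intent but is not yet confident enough to stop.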
We designed two non-interactive data collection tasks to train different model components. This allows us to crowdsource the data at large scale and build a robust system at low cost. Our modeling approach leverages natural language encoding, and enables us to handle unseen intents and unseen clarification questions, further alleviating the need for expensive annotations and improving the scalability of our model.
Our work demonstrates the power of adding user interaction in two tasks: FAQ suggestion and bird identification. The FAQ task suggests a troubleshooting FAQ given a user query in a virtual assistant application. The bird identification task identifies bird species from a descriptive text query about bird appearance. Given at most five turns of interaction, our approach improves the accuracy of a no-interaction baseline by over 100% on both tasks in simulated evaluation, and by over 90% when real users interact with the system. Even a single clarification question provides significant accuracy improvements: 40% for FAQ suggestion and 65% for bird identification in our simulated analysis.
This work allows us to quickly build an interactive classification system that improves customer experience by offering significantly more accurate predictions. It highlights how research and product complement each other at ASAPP: challenging product problems inspire interesting research ideas, and original research solutions improve product performance. Together with other researchers, Lili Yu is organizing the first workshop on interactive learning for natural language processing at the upcoming ACL 2021 conference to further discuss the methods, evaluation, and scenarios of interactive learning.
From network compression to DenseNets
The history of artificial neural networks started in 1961 with the invention of the “Multi-Layer Perceptron” (MLP) by Frank Rosenblatt at Cornell University. Decades later, neural networks are everywhere: from self-driving cars and internet search engines to chatbots and automated speech recognition systems.
The DenseNet architecture connects each layer directly with all subsequent layers (of the same size).
Shallow Networks
When Rosenblatt introduced his MLP, he was limited by the computational capabilities of his time. The architecture was fairly simple: the neural network had an input layer, followed by a single hidden layer, which fed into the output neuron. In 1989 Kurt Hornik and colleagues from Vienna University of Technology proved that this architecture is a universal approximator, which loosely means that it can approximate any sufficiently smooth function arbitrarily well, provided the hidden layer has enough neurons and the network is trained on enough data. To this day, Hornik’s result is an important milestone in the history of machine learning, but it had some unintended consequences. Because multiple layers were computationally expensive to train, and Hornik’s theorem showed that a single hidden layer was enough in principle, the community was hesitant to explore deep neural networks.
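To make the universal-approximation idea concrete, here is a minimal sketch with hand-set rather than trained weights: each hidden unit is a steep sigmoid that contributes one step of a staircase, and a linear output neuron sums the steps. It illustrates the intuition behind Hornik’s result, not his proof; the target function x², the grid construction, and the steepness value are arbitrary choices made for illustration.

```python
import math
from typing import Callable


def sigmoid(z: float) -> float:
    # Numerically stable logistic function (avoids overflow for large |z|).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def build_shallow_net(f: Callable[[float], float], n_hidden: int, steepness: float = 20.0):
    """Hand-set weights for input -> one hidden layer -> output neuron.

    Hidden unit i is a steep sigmoid centred between grid points t_{i-1} and t_i;
    its output weight is the increase f(t_i) - f(t_{i-1}), so the network builds
    a smoothed staircase approximation of f on [0, 1].
    """
    grid = [i / n_hidden for i in range(n_hidden + 1)]
    centres = [(grid[i - 1] + grid[i]) / 2 for i in range(1, n_hidden + 1)]
    out_weights = [f(grid[i]) - f(grid[i - 1]) for i in range(1, n_hidden + 1)]
    k = steepness * n_hidden  # input-to-hidden weight (controls step sharpness)
    out_bias = f(grid[0])

    def net(x: float) -> float:
        hidden = [sigmoid(k * (x - c)) for c in centres]                   # hidden layer
        return out_bias + sum(w * h for w, h in zip(out_weights, hidden))  # linear output

    return net


def max_error(f, net, n_points: int = 201) -> float:
    xs = [i / (n_points - 1) for i in range(n_points)]
    return max(abs(f(x) - net(x)) for x in xs)


target = lambda x: x * x
err_small = max_error(target, build_shallow_net(target, 5))
err_large = max_error(target, build_shallow_net(target, 50))
```

Going from 5 to 50 hidden units shrinks the worst-case error on [0, 1], which is the “enough neurons” part of the theorem in miniature.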
Deep Networks
Everything changed as cheap GPUs started to flood the market. Suddenly matrix multiplications became fast, shrinking the additional overhead of deeper layers. The community soon discovered that multiple hidden layers allow a neural network to learn complicated concepts with surprisingly little data. By feeding the first hidden layer’s output into the second, the neural network could “reuse” concepts it learned early on in different ways. One way to think about this is that the first layer learns to recognize low-level features (e.g. edges or round shapes in images), whereas the last layer learns high-level abstractions arising from combinations of these low-level features (e.g. “cat”, “dog”). Because the low-level concepts are shared across many examples, the networks can be far more data-efficient than a single hidden layer architecture.
Network Compression
One puzzling aspect of deep neural networks is the sheer number of parameters they learn. It is puzzling because one would expect an algorithm with so many parameters to simply overfit, essentially memorizing the training data without the ability to generalize. In practice, however, this is not what one observes. In fact, quite the opposite: neural networks excel at generalization across many tasks. In 2015 my students and I started wondering why that was the case. One hypothesis was that neural networks had millions of parameters but did not utilize them efficiently. In other words, their effective number of parameters could be smaller than their enormous architecture may suggest. To test this hypothesis we came up with an interesting experiment. If it is true that a neural network does not use all those parameters, we should be able to compress it into a much smaller size. Multilayer perceptrons store their parameters in matrices, so we came up with a way to compress these weight matrices into a small vector, using the “hashing trick.” In our 2015 ICML paper Compressing Neural Networks with the Hashing Trick, we showed that neural networks can be compressed to a fraction of their size without any noticeable loss in accuracy. In a fascinating follow-up publication, Song Han et al. showed in 2016 that combining this practice with clever compression algorithms reduces the size of neural networks even further. That paper won the ICLR 2016 best paper award and started a network compression craze in the community.
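The core of the hashing trick can be sketched in a few lines of Python. This is an illustrative sketch, not the code from the ICML paper: a hash of each (row, column) index pair picks which of a small number of shared parameters that matrix entry uses, so the full weight matrix is never stored. The SHA-256-based hash, the extra sign hash (borrowed from the feature-hashing convention), and the layer sizes are assumptions.

```python
import hashlib
import random
from typing import List


def _hash(i: int, j: int, salt: str, modulo: int) -> int:
    # Deterministic hash of the (row, column) index pair into [0, modulo).
    digest = hashlib.sha256(f"{salt}:{i}:{j}".encode()).digest()
    return int.from_bytes(digest[:8], "little") % modulo


class HashedLayer:
    """Fully connected layer whose n_out x n_in 'virtual' weight matrix is
    backed by only n_buckets real parameters shared via hashing."""

    def __init__(self, n_in: int, n_out: int, n_buckets: int, seed: int = 0):
        rng = random.Random(seed)
        self.n_in, self.n_out, self.seed = n_in, n_out, seed
        self.params: List[float] = [rng.gauss(0.0, 0.1) for _ in range(n_buckets)]

    def weight(self, i: int, j: int) -> float:
        # Virtual weight V[i][j] = sign(i, j) * params[h(i, j)]; the sign hash
        # keeps colliding entries from all pushing in the same direction.
        bucket = _hash(i, j, f"bucket{self.seed}", len(self.params))
        sign = 1.0 if _hash(i, j, f"sign{self.seed}", 2) == 0 else -1.0
        return sign * self.params[bucket]

    def forward(self, x: List[float]) -> List[float]:
        # Matrix-vector product computed on the fly from the shared parameters.
        return [sum(self.weight(i, j) * x[j] for j in range(self.n_in))
                for i in range(self.n_out)]


layer = HashedLayer(n_in=100, n_out=50, n_buckets=500)
virtual_weights = layer.n_in * layer.n_out         # 5000 entries in the virtual matrix
compression = virtual_weights / len(layer.params)  # 10x fewer stored parameters
y = layer.forward([1.0] * 100)
```

During training, the gradient for each shared parameter would be the sum of the gradients of all virtual weights hashed to it; that part is omitted here.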
Kilian Weinberger, PhD
Stochastic Depth
Neural network compression has many intriguing applications, ranging from automatic speech recognition on mobile phones to embedded devices. However, the research community was still wondering about the phenomenon of parameter redundancy within neural networks. The success of network compression seemed to suggest that many parameters are redundant, so we were wondering if we could utilize this redundancy to our advantage. The hypothesis was that if redundancy is indeed beneficial to learning deep networks, maybe controlling it would allow us to learn even deeper neural networks. In our 2016 ECCV paper Deep Networks with Stochastic Depth, we came up with a mechanism to increase the redundancy in neural networks. In a nutshell, we forced the network to store similar concepts across neighboring layers by randomly dropping entire layers during the training process. With this method, we could show that by increasing the redundancy we were able to train networks with over 1000 layers and still improve generalization error.
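A minimal sketch of the stochastic depth idea for a stack of residual blocks: during training, each block survives with probability p_l and is otherwise replaced by its identity shortcut; at test time all blocks run, with each residual branch scaled by p_l. The linear decay of p_l down to 0.5 for the last block follows the rule commonly used with stochastic depth; the toy residual branch, the plain-list “tensors”, and the omission of the nonlinearity after each addition are simplifications for illustration.

```python
import random
from typing import Callable, List

Block = Callable[[List[float]], List[float]]


def survival_probabilities(num_blocks: int, p_last: float = 0.5) -> List[float]:
    # Linearly decaying survival probability: p_l = 1 - (l / L) * (1 - p_L)
    return [1.0 - (l / num_blocks) * (1.0 - p_last) for l in range(1, num_blocks + 1)]


def stochastic_depth_forward(x: List[float], blocks: List[Block], probs: List[float],
                             training: bool, rng: random.Random) -> List[float]:
    for f, p in zip(blocks, probs):
        if training:
            if rng.random() < p:  # block survives for this mini-batch
                x = [xi + fi for xi, fi in zip(x, f(x))]
            # otherwise the block is dropped and only the identity shortcut remains
        else:
            # At test time every block is active, scaled by its survival probability.
            x = [xi + p * fi for xi, fi in zip(x, f(x))]
    return x


probs = survival_probabilities(4)            # [0.875, 0.75, 0.625, 0.5]
add_one: Block = lambda v: [1.0 for _ in v]  # toy residual branch
eval_out = stochastic_depth_forward([0.0], [add_one] * 4, probs, training=False,
                                    rng=random.Random(0))
train_out = stochastic_depth_forward([0.0], [add_one] * 4, probs, training=True,
                                     rng=random.Random(0))
```

Each training pass sees a randomly shortened network, which is what pushes neighboring blocks to learn overlapping, redundant features.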
DenseNets
The success of stochastic depth was scientifically intriguing, but as a method, it was a strange algorithm. In some sense, we created extremely deep neural networks (with over 1000 layers) and then made them so ineffective that the network as a whole didn’t overfit. Somehow this seemed like the wrong approach. We started wondering if we could create an architecture that had similarly strong generalization properties but wasn’t as inefficient.
One hypothesis for why the increase in redundancy helped so much was that, by forcing layers throughout the network to extract similar features, the early low-level features remained available even to later layers. Maybe they were still useful when higher-level features were extracted. We therefore started experimenting with additional skip connections that would connect any layer to every subsequent layer. The idea was that in this way each layer has access to all the previously extracted features, which has three interesting advantages:
- It allows all layers to use all previously extracted features.
- The gradient flows directly from the loss function to every layer in the network.
- We can substantially reduce the number of parameters in the network.
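The dense connectivity pattern can be sketched with plain Python lists standing in for feature maps. Each layer receives the concatenation of the block input and every earlier layer’s output, then appends a fixed number of new features (the growth rate k). The toy layer below is a hypothetical stand-in for DenseNet’s BN-ReLU-Conv layers; only the wiring is meant to be faithful.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def dense_block(x: List[float], layers: List[Layer]) -> List[float]:
    """Each layer sees the concatenation of the block input and all earlier outputs."""
    features = [x]
    for layer in layers:
        concatenated = [c for f in features for c in f]  # all previous features
        features.append(layer(concatenated))              # add k new features
    return [c for f in features for c in f]


def make_toy_layer(growth_rate: int, input_sizes: List[int]) -> Layer:
    """Hypothetical stand-in for a BN-ReLU-Conv layer that emits `growth_rate` features."""
    def layer(inp: List[float]) -> List[float]:
        input_sizes.append(len(inp))  # record how many features this layer receives
        mean = sum(inp) / len(inp)
        return [mean] * growth_rate
    return layer


input_sizes: List[int] = []
out = dense_block([1.0, 2.0, 3.0], [make_toy_layer(2, input_sizes) for _ in range(4)])
# Layer l receives k0 + k * (l - 1) inputs: 3, 5, 7, 9 with k0 = 3 and growth rate k = 2.
```

Because earlier features are reused rather than recomputed, each layer only needs to produce a few new features, which is where the parameter savings come from.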
Our initial results with this architecture were very exciting. We could create much smaller networks than the previous state-of-the-art, ResNets, and even outperform stochastic depth. We refer to this architecture as DenseNets, and the corresponding publication was honored with the 2017 CVPR best paper award.
A comparison of the DenseNet and ResNet architecture on CIFAR-10. The DenseNet is more accurate and parameter efficient.
If previous networks could be interpreted as extracting a “state” that is modified and passed on from one layer to the next, DenseNets changed this setup so that each layer has access to all the “knowledge” extracted from all previous layers and adds its own output to this collective state. Instead of copying features from one layer to the next, over and over, the network can use its limited capacity to learn new features. Consequently, DenseNets are far more parameter efficient than previous networks and result in significantly more accurate predictions. For example, on the popular CIFAR-10 benchmark dataset, they almost halved the error rate of ResNets. Most impressively, out-of-the-box, they achieved new record performance on the three most prominent image classification data sets of the time: CIFAR-10, CIFAR-100, and ImageNet.
There may be other benefits from the additional skip connections. In 2017, Li et al. examined the loss surface around the local minima that neural networks converge to. They found that as networks became deeper, these surfaces became highly non-convex and chaotic, making it harder to find a local minimum that generalizes beyond the training data. Skip connections smooth out these surfaces, aiding the optimization process. The exact reasons are still a topic of open research.
Increasing agent concurrency without overwhelming agents
The platform we’ve built is centered on making agents highly effective and efficient while still empowering them to elevate the customer experience. All too often we see companies making painful tradeoffs between efficiency and quality. One of the most common ways this happens with digital/messaging interactions is that the number of conversations agents handle at a time (concurrency) is increased, but agents aren’t given the tools to handle those additional conversations.
In an effort to increase agent output, a relatively ‘easy’ lever to pull is raising the agent’s max concurrency from 2-3 chats to 5+ concurrent chats. In practice, however, making such a drastic change without the right safeguards in place can be counterproductive. While overall agent productivity may be higher, it often comes at the expense of customer satisfaction and agent burnout, both of which can lead to churn over time.
This is largely explained by the volatility of handling concurrent customers. There are definitely moments when handling 5+ chats concurrently is manageable and even comfortable for the agent (e.g. because several customers are idle or slow to respond). At other moments, all 5+ customers may demand attention for high-complexity concerns at exactly the same time. These spikes in demand overwhelm the agent and inevitably leave customers frustrated by slower responses and resolution.
The ASAPP approach to increasing concurrency addresses volatility in several ways.
Partial automation to minimize agent effort
The ASAPP Customer Experience Performance (CXP) platform blunts the burden of demand spikes that can occur at higher concurrencies by layering in partial automation. Agents can launch auto-pilot functionality at numerous points in the conversation, engaging the system to manage repetitive tasks—such as updating a customer’s billing address and scheduling a technician visit—for the agent.
With a growing number of partial automation opportunities, the system can balance the agent’s workload by ensuring that at any given time, at least one or two of the agent’s assigned chats require little to no attention. In a recent case study, the introduction of a single partial automation use case increased the agent’s speed on concurrent chats by more than 20 seconds.
Considering factors like agent experience, complexity and urgency of issues they’re already handling, and customer responsiveness, the CXP platform can dynamically set concurrency levels.
Cosima Travis
Real time ranking to help focus the agent
Taking into account numerous factors, including customer wait time, sentiment, issue severity, and lifetime value, the platform can help rank the urgency of each task on the agent’s plate. This alleviates the burden of deciding what to focus on next when agents are juggling a higher number of concurrent conversations.
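The post does not describe how the ranking is computed. One simple way to combine the factors it names is a weighted score; the sketch below is a hypothetical illustration, and every field name, normalization, and weight in it is an assumption rather than the platform’s model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Conversation:
    chat_id: str
    wait_seconds: float    # hypothetical: time since the customer's last unanswered message
    sentiment: float       # -1 (negative) .. 1 (positive)
    severity: int          # 1 (low) .. 5 (critical)
    lifetime_value: float  # normalized to 0 .. 1


# Hypothetical weights chosen for illustration only.
WEIGHTS = {"wait": 0.4, "sentiment": 0.25, "severity": 0.25, "ltv": 0.1}


def urgency(c: Conversation, max_wait: float = 300.0) -> float:
    wait = min(c.wait_seconds / max_wait, 1.0)  # capped at max_wait
    negativity = (1.0 - c.sentiment) / 2.0      # map -1..1 to 1..0
    severity = (c.severity - 1) / 4.0           # map 1..5 to 0..1
    return (WEIGHTS["wait"] * wait + WEIGHTS["sentiment"] * negativity
            + WEIGHTS["severity"] * severity + WEIGHTS["ltv"] * c.lifetime_value)


def rank(conversations: List[Conversation]) -> List[Conversation]:
    """Most urgent conversation first."""
    return sorted(conversations, key=urgency, reverse=True)


chats = [
    Conversation("a", wait_seconds=240, sentiment=-0.6, severity=4, lifetime_value=0.7),
    Conversation("b", wait_seconds=30, sentiment=0.5, severity=1, lifetime_value=0.9),
    Conversation("c", wait_seconds=120, sentiment=0.0, severity=3, lifetime_value=0.2),
]
order = [c.chat_id for c in rank(chats)]  # ["a", "c", "b"]
```

A linear score is easy to inspect and tune, which matters when agents need to trust why one conversation is surfaced above another.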
Dynamic complexity calculator to balance agent workload
We reject the idea of a fixed ‘max slot’ number per agent. Instead, we’re building a more dynamic system that doesn’t treat all chats as equal occupancy. It constantly evaluates how much of an agent’s attention each chat requires, and dynamically adjusts concurrency level for that agent. That helps ensure that customers are well-attended while the agent is not overworked.
At certain points, five chats might feel overwhelming, while at others they can feel quite manageable. Many factors play a role, including the customer’s intent, the complexity of that intent, the agent’s experience, the customer’s sentiment, the types of tools required to resolve the issue, and how close the issue is to resolution. These all get fed into a real-time occupancy model that dynamically manages the appropriate level of concurrency for each agent at any given time. This flexibility enables companies to drive efficiency in a way that keeps both customers and agents much happier.
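The idea of treating chats as unequal occupancy can be sketched as an attention budget: each chat gets an estimated load, and an agent accepts a new chat only while the total stays under a capacity that depends on experience. The load formula, capacity rule, and all constants below are hypothetical placeholders; the post describes a real-time occupancy model but not its internals.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chat:
    complexity: float          # 0 (simple intent) .. 1 (complex intent)
    negative_sentiment: float  # 0 (calm) .. 1 (very upset)
    near_resolution: bool
    automated: bool = False    # partial automation is currently handling it


def attention_load(chat: Chat) -> float:
    """Hypothetical estimate of the share of an agent's attention a chat needs."""
    if chat.automated:
        return 0.1
    load = 0.3 + 0.5 * chat.complexity + 0.2 * chat.negative_sentiment
    return load * 0.5 if chat.near_resolution else load


@dataclass
class Agent:
    experience_years: float
    chats: List[Chat] = field(default_factory=list)

    def capacity(self) -> float:
        # Hypothetical rule: more experienced agents can carry more load, up to a cap.
        return min(2.0 + 0.5 * self.experience_years, 4.0)

    def occupancy(self) -> float:
        return sum(attention_load(c) for c in self.chats)

    def can_accept(self, expected_load: float = 0.6) -> bool:
        return self.occupancy() + expected_load <= self.capacity()


# Two agents with the same number of chats but very different occupancy.
calm = Agent(2.0, [Chat(0.2, 0.0, True, automated=True) for _ in range(5)])
busy = Agent(2.0, [Chat(1.0, 1.0, False) for _ in range(5)])
```

Both agents hold five chats, but only the one whose chats are mostly automated has room for another, which mirrors the point that five chats can be either overwhelming or manageable.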
While our team takes an experimental, research-driven approach by testing new features frequently, we are uncompromising in our effort to preserve the highest quality interaction for the customer and agent. In our experience, the only way to maintain this quality while increasing agent throughput is with the help of AI-driven automation and adaptive UX features.
Bringing State of the Art Speech Transcription to CX
Automatic Speech Recognition (ASR) has been a cornerstone capability for voice contact centers for years, enabling agents to review what was just said or to review older calls to gather context, and facilitating a whole suite of quality assurance and analytics capabilities. Because ASAPP specializes in serving large enterprise customers with a plethora of data, we’re always looking for ways to improve the scalability and performance of our speech-to-text models; even small wins in accuracy, for example, can translate into huge gains for our customers. Accordingly, we’ve recently made a strategic switch from a hybrid ASR architecture to a more powerful end-to-end neural model. Since adopting this new model, we’ve been able to reduce the median latency of our model by over 50%, increase its accuracy, and lower the cost of running it.
To understand why we made this strategic technological shift and how we achieved these results, it helps to understand the status quo in real-time transcription for contact centers. Often a hybrid model is used, which combines separate, complementary components. The first component is an acoustic model that translates the raw audio signal into phonemes, the basic units of human speech. Unfortunately, the audio data alone can’t be used to construct a sentence of words, since phonemes can be combined in many different ways to construct words. To resolve these ambiguities, a lexicon maps phonemes to possible words, and a third component, a language model, picks the most likely phrase or sentence from several candidates. This type of pipeline of separate components has been used for decades.
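A toy version of the last two stages of the hybrid pipeline: given a phoneme sequence (assumed to come from the acoustic model), a lexicon proposes every word sequence that matches it, and a bigram language model picks the most likely one. The phoneme symbols (loosely ARPAbet), the lexicon, and the probabilities are made up; real decoders search a weighted lattice that combines acoustic and language model scores instead of enumerating whole sentences.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical pronunciation lexicon: phoneme sequence -> candidate words.
LEXICON: Dict[Tuple[str, ...], List[str]] = {
    ("p", "aa", "r", "k"): ["park"],
    ("ow", "v", "er"): ["over"],
    ("dh", "eh", "r"): ["there", "their"],  # homophones: audio alone can't decide
}

# Hypothetical bigram language model probabilities P(word | previous word).
BIGRAMS: Dict[Tuple[str, str], float] = {
    ("<s>", "park"): 0.1,
    ("park", "over"): 0.2,
    ("over", "there"): 0.3,
    ("over", "their"): 0.01,
}
UNSEEN = 1e-4  # crude fallback probability for unseen bigrams


def candidates(phonemes: List[str]) -> List[List[str]]:
    """Every word sequence whose pronunciations exactly cover `phonemes`."""
    if not phonemes:
        return [[]]
    results = []
    for end in range(1, len(phonemes) + 1):
        for word in LEXICON.get(tuple(phonemes[:end]), []):
            for rest in candidates(phonemes[end:]):
                results.append([word] + rest)
    return results


def lm_log_prob(words: List[str]) -> float:
    prev, total = "<s>", 0.0
    for w in words:
        total += math.log(BIGRAMS.get((prev, w), UNSEEN))
        prev = w
    return total


def decode(phonemes: List[str]) -> List[str]:
    options = candidates(phonemes)
    return max(options, key=lm_log_prob) if options else []


phones = "p aa r k ow v er dh eh r".split()
best = decode(phones)  # ["park", "over", "there"]
```

The lexicon alone yields both “park over there” and “park over their”; only the language model can break the tie, which is why the three components depend on each other even though they are trained separately.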
While hybrid architectures have been the standard, they have their limitations. First, because the acoustic model is trained separately from the language model, the combination is not quite as powerful as a single larger model. In our new end-to-end architecture, the encoder passes a richer representation to the decoder than just phonemes; moreover, the pieces of our architecture are all trained together, so they learn to work well together.
The nexus of GPU power, better modeling techniques, and bigger datasets enables better, faster models to serve our customers’ speech transcription needs
The separation of the model components in the legacy architecture has another drawback: it starts to see diminishing returns from more data. In contrast, our new integrated architecture requires more data, but it also continues to improve more dramatically as we train it on new data. In other words, this new model is better able to take advantage of the large amounts of data that we encounter working with enterprise customers. Some of this data is text without audio, or vice versa, and leveraging it allows us to further boost model performance without expensive transcription annotation by humans. It’s worth noting that the power of modern GPUs has catalyzed the success of these new techniques, enabling these larger jointly trained models to train on larger datasets in reasonable amounts of time.
Once trained, we can tally up all the metrics and see improvements across the board: the training process is simpler and easier to scale, and the model is twice as cheap and twice as fast*. The model also balances real-time demand with historical accuracy. It waits a few milliseconds to consider audio slightly into the future, giving it more context to predict the right words in virtually real time. Finally, the model contains a rescoring component that uses a larger window of audio to commit an even more accurate transcription to the historical record. Both our real-time and historical transcription capabilities are advancing the state of the art.
ASAPP E2E Performance By the Numbers
This was not an easy task. ASAPP has a world-class team that continuously looks for ways to improve our speech capabilities. The nexus of GPU power, better modeling techniques, and bigger datasets reduces the need for an external language model and enables the team to train the whole system end to end. These improvements translate into better, faster models that our customers can leverage for their speech transcription needs.
e ASAPP next generation E2E speech recognition system, and on scaling speech out to all of our customers.
Is your customer experience always getting smarter?
Pressure to deliver better customer experiences at lower cost is a real challenge. Balancing these two competing priorities is driving innovation, with the real paradigm shift coming from innovations in machine learning and AI.
Automation already plays a role in reducing costs, but when the sole focus is on efficiency, the customer experience often suffers. You risk frustrating customers, who may voice their dissatisfaction on social media or, worse, leave your brand. I think the future of customer experience is using AI to create a self-learning, digital-first business (one that gets smarter all the time) to address many challenging factors at once.
Machine learning is key. It deepens your understanding about customers to better identify where automation works best, and how to personalize interactions across channels. And it empowers agents with predictive knowledge to make them significantly more efficient and productive.
Automate away the routine, not the CX
Ideally, the goal of automation is to accelerate service delivery and resolution for customers, in ways that improve customer experience and lower costs. However, automation should not be exclusively about eliminating human involvement. Satisfying customers without live intervention needs to be part of it; but you also want labor-saving technology that makes live agents more efficient and effective.
Chatbots are great for automating simple tasks. But it would take an army of people to imagine and program all the possible scenarios to fully replicate the experience of speaking with a live agent. That’s why I think automation is most empowering with a system that continuously learns from agents the right thing to say and do in every situation. You can then apply those learnings to turn more and more interactions into predictive suggestions. And over time, those suggestions become actions that the system can automatically handle on behalf of an agent, freeing the agent for more complicated tasks that require a human touch.
In other words, automation becomes the brain of the process, not the process itself. Yes, it powers automated self-service. AND with predictive knowledge, it shortens the time it takes for agents to address issues, which means they can serve customers better, faster, and easier.
Grow smarter agents, smarter channels
Even if you automate away common tasks, many situations still need the human touch. You want those experiences to keep getting smarter as well. Empowering agents with predictive knowledge from machine learning ensures they can handle any situation as effectively as your best agent. Real-time conversational analytics and machine learning fuel proactive suggestions that make agents more efficient at handling complex conversations, so every agent can address specialized topics and scenarios.
Intelligent, labor-saving technology also helps solve the common customer complaint about fragmented experiences. People get frustrated when they can’t get help through their preferred digital channels, and even more annoyed if they need to switch channels mid-conversation and have to start all over again.
An integrated, self-learning platform enables seamless continuity across all service channels. Digital messaging, in particular, allows customers to pause a conversation (and even jump from chat to texting or social media to chat), while keeping a continuous thread going until they get everything they need. A smart system ensures they don’t have to start over, saving time and effort for both customers and agents.
At every company I’ve ever worked at, any time we delivered a great experience to a customer, their lifetime value went up. Delivering smarter, faster, more personal customer service is at the heart of every great customer experience.
Michael Lawder
Increase value with continuous learning
Enabling a customer service organization to continuously get smarter is one of the things I love most about AI. Over time, you keep learning new ways to automate for efficiency, new ways to help agents work more productively — and also new ways to extract value from a wealth of data.
An AI-driven system enables you to harness volumes of data from every conversation across every channel. It analyzes real-time voice transcription and text data from multi-channel digital messaging for increasingly valuable insights you can put to use. It also factors in voice of the customer data from across the enterprise for more informed decision-making.
Intelligent conversational analytics give you a competitive edge. You can better know your customers to provide more personalized support. You can equip agents to resolve issues faster. And you can ensure the knowledge of your best agents is available for everyone to use.
It’s the ultimate digital-first strategy, enabling companies to optimize customer service and CX in very focused ways that increase satisfaction and drive loyalty.
But wait, there’s more. Conversational insights also deliver value well beyond the contact center. Sales and marketing can gain substantially deeper understanding of customer concerns, buying patterns, and decision drivers. This enables the business to deliver more relevant and personalized predictive offers to increase revenue and marketing ROI.
Go big with transformative results
I’ve been in customer experience for over two decades, starting as a call center agent long ago, and only now am I seeing AI really deliver transformative results. ASAPP enables businesses to continuously get smarter, reinventing customer service in ways that translate into retention and brand loyalty to improve the bottom line.