
AutoSummary's 3R Framework Raises the Bar for Agent Call Notes

by Nirmal Mukhi
Sep 9 · 2 mins

Taking notes after a customer call is essential for ensuring that key details are recorded and ready for the next agent, yet it can be difficult to prioritize when agents have other tasks competing for their time. Could automated systems help bridge this gap while still delivering high-quality information? How should the data from customer interactions be organized so that it is useful and easily accessed in the future?

As we were developing AutoSummary, the ASAPP AI Service for automating call dispositioning, we asked our customers for input. Their survey responses made clear that agent notes need to include the Reason, Resolution, and Result of every conversation. This 3R Framework was key to success. Here’s a more detailed explanation:

  1. Reason – Agent notes need to focus on the reason for the customer interaction. This crucial bit of data, if accurately noted, immediately helps the next agent assisting the same customer with their issue. They’re able to dig into earlier details and resolve issues more quickly and efficiently while also impressing customers with their empathy and understanding of the situation.
  2. Resolution – It is essential to document the steps taken toward resolution if an agent needs to continue where another left off. When an agent clearly understands the problem and its context, it becomes much easier to follow a series of steps or flowcharts to resolve.
  3. Result – All interactions have a result that should be documented. This allows future customer service agents to see whether the problem was solved effectively, as well as any other important details.

ASAPP designed AutoSummary to automate dispositioning using the 3R framework as a foundation. And, depending on the needs of the customer, AutoSummary can also provide additional information, like an analytics-ready structured representation of the steps taken during a call. We created AutoSummary with two goals in mind:

  1. Maintain a high bar for what’s included: A summary is, essentially, a brief explanation of the main points discussed during an interaction. Although summaries lengthen as conversations continue, we maintain a limit so that agents can read the note and get caught up in 10-20 seconds. We also eliminate any data that could be superfluous or inaccurate. Our strict standards guarantee a quality output that is still concise.
  2. Engineer for model improvement: While AutoSummary creates excellent summaries, a fundamental component of all ASAPP’s AI services is the power to rapidly learn from continuous usage. We designed a feedback system and work with our customers so that any changes agents make to the generated notes are fed back into our models. Thus, we’re constantly learning from what the agents do – and over time, as the model improves, we receive fewer modifications.

We’re always learning what our customers want and translating that into effective product design. For us, it’s been great to see how successful these summaries are in terms of business metrics such as customer satisfaction, brand loyalty, and agent retention. We strongly believe that good disposition notes for all customer interactions improve every metric mentioned above–and more!

On average, our customers who use AutoSummary save over a minute of call handling time per interaction, which saves them millions of dollars a year. Who wouldn’t want those kinds of results?

AI Research
Measuring Success
R&D Innovations
Speech-to-text
Transcription

Utilizing Pre-trained Language Model for Speech Sentiment Analysis

by Suwon Shon
Sep 3 · 2 mins

The future of real-time speech sentiment analysis shows promise, offering new capabilities for organizations seeking to understand how customers feel about the quality of service they receive across customer service interactions. By understanding customer sentiment the moment customers express it, organizations are equipped with the intelligence to make nimble changes in service. To date, customer feedback surveys have fulfilled this purpose, but they come with some known limitations.

In addition to the low percentage of customers who fill out surveys, customer feedback surveys have a problem with bias: customers are more likely to respond to a survey when they have had either a strongly positive or strongly negative experience, heavily skewing results toward the extremes. With low response rates and biased results, it’s hard to argue that surveys provide a complete picture of the customer experience. Helping fill out this picture, future speech sentiment analysis capabilities offer another way for organizations to evaluate all of the interactions a customer has.

By collecting more information from every call (and not just a few polarized survey responses), speech sentiment could be a way to reduce bias and provide a more comprehensive measure of the customer experience. Future capabilities, which can measure real-time attitude and opinion regarding the service customers receive, can equip organizations with intelligence to make swift shifts in agent coaching or experience design. As more contact center agents work from home, access to live sentiment insight could be a great way for supervisors to support agents on a moment’s whim without needing to be in the same office.

Current methods in speech sentiment analysis are bringing us closer to realizing these real-time capabilities, but several research hurdles remain in acquiring the right datasets to train these models. Medhat et al. (2014) illustrate how most NLP sentiment data comes in the form of written text reviews, which is not the kind of data needed for speech analysis of conversational recordings. Even when audio data is available, it often consists of limited scripted conversations read by a single actor, or monologues, which is insufficient for sentiment analysis on natural conversations.

As we work to advance the state of the art in speech sentiment analysis, new ASAPP research presented at Interspeech 2021 is making progress in lowering these barriers.

The Conventional Approach

While ASAPP’s automatic speech recognition (ASR) system is a leader in speech-to-text performance, conventional methods of using cascading ASR and text-based natural language processing (NLP) sentiment analysis systems have several drawbacks.

Large language models trained on text-based examples for sentiment analysis show a large drop in accuracy when applied to transcribed speech. Why? We speak differently than how we write. Spoken language and written language lie in different domains, so the language model trained on written language (e.g. BERT was trained using BooksCorpus and Wikipedia) does not perform well on spoken language input.

Figure 1. Examples that illustrate the differences between chat and voice.

Furthermore, abstract cues such as sarcasm, disparagement, doubt, suspicion, yelling, or intonation add complexity to speech sentiment recognition on top of the already challenging task of text-based sentiment analysis. Cascaded systems also lose the rich acoustic and prosodic information that is critical to understanding spoken language, such as changes in pitch, intensity, raspiness of voice, and speed.

Speech annotation for training sentiment analysis models has been offered as a way to overcome this obstacle in controlled environments [Chen et al., 2020], but it is costly to collect. While publicly available text can be found virtually everywhere, from social media to English literature, acquiring conversational speech with the proper annotations is harder given its limited open-source availability. And, unlike annotating text for sentiment, annotating speech requires extra time spent listening to the audio.

ASAPP Research: Leveraging Pre-trained Language Model for Speech Sentiment Analysis

Leveraging pre-trained neural networks is a popular way to reduce the annotation resources needed for downstream tasks. In NLP, great advances have been made by pre-training task-agnostic language models without any supervision, e.g. BERT. Similarly, in spoken language understanding (SLU), pre-training approaches have been proposed in combination with ASR or acoustic classification modules to improve SLU performance under limited resources.

These pre-training approaches focus only on how to pre-train the acoustic model effectively, on the assumption that if a model is pre-trained to recognize words or phonemes, fine-tuning on downstream tasks will improve. However, they do not consider transferring information from a language model already trained on large amounts of written text into the conversational domain.

We propose the use of powerful pre-trained language models to transfer more abstract knowledge from the written text-domain to speech sentiment analysis. Specifically, we leverage pre-trained and fine-tuned BERT models to generate pseudo labels to train a model for the end-to-end (E2E) speech sentiment analysis system in a semi-supervised way.

Figure 2. Proposed speech sentiment analysis system.

For the E2E sentiment analysis system, a pre-trained ASR encoder is needed to prevent overfitting and to encode speech context efficiently. To transfer knowledge from the text domain, we generate pseudo sentiment labels from either ASR transcripts or ground-truth human transcripts. These pseudo labels are used to pre-train the sentiment classifier in the semi-supervised training phase. In the fine-tuning phase, the sentiment classifier can be trained on any speech sentiment dataset we want to use; a dataset matched to the target domain would give the best result in this phase. We verified our proposed approach using the large-scale Switchboard sentiment dataset [Chen et al., 2020].
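To make the recipe concrete, here is a minimal sketch of the pseudo-label flow in Python. It is illustrative only: the binary label set, the off-the-shelf sentiment pipeline, and the speech classifier interface are stand-in assumptions, not the actual system described in the paper.

```python
# Minimal sketch of the pseudo-label recipe described above (not ASAPP's actual code).
# Assumptions: `pairs` is a list of (audio_features, transcript_text) tuples, and
# `model` is a hypothetical speech sentiment classifier built on a pre-trained ASR encoder.
import torch
from transformers import pipeline

# 1) A text sentiment model (standing in for the fine-tuned BERT) labels the transcripts.
text_sentiment = pipeline("sentiment-analysis")

def make_pseudo_labels(pairs):
    labels = []
    for _, text in pairs:
        pred = text_sentiment(text)[0]            # e.g. {"label": "POSITIVE", "score": 0.98}
        labels.append(1 if pred["label"] == "POSITIVE" else 0)
    return torch.tensor(labels)

# 2) Semi-supervised phase: pre-train the speech classifier on (audio, pseudo label) pairs.
def pretrain_on_pseudo_labels(model, pairs, pseudo_labels, optimizer):
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for (audio, _), label in zip(pairs, pseudo_labels):
        logits = model(audio)                     # speech -> sentiment logits via ASR encoder
        loss = loss_fn(logits.unsqueeze(0), label.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# 3) Fine-tuning phase: run the same loop on a human-annotated speech sentiment dataset
#    (e.g. Switchboard sentiment), ideally matched to the target domain.
```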

Transfer learning between spoken and written language domains was not actively addressed before. This work found that pseudo sentiment labels obtained from a pre-trained model trained in the written text-domain can transfer the general sentiment knowledge into the spoken language domain using a semi-supervised training framework. This means that we can train the network more efficiently with less human supervision.

Suwon Shon, PhD

Why this matters

Transfer learning between spoken and written language domains was not actively addressed before. This work found that pseudo sentiment labels obtained from a pre-trained model trained in the written text-domain can transfer the general sentiment knowledge into the spoken language domain using a semi-supervised training framework.

Figure 3. Semi-supervised training efficiency on evaluation set. Note that baseline used all of SWBD-train set (86h)

This means that we can train the network more efficiently with less human supervision. In the experiment in Figure 3, our pseudo-label-based semi-supervised training approach saves about 65% of the human sentiment annotation (30h vs. 86h). Conversely, it also means we can boost sentiment analysis performance when we use the same amount of sentiment-annotated training data: the best system showed about a 20% relative improvement in unweighted F1 score (57.63%) on the evaluation set compared to the baseline (48.16%).

Table 1. Semi-supervised approach on the E2E speech sentiment analysis system. You can find more detailed evaluation results in the preprint.

Lastly, we observed that using ASR transcripts for pseudo labels gives a slight performance degradation, but still performs better than the baseline. This result means we can use huge amounts of unlabeled speech in the semi-supervised training framework without any human supervision.

Read the Paper

AI Research
R&D Innovations
Speech-to-text
Transcription
Articles

Multi-mode ASR: Increasing Robustness with Dynamic Future Contexts

by Kwangyoun Kim
Aug 27 · 2 mins

Automatic Speech Recognition (ASR), as its name indicates, is a technology tasked with deriving text transcriptions from auditory speech. As such, ASR is the backbone that provides real-time transcriptions for downstream tasks, including critical machine learning (ML) and natural language processing (NLP) tools that help human agents reach optimal performance. Downstream ML/NLP examples include auto-suggest features for an agent based on what a customer is saying during a call, creating post-call summary notes from what was said, or intent classification, i.e. knowing what a customer is calling about to pair them with the most appropriate agent. Crucial to the success of these AI systems is the accuracy of speech transcriptions. Only by accurately detecting what a customer or agent is saying in real time can AI systems provide insights or automate tasks accordingly.

A key way to improve this accuracy is to provide more surrounding speech information to the ASR model. Rather than having an ASR model predict what a speaker is saying based only on what has been said before, also using what will be said as future context lets the model better predict and detect the difference between someone who said “I’m going to the cinema today [to watch the new James Bond]” versus “I’m going to the cinema to date… [James Bond].” When we predict words, using speech frames from future utterances gives more context. And by utilizing more context, some of the errors that arise from relying on past context alone can be fixed.


The increased accuracy of this longer-context approach with future speech frames comes with a trade-off in latency and speed, since the model must wait for and compute over future frames. Latency constraints vary across services and applications, and models are usually trained for a specific latency requirement; an ASR model’s accuracy suffers if it is used under a latency condition different from the one it was trained for. Meeting varied scenarios or service requirements with this approach thus means training several different models separately, which makes development and maintenance difficult and creates a scalability issue.

At ASAPP, we require the highest accuracy and lowest latency to achieve true real-time insights and automation. However, given our diverse product offerings with different latency requirements, we also need to address the scalability issue efficiently. To overcome this challenge, research accepted at Interspeech 2021 takes a new approach: an ASR model that dynamically adjusts its latency to different constraints without compromising accuracy, which we refer to as Multi-mode ASR.

The ASAPP Research: A Multi-mode Transformer Transducer with Stochastic Future Context

Our work expands upon previous research on dual-mode ASR (Yu et al., 2020). A Transformer model has the same structure for both the full-context model and the streaming model: the full-context model uses unlimited future context and the streaming model uses limited future context (e.g., 0 or 1 future speech frames per neural layer, where a frame corresponds to 10ms of speech and we use 12 layers). The only difference is that self-attention controls how many future frames the model can access by masking the frames. Therefore, it is possible to operate the same model in full-context and streaming mode. Additionally, we can use “knowledge distillation” when training the streaming mode. That is, we train the streaming mode not only on its original objective, but also to produce outputs similar to the ones produced by the full-context mode. This way, we can further bridge the gap between streaming and full-context modes. This method markedly reduces the accuracy drop and alignment delay of streaming ASR. We were directly motivated by this method and have been working to extend it to multiple modes.

Our multi-mode ASR is similar to dual-mode but it is broader and more general. We didn’t limit the streaming mode to a single configuration using only one future context size, but defined it as using a stochastic future context size instead. As described in Figure 1 below, dual-mode ASR is trained on a predefined pair consisting of the full context mode and the zero context (streaming) mode. In contrast, multi-mode ASR trains a model using multiple pairs of the full context mode and the streaming mode with a future context size of C where C is sampled from a stochastic distribution.

Figure 1: A figure of different modes. The black circle indicates the current output step, and bold-lined circles are contexts used for the current output. (a) is a full-context mode, and (b) presents a streaming mode with a future context size of 1. (c) describes our method, which randomly selects the future context (shown with dotted circles and arrows).

Since C is selected from a distribution for every single minibatch during training, a single model is trained on various future context conditions.
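As a rough illustration of that idea, the sketch below samples a future context size per minibatch and builds the corresponding self-attention mask. It is a simplified, assumed implementation: the sampling choices are illustrative, a single shared mask stands in for the per-layer masks, and the knowledge-distillation pairing with the full-context mode is omitted.

```python
# Minimal sketch (not ASAPP's implementation) of sampling a future context size per
# minibatch and building a streaming self-attention mask for a Transformer encoder.
import random
import torch

def sample_future_context(choices=(0, 1, 2, 4, 8, None)):
    """Pick a future context size C for this minibatch; None stands for full context."""
    return random.choice(choices)

def attention_mask(num_frames: int, future_context):
    """Boolean mask where entry (i, j) is True if frame i may attend to frame j."""
    idx = torch.arange(num_frames)
    if future_context is None:                      # full-context mode
        return torch.ones(num_frames, num_frames, dtype=torch.bool)
    # Streaming mode: attend to all past frames and at most `future_context` future frames.
    return idx.unsqueeze(1) + future_context >= idx.unsqueeze(0)

# Per training step: one sampled C is applied to the whole minibatch.
C = sample_future_context()
mask = attention_mask(num_frames=6, future_context=C)
print(C, mask.int())
```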

Table 1: Streaming WER on LibriSpeech testset with different future context sizes applied on a single model during inference. c is a future context size at training. The gray field indicates a mismatched condition when the future context sizes used for training and recognition are different.

We say that evaluation conditions are matched when the training context size and the inference context size are the same, and that they are mismatched otherwise. The results in Table 1 show that a streaming model only works well in the matched condition, i.e., when it is evaluated with the same future context size it was trained on. Although the results for the dual-mode trained model are better than those of the model trained alone, thanks to knowledge distillation, it also does not work well in the mismatched condition. In contrast, our proposed multi-mode trained model operates reliably across conditions, because the mismatched condition is eliminated by using a stochastic future context. Looking at the detailed results for each context condition, training with a stochastic future context also appears to bring a regularization effect to a single model.

Rather than developing and maintaining multiple ASR models that work under varying levels of time constraints or conditions, we’ve introduced a single multi-mode model that can dynamically adjust to various environments and scenarios.

Kwangyoun Kim

Why this matters

ASR is used in services with various environments and scenarios. To create downstream ML and NLP tasks that produce results within seconds and work well with human workflows, ASAPP’s ASR model must similarly operate in milliseconds based on the situation. Rather than developing and maintaining multiple ASR models that work under varying levels of time constraints or conditions, we’ve introduced a single multi-mode model that can dynamically adjust to various environments and scenarios.

By exposing a single model to various conditions, one model gains the ability to change the amount of future context it uses to meet the latency requirements of a particular application. This makes it easier and more resource-efficient to cover many different scenarios. Going further, if latency rises due to unpredictable load in production, the configuration can be changed easily on the fly, significantly increasing usability with minimal accuracy degradation. Algorithms designed to handle multiple scenarios usually suffer sub-optimal performance compared to a model optimized for one condition, but multi-mode ASR shows that it can cover multiple conditions without such problems.

What’s next for us at ASAPP

The paper about this study will be presented at Interspeech 2021 (Wed, Sep 1st, 11:00 ~ 13:00, GMT +2), and the method and detailed results are described there. We believe this research topic is a promising direction for effectively supporting various applications, services, and customers. Research is also underway to extend this method to train a general model by combining it with pre-training methods. We will continue to focus on scalability as an important factor in model training and deployment.

Read the Paper

AI Research
R&D Innovations
Speech-to-text
Transcription
Articles

Introducing CLIP: A Dataset to Improve Continuity of Patient Care with Unsupervised NLP

by James Mullenbach
Jul 28 · 2 mins

Continuity of care is crucial to ensuring positive health outcomes for patients, especially in the transition from acute hospital care to out-patient primary care. However, information sharing across these settings is often imperfect.

Hospital discharge notes alone easily top thousands of words and are structured with billing and compliance in mind rather than the reader, making poring over these documents for important pending actions especially difficult. Compounding this issue, primary care physicians (PCPs) are already short on time—receiving dozens of emails, phone calls, imaging reports, and lab reports per day (Baron 2010). Lost in this sea of hospital notes and time constraints are important actions for improving patient care. This can cause errors and complications for both patients and primary care physicians.

Thus, in order to improve the continuity of patient care, we are releasing one of the largest annotated datasets for clinical NLP. Our dataset, which we call CLIP, for CLInical Follow-uP, makes the task of action item extraction tractable, by enabling us to train machine learning models to select the sentences in a document that contain action items.

By leveraging modern methods in unsupervised NLP, we can automatically highlight action items from hospital discharge notes and action items for primary care physicians–saving them time and reducing the risk that they miss critical information.

James Mullenbach

We view the automatic extraction of required follow-up action items from hospital discharge notes as a way to enable more efficient note review and performance for caregivers. In alignment with the ASAPP mission to augment human activity by advancing AI, this dataset and task provide an exciting test ground for unsupervised learning in NLP. By automatically surfacing relevant historical data to improve communication, this work represents another key way ASAPP is improving human augmentation with AI. In our ACL 2021-accepted paper, we demonstrate this with a new algorithm.

The CLIP Dataset

Our dataset is built upon MIMIC-III (Johnson et al., 2016), a large, de-identified, and open-access dataset from the Beth Israel Deaconess Medical Center in Boston, which is the foundation of much fruitful work in clinical machine learning and NLP. From this dataset, with the help of a team of physicians, we labeled each sentence in 718 full discharge summaries, specifying whether the sentence contained a follow-up action item. We also annotated 7 types to further classify action items by the type of action needed; for example, scheduling an appointment, following a new medication prescription, or reviewing pending laboratory results. This dataset, comprising over 100,000 annotated sentences, is one of the largest open-access annotated clinical NLP datasets to our knowledge, and we hope it can spur further research in this area.

How well does machine learning accomplish this task? In our paper we approach the task as sentence classification, individually labeling each sentence in a document with its follow-up types, or “No followup”. We evaluated several common machine learning baselines on the task, adding some tweaks to better suit it, such as including more than one sentence as input. We find that the best models, based on the popular transformer-based model BERT, provide a 30% improvement in F1 score relative to the linear model baseline. The best models achieve an F1 score around 0.87, close to the human performance benchmark of 0.93.

Model pre-training for healthcare applications

We found that an important factor in developing effective BERT-based models was pre-training them on appropriate data. Pre-training exposes models to large amounts of unlabeled data, and serves as a way for large neural network models to learn how to represent the general features of language, like proper word ordering and which words often appear in similar contexts. Models that were pre-trained only on generic data from books or the web may not have enough knowledge on how language is used specifically in healthcare settings. We found that BERT models pre-trained on MIMIC-III discharge notes outperformed the general-purpose BERT models.

For clinical data, we may want to take this focused pre-training idea a step further. Pre-training is often the most costly step of model development due to the large amount of data used. But can we reduce the amount of data needed by selecting data that is highly relevant to our end task? In healthcare settings, with private data and limited computational resources, this would make automating action item extraction more accessible. In our paper, we describe a method we call task-targeted pre-training (TTP) that builds datasets for pre-training by selecting sentences that look the most like those in our annotated data that contain action items. We find that it’s possible, and maybe even advantageous, to select data for pre-training in this way, saving time and computational resources while maintaining model performance.
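As a rough sketch of what such a selection step could look like, the snippet below scores unlabeled sentences with a lightweight classifier and keeps the most action-item-like ones for pre-training. The TF-IDF-plus-logistic-regression scorer and the `keep_fraction` parameter are illustrative assumptions, not the exact criterion used in the paper.

```python
# Illustrative sketch of selecting sentences for task-targeted pre-training (TTP).
# The selection criterion here is an assumption for illustration; see the paper for
# the method actually used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def select_pretraining_sentences(labeled_sents, labels, unlabeled_sents, keep_fraction=0.2):
    """Score unlabeled sentences by how much they resemble annotated action-item
    sentences, and keep the top-scoring fraction for pre-training."""
    vec = TfidfVectorizer(min_df=2)
    X_labeled = vec.fit_transform(labeled_sents)
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, labels)  # 1 = contains action item

    X_unlabeled = vec.transform(unlabeled_sents)
    scores = clf.predict_proba(X_unlabeled)[:, 1]
    ranked = sorted(zip(scores, unlabeled_sents), reverse=True)
    n_keep = int(len(ranked) * keep_fraction)
    return [sent for _, sent in ranked[:n_keep]]

# The returned sentences would then be used for masked-language-model pre-training of a
# BERT-style model before fine-tuning on the annotated CLIP data.
```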

Improving physician performance and reducing cognitive load

Ultimately, our end goal is to make physicians’ jobs easier by reducing the administrative burden of reading long hospital notes, and to bring their time and focus back where it belongs: on the patient. Our methods can condense notes down to what a PCP really needs to know, reducing note size by at least 80% while keeping important action items readily available. This reduction in “information overload” can reduce physicians’ likelihood of missing important information (Singh et al., 2013), improving their accuracy and the well-being of their patients. Through a simple user interface, these models could enable a telemedicine professional to more quickly and effectively aid a patient who recently visited the hospital.

Read more and access the data

Our goal with open sourcing CLIP is to enable lots more future work in this area of summarizing clinical notes and reducing physician workload, with our approach serving as a first step. We anticipate that further efforts to incorporate the full document into model decisions, exploit sentence-level label dependencies, or inject domain knowledge will be fruitful. To learn more, visit our poster session at ACL occurring Monday, Aug. 2, 11:00 a.m.—1:00 p.m. ET.

Paper
CLIP Dataset
Code Repository

Citations

Richard J. Baron. 2010. What’s keeping us so busy in primary care? A snapshot from one practice. The New England Journal of Medicine, 362(17):1632–1636.
Hardeep Singh, Christiane Spitzmueller, Nancy J. Petersen, Mona K. Sawhney, and Dean F. Sittig. 2013. Information overload and missed test results in electronic health record-based settings. JAMA Internal Medicine, 173(8):702–704.
Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad M. Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data.

AI Research
Automation
Machine Learning
R&D Innovations
Articles

Why I joined ASAPP: Taking AI to new levels in enterprise solutions

by Ryan McDonald
Jul 12 · 2 mins

I have spent the past 20 years working in natural language processing and machine learning. My first project involved automatically summarizing news for mobile phones. The system was sophisticated for its time, but it amounted to a number of brittle heuristics and rules. Fast forward two decades and techniques in natural language processing and machine learning have become so powerful that we use them every day—often without realizing it.

After finishing my studies, I spent the bulk of these 20 years at Google Research. I was amazed at how machine learning went from a promising tool to one that dominates almost every consumer service. At first, progress was slow: a classifier here or there in some peripheral system. Then progress came faster, and machine learning became a first-class citizen. Finally, end-to-end learning started to replace whole ecosystems that a mere 10 years before were largely based on graphs, simple statistics, and rules-based systems.

After working almost exclusively on consumer-facing technologies, I started shifting my interests toward enterprise. So many interesting challenges arise in this space: the complexity of needs, the heterogeneity of data, and often the lack of the clean, large-scale training sets that are critical to machine learning and natural language processing. However, there were properties that made enterprise tractable. While the complexity of tasks was high, the set of tasks any specific enterprise engaged in was finite and manageable. The users of enterprise technology are often domain experts and can be trained. Most importantly, these consumers of enterprise technology were excited to interact with artificial intelligence in new ways—if it could deliver on its promise to improve the quality and efficiency of their efforts.

This led me to ASAPP.

I am firm in my belief that to take enterprise AI to the next level, a holistic approach is required. Companies must focus on challenges with systemic inefficiencies and develop solutions that combine domain expertise, machine learning, data science, and user experience (UX) in order to elevate the work of practitioners. The goal is to improve and augment sub-tasks that computers can solve with high precision in order to enable experts to spend more time on more complex tasks. The core mission of ASAPP is exactly in line with this, specifically directed towards customer service, sales, and support.

To take enterprise AI—and customer experience—to the next level a holistic approach is required.

Ryan McDonald, PhD

The customer experience is ripe for AI to elevate to the next level. Everyone has experienced bad customer service, but also amazing customer service. How do we understand the choices that the best agents make? How do we recognize opportunities where AI can automate routine and monotonous tasks? Can AI help automate non-deterministic tasks? How can AI improve the agent experience, leading to less burnout, lower turnover, and higher job satisfaction? This is in an industry that employs three million people in the United States alone but suffers from an average of 40 percent attrition—one of the highest rates of any industry.

ASAPP is focusing its efforts singularly on the Customer Experience and there are enough challenges here to last a lifetime. But, ASAPP also recognizes that this is the first major step on a longer journey. This is evident in the amazing research group that ASAPP has put together. They are not just AI in name, but also in practice. Our research group consists of machine learning and language technology leaders, many of whom publish multiple times a year. We also have some of the best advisors in the industry from universities like Cornell and MIT. This excites me about ASAPP. It is the perfect combination of challenges and commitment to advanced research that is needed in order to significantly move the needle in customer experience. I’m excited for our team and this journey.

Automation
Contact Center
Customer Experience
Measuring Success
Articles

To realize Forrester’s vision of conversational intelligence, a human focus is needed.

by Macario Namie
Jun 16 · 2 mins

For the CX industry, success has always relied on the ability to deliver high-quality customer interactions at scale. The availability of omnichannel opened up new, convenient avenues for customers to engage with organizations, yet it also increased the volume of interactions needing resolution. But thanks to modern advances in AI research, conversational and speech intelligence is having a renaissance moment, improving CX revenue and efficiency at this rising scale.

As proof of this trend, Forrester Research released their new Q2 2021 report, “Now Tech: Conversation Intelligence” which names ASAPP among the leading conversation intelligence providers. The report guides forward-looking organizations to harness conversational intelligence in three key areas:

  • Delivering CX insights at scale: solutions which help organizations understand the voice of the customer and the agent at every interaction.
  • Improving CX behavior at scale: solutions which monitor and guide agents on what to say, actions to take, or areas to coach an agent.
  • Accelerating revenue: solutions which give sales teams the insights they need to drive a greater volume of better leads and to ensure they are acted upon.

In looking at these areas, it’s no surprise that organizations like American Airlines and DISH are turning to ASAPP for real-time insights that empower customer service and sales agents to achieve peak performance. At ASAPP, we believe that intelligence is best deployed at the point where it matters: in real time, where the interactions occur. It’s why we’re committed to advancing true AI that is redefining automation in the contact center to triple throughput, increase digital adoption, and lower operational costs.

Real-time insights from a continuously learning system improve a company’s ability to deliver highly personalized customer experiences—and substantially improve efficiency at the same time.

Macario Namie

This real-time conversational intelligence replaces yesterday’s rules-based systems by capitalizing on the insights of your agents and customers. A rules-based system, whether it feeds chatbots or humans, only captures a fraction of the available knowledge and doesn’t take advantage of the lessons learned from today’s data pools. Rigid rules-based systems aren’t flexible or generalizable enough for diverse customer needs, and no rules-based system will deliver customized, real-time intelligence that equips agents with what to say the moment it matters.

It’s time for us to harness conversational intelligence that applies the knowledge of agents at scale. CX leaders who utilize a combination of conversational intelligence and automation understand how this leads to better customer voice and digital experiences that increase Customer Satisfaction (CSAT) and Net Promoter Scores (NPS). It’s why organizations that deploy ASAPP see an exponential improvement in performance that delivers measurable results in less than 60 days.

That’s all to say that we’re proud to see further recognition of ASAPP’s value in conversational intelligence. The Forrester Research report builds on our distinction as a “Cool Vendor” by Gartner. How are you thinking of using conversation intelligence at your organization?

Read the full report by Forrester Research for more details.
See the press release here.

Concurrency
Digital Engagement
Measuring Success
R&D Innovations
Articles

Why AHT isn’t the right measure in an asynchronous and multi-channel world

by Heather Reed
Jun 4 · 2 mins

Operations teams have been using agent handle time (AHT) to measure agent efficiency, manage workforce, and plan operation budgets for decades. However, customers have been increasingly demonstrating they’d prefer to communicate asynchronously—meaning they can interact with agents when it is convenient for them, taking a pause in the conversation and seamlessly resuming minutes or hours later, as they do when multitasking, handling interruptions, and messaging with family and friends.

In this new asynchronous environment, AHT is an inappropriate measure of how long it takes agents to handle a customer’s issue: it overstates the amount of time an agent spends working with a customer. Rather, we consider agent throughput as a better measure of agent efficiency. Throughput is the number of issues an agent handles over some period of time (e.g. 10 issues per hour) and is a better metric for operations planning.

One common strategy for increasing throughput is to merely give agents more issues to handle at once, which we call concurrency. However, attempts to increase throughput by simply increasing an agent’s concurrency without giving them better tools to handle multiple issues at once are short-sighted. Issues that escalate to agents are complex and require significant cognitive load, as “easier” issues have typically already been automated. Therefore naively increasing agent concurrency without cognitive load consideration often results in adverse effects on agent throughput, frustrated customers who want faster response times, and agents who burn out quickly.

The ASAPP solution to this is to use an AI-powered flexible concurrency model. A machine learning model measures and forecasts the cognitive demand on agents and dynamically increases concurrency in an effective way. This model considers several factors including customer behaviors, the complexities of issues, and expected work required to resolve the issue to determine an agent’s concurrency capacity at a given point in time.

We’re able to increase throughput by reducing demands on the agent’s time and cognitive load, resulting in agents more efficiently handling conversations, while elevating the customer experience.

Measuring throughput

In equation form, throughput is the inverse of agent handle time (AHT) multiplied by the number of issues an agent can concurrently handle at once.

Throughput = Concurrency × (1 / AHT)

For example, if it takes an agent half an hour on average to handle an issue, and she handles two issues concurrently, then her throughput would be 4 issues per hour.

Throughput = 2 issues × (1 / 0.5 hours) = 4 issues per hour
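For readers who prefer code, here is the same calculation as a quick sanity check (purely illustrative; the function name is ours, not an ASAPP API):

```python
# A quick check of the throughput formula above.
def throughput(aht_hours: float, concurrency: float) -> float:
    """Issues handled per hour = concurrency * (1 / average handle time)."""
    return concurrency * (1.0 / aht_hours)

print(throughput(aht_hours=0.5, concurrency=2))  # -> 4.0 issues per hour
```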

The equation shows two obvious ways to increase throughput:

  1. Reduce the time it takes to handle each individual issue (reduce the AHT); and
  2. Increase the number of issues an agent can concurrently handle.

At ASAPP, we think about these two approaches to increasing throughput, particularly as customers move to adopt more asynchronous communication.

AHT as a metric is only applicable when the agent handles one contact at a time—and it’s completed end-to-end in one session. It doesn’t take into account concurrent digital interactions, nor asynchronous interactions.

Heather Reed, PhD

Reducing AHT

The first piece of the throughput-maximization problem entails identifying, quantifying, and reducing the time and effort required for agents to perform the tasks to solve a customer issue.
We think of the total work performed by an agent as a function of both the cognitive load (CL) and the time required to perform a task. This definition of work is analogous to the definition of work in physics, where Work = (Load applied to an object) × (Distance to move the object).

The agents’ cognitive load during conversations (visualized by the height of the black curve and the intensity of the green bar) is affected by:

  • crafting messages to the customer;
  • looking up external information for the customer;
  • performing work on behalf of the customer;
  • context switching among multiple customers; etc.

The total work performed is the area under the curve, which can be reduced by decreasing the effort (CL) and time to perform tasks. We can compute the average across the interaction—a flat line—and in a synchronous environment, that can be very accurate.
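Stated as a formula (restating the description above, where CL(t) denotes the cognitive load at time t and T is the handle time; the notation is ours, added for illustration):

$$W = \int_0^{T} \mathrm{CL}(t)\,dt \;\approx\; \overline{\mathrm{CL}} \times T$$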

The cognitive load varies throughout the duration of the issue, as shown by the height of the curve and the intensity of the green color. The total work performed is the multiplication of the cognitive load and the time to perform the task

ASAPP automation and agent augmentation features are designed to reduce both handling time and the agents’ cognitive load—the amount of energy it takes to solve a customer’s problem or upsell a prospect. For example, Autosuggest provides message recommendations that contain relevant customer information, saving agents the time and effort they would need to spend looking up information about customers (e.g. their bill amount) as well as the time spent physically crafting the message.

For synchronous conversations, that means each call is less tiring. For asynchronous conversations, that means agents can handle an increasing number of issues without corresponding increases in stress.

In some cases, we can completely eliminate the cognitive load from a part of a conversation. Our auto-pilot feature enables automation of entire portions of the interaction—for example, collecting a customer’s device information, freeing up agents’ attention.

Augmentation and automation features reduce time and CL to perform tasks during an issue

Using multiple augmentation features during an issue reduces overall AHT as well as the total work performed.

When the customer is asynchronous, the majority of the agent’s time would be spent waiting for the customer to respond. This is not an effective use of the agent’s time, which brings us to the second piece of the throughput-maximization problem.

Increasing concurrency

We can improve agent throughput by increasing concurrency. Unfortunately, this is more complex than simply increasing the number of issues assigned to an agent at once. Issues that escalate to agents are complex and emotive, as customers typically get basic needs met through self-service or automation. If an agent’s concurrency is increased without forecasting workload, the change will actually have an adverse effect on the AHT of individual issues.

If increasing concurrency results in increased AHT, then the impact on overall throughput can be negative. What’s more, customers can become frustrated at the lack of response from the agent and bounce to other support channels, or worse, consider switching providers; and agents may feel overwhelmed and risk burning out or churning.

Flexible concurrency

We can alleviate this problem with flexible concurrency: an AI-driven approach to this problem. A machine learning model keeps track of the work the agent is doing, and dynamically increases an agent’s concurrency to keep the cognitive load manageable.

Combined with ASAPP augmentation features, our flexible concurrency model can safely increase an agent’s concurrency, enabling higher throughput and increased agent efficiency.

A visual comparison of agent throughput without (top) and with (bottom) ASAPP augmentation and flexible concurrency AI models. With ASAPP, the agent is able to handle several more customer issues concurrently because work required to resolve each issue is reduced.

In summary

As customers increasingly prefer to interact asynchronously, AHT becomes less appropriate for operations planning. Throughput (the number of issues within a time period) is a better metric to measure agent efficiency and manage workforce and operations budgets. ASAPP AI-driven agent augmentation paired with a flexible concurrency model enables our customers to safely increase agent throughput while maintaining manageable agent workload—and still deliver an exceptional customer experience.

AI Native®
Customer Experience
Articles

Gartner Recognizes ASAPP for Continuous Intelligence in CX

by Macario Namie
Jun 2 · 2 mins

Every year Gartner scans the horizons for companies who offer technology or services that are innovative, impactful, or intriguing. Gartner analysts might ask themselves: What’s something that customers could not do before? What technical innovation is focused on producing business impact? Or what new technology or service appears to be addressing systemic challenges?

This year’s Gartner report naming ASAPP as a “Cool Vendor” affirms our efforts at the intersection of artificial intelligence (AI) and customer experience (CX). We entered this $600 billion industry because we wanted to create real change—building machine learning products that augment and automate the world’s workflows—and address the most costly and painful parts of CX that are largely ignored today.

Despite billions of dollars spent on technology designed to keep customers away from speaking with agents—starting with IVRs a few decades ago and most recently, chatbots—the human agent is still there. And in record numbers. Most large B2C organizations have actually increased their agent population over the last several years. And it is these human agents, the ones who represent your brand to millions of customers, who have been most ignored by innovators.

By embracing automation—not as a replacement, but as augmentor—to human agents, the entire performance of sales and service contact centers is dramatically elevated.

Macario Namie

As ASAPP followers know well, this is why we exist. By embracing automation—not as a replacement for human agents, but as an augmentor—the entire performance of sales and service contact centers is dramatically elevated. Real-time continuous intelligence techniques are used to tell every agent the right thing to say and do, live during an interaction. The company benefits from radical increases in organizational productivity, while customers get exactly what they want—the right answer in the fastest possible time.

We’re proud of the academic recognition ASAPP Research achieves for advancing the state of the art in automatic speech recognition (ASR), NLP, and task-oriented dialogue. However, it’s the business results of this applied research that keep ASAPP moving forward. We celebrate this Gartner recognition with our customers like American Airlines, DISH, and JetBlue, who are seeing the business results of AI in their customer service.

So what makes a company applying artificial intelligence for customer experience a “Cool Vendor?” Well, check out the Gartner report. However, I would say it’s our exclusive focus on human performance within CX. Learn more by reading this year’s Gartner Cool Vendor report.

GARTNER DOES NOT ENDORSE ANY VENDOR, PRODUCT OR SERVICE DEPICTED IN ITS RESEARCH PUBLICATIONS, AND DOES NOT ADVISE TECHNOLOGY USERS TO SELECT ONLY THOSE VENDORS WITH THE HIGHEST RATINGS OR OTHER DESIGNATION. GARTNER RESEARCH PUBLICATIONS CONSIST OF THE OPINIONS OF GARTNER’S RESEARCH ORGANIZATION AND SHOULD NOT BE CONSTRUED AS STATEMENTS OF FACT. GARTNER DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, WITH RESPECT TO THIS RESEARCH, INCLUDING ANY WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Measuring Success
R&D Innovations
Speech-to-text
Transcription
Articles

Task-oriented dialogue systems could be better. Here’s a new dataset to help.

by Derek Chen
May 26 · 2 mins
Dialogue State Tracking has run its course. Here’s why Action State Tracking and Cascading Dialogue Success are next.

For call center applications, dialogue state tracking (DST) has traditionally served as a way to determine what the user wants at that point in the dialogue. However, in actual industry use cases, the work of a call center agent is more complex than simply recognizing user intents.

In real-world environments, agents are typically tasked with strenuous multitasking. Tasks often include reviewing knowledge base articles, evaluating guidelines on what can be said, examining dialogue history with a customer, and inspecting customer account details, all at once. In fact, according to ASAPP internal research, call center phone agents spend approximately 82 percent of their total time looking at customer data, step-by-step guides, or knowledge base articles. Yet none of these aspects are accounted for in classical DST benchmarks. A more realistic environment would employ a dual constraint, where the agent needs to obey customer requests while considering company policies when taking actions.

That’s why, in order to improve the state of the art of task-oriented dialogue systems for customer service applications, we’re establishing a new Action-Based Conversations Dataset (ABCD). ABCD is a fully-labeled dataset with over 10k human-to-human dialogues containing 55 distinct user intents requiring unique sequences of actions constrained by company policies to achieve task success.

The major difference between ABCD and other datasets is that it asks the agent to adhere to a set of policies that call center agents often face, while simultaneously dealing with customer requests. With this dataset, we propose two new tasks: Action State Tracking (AST)—which keeps track of the state of the dialogue when we know that an action has taken place during that turn; and Cascading Dialogue Success (CDS)—a measure for the model’s ability to understand actions in context as a whole, which includes the context from other utterances.

The major difference between ABCD and other datasets is that it asks the agent to adhere to a set of policies that call center agents often face, while simultaneously dealing with customer requests.

Derek Chen

Dataset Characteristics

Unlike other large open-domain dialogue datasets often built for more general chatbot entertainment purposes, ABCD focuses more deeply on increasing the count and diversity of actions and text within the domain of customer service. Dataset participants were additionally incentivized through financial bonuses for properly adhering to policy guidelines while handling customer requests, mimicking customer service environments and realistic agent behavior.

The training process to annotate the dataset, for example, at times felt like training for a real call center role. “I feel like I’m back at my previous job as a customer care agent in a call center,” said one MTurk agent who was involved in the study. “Now I feel ready to work at or interview for a real customer service role,” said another.

New Benchmarks

The novel features in ABCD challenge the industry to measure performance across two new dialogue tasks: Action State Tracking and Cascading Dialogue Success.

Action State Tracking (AST)

AST improves upon DST metrics by detecting the pertinent intent from customer utterances while also taking into account constraints from agent guidelines. Suppose a customer is entitled to a discount which will be offered by issuing a [Promo Code]. The customer might request 30% off, but the guidelines stipulate only 15% is permitted, which would make “30” a reasonable, but ultimately flawed slot-value. To measure a model’s ability to comprehend such nuanced situations, we adopt overall accuracy as the evaluation metric for AST.
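As a toy illustration of that constraint (values taken from the example above; the function name is purely hypothetical), the correct slot-value is capped by the guidelines rather than taken directly from the customer’s request:

```python
# Tiny illustration of the AST example above: the customer's request alone is not the
# right slot-value, because agent guidelines cap what can actually be offered.
def resolve_promo_value(requested_pct: int, max_allowed_pct: int = 15) -> int:
    """Return the promo value an agent may actually issue under the guidelines."""
    return min(requested_pct, max_allowed_pct)

print(resolve_promo_value(30))  # -> 15, even though the customer asked for 30
```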

Cascading Dialogue Success (CDS)

Since the appropriate action often depends on the situation, we propose the CDS task to measure a model’s ability to understand actions in context. Whereas AST assumes an action occurs in the current turn, the task of CDS includes first predicting the type of turn and its subsequent details. The types of turns are utterances, actions, and endings. When the turn is an utterance, the detail is to respond with the best sentence chosen from a list of possible sentences. When the turn is an action, the detail is to choose the appropriate slots and values. Finally, when the turn is an ending, the model should know to end the conversation. This score is calculated on every turn, and the model is evaluated based on the percent of remaining steps correctly predicted, averaged across all available turns.
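The sketch below illustrates one plausible reading of that scoring scheme in Python; it is an assumption for illustration, not the exact evaluation code from the ABCD benchmark.

```python
# Illustrative sketch of the cascading scoring idea: from each starting turn, count
# consecutive correct predictions until the first error, as a fraction of the remaining
# turns, then average across starting turns.
from typing import List

def cascading_success(correct: List[bool]) -> float:
    """`correct[i]` is True if the model's prediction for turn i matched the reference."""
    n = len(correct)
    per_turn_scores = []
    for start in range(n):
        consecutive = 0
        for ok in correct[start:]:
            if not ok:
                break
            consecutive += 1
        per_turn_scores.append(consecutive / (n - start))
    return sum(per_turn_scores) / n if n else 0.0

print(cascading_success([True, True, False, True]))  # -> ~0.46
```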

Why This Matters

For customer service and call center applications, it is time for both the research community and industry to do better. Models relying on DST as a measure of success give little indication of performance in real-world scenarios, and discerning CX leaders should look to other indicators grounded in the conditions that actual call center agents face.

Rather than relying on general datasets which expand upon an obtuse array of knowledge base lookup actions, ABCD presents a corpus for building more in-depth task-oriented dialogue systems. The availability of this dataset and two new tasks creates new opportunities for researchers to explore better, more reliable, models for task-oriented dialogue systems.

We can’t wait to see what the community creates from this dataset. Our contribution to the field with this dataset is another major step to improving machine learning models in customer service.

Read the Complete Paper, & Access the Dataset

This work has been accepted at NAACL 2021. Meet the authors on June 8th, 20:00—20:50 EST, where this work will be presented as a part of “Session 9A-Oral: Dialogue and Interactive Systems.”

