The danger of only using containment rate to measure success
For many years, companies have measured the effectiveness of automated systems, such as their chatbot or IVR, by the system’s Containment Rate—the percent of interactions that don’t reach a human agent.
For digital chat programs, optimizing a bot to increase Containment Rate certainly has benefits. If some customers have their problems resolved in fully automated experiences without engaging an agent, then agents’ time will be freed up to assist other customers. Solving more customers’ issues in the bot means companies may require fewer employees to handle chat volume, resulting in cost savings.
What containment rate doesn’t measure
The problem with only measuring Containment Rate is that deflecting a customer doesn’t mean they’ve had their issue resolved. It simply means that a digital agent didn’t get involved in that particular interaction.
In my role as a Customer Experience Strategist at ASAPP, my team and I work with our customers to ensure that their customers are receiving the best possible experience when interacting with the brand. To optimize the customer’s experience, we need to ensure the metrics we track and measure answer the key question: “Did we resolve our customer’s issue?”
At ASAPP, when judging the effectiveness of a bot and making decisions to improve that effectiveness, we recommend using a metric called Flow Success—the number of conversations in which the customer was provided with the information necessary to address their issue without a rep needing to get involved. Using this metric enables companies to understand when their containment is “good containment” and unlocks additional opportunities to optimize their bots towards a great experience.
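To make the distinction concrete, here is a minimal sketch of how the two metrics could be computed from conversation records. The record structure and field names below are hypothetical, purely for illustration, and are not an ASAPP API:

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    """Hypothetical record of one bot-assisted interaction."""
    reached_human_agent: bool      # did a rep join the conversation?
    issue_addressed_in_bot: bool   # did the bot surface the info the customer needed?

def containment_rate(conversations: list[Conversation]) -> float:
    """Share of conversations that never reached a human agent."""
    contained = [c for c in conversations if not c.reached_human_agent]
    return len(contained) / len(conversations)

def flow_success_rate(conversations: list[Conversation]) -> float:
    """Share of conversations where the bot addressed the issue without a rep."""
    successes = [c for c in conversations
                 if c.issue_addressed_in_bot and not c.reached_human_agent]
    return len(successes) / len(conversations)

# A conversation can be contained (no agent involved) and still fail the customer:
history = [
    Conversation(reached_human_agent=False, issue_addressed_in_bot=True),   # good containment
    Conversation(reached_human_agent=False, issue_addressed_in_bot=False),  # bad containment
    Conversation(reached_human_agent=True,  issue_addressed_in_bot=False),  # escalated to a rep
]
print(containment_rate(history))   # ~0.67 -- looks healthy on its own
print(flow_success_rate(history))  # ~0.33 -- tells a different story
```

The gap between those two numbers is exactly the “bad containment” discussed below.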
Why flow success?
It is possible for a chatbot to have a high Containment Rate but a low Flow Success Rate. While this may represent potential cost savings for the company, this is an extremely frustrating experience for the end user.
Some automated flows require customers to take multiple, sometimes unnecessary, steps to find the solution to their problem. Other times, customers may be forced to log in to their account before they can get information when the solution could be provided to them on the phone without logging in. Sometimes customers may choose the wrong path in a flow and give up when they get information that isn’t relevant. These are all examples of “bad containment,” counting towards high Containment but low Flow Success.
In the best-case scenario, the customer abandons the experience because they found the answer to their question elsewhere. However, there is a greater risk that the customer gets frustrated with their bot experience and calls instead, forcing involvement from a voice agent and ultimately increasing the cost to resolve that issue. Even worse, a customer may become so annoyed that they become a churn risk for the company. The loss in customer lifetime value can greatly outweigh the cost of having that customer interact with a digital agent.
By shifting the focus from Containment Rate to Flow Success, we are able to help our customers identify and fix areas where this may be happening.
It’s important for CX teams to understand not just whether their automation contained the customer, but whether the customer’s need was actually served.
- Bobby Kovalsky, Sr. Manager, Insights and Strategy, ASAPP
For example, when we analyzed a US cable company’s virtual assistant experience, we found a large gap between Containment and Flow Success for a billing intent. Customers asking to have their bill explained were often contained within the bot but were rarely provided with the information they wanted about their bill. Further analysis revealed that customers were frustrated by the amount of information they needed to provide the bot before the bot would give them their answer. To improve the experience, we recommended the company remove some of these steps, which we ultimately determined were unnecessary for producing the response the customer needed.
After the company implemented our optimization recommendation, the automated flow not only saw improved Flow Success but ultimately also greater Containment. The share of conversations with this intent that were Contained without Flow Success decreased by 21%. Because customers were easily able to access their information in the bot experience, they were less likely to ask to speak to an agent, leading to a 29% increase in Containment Rate.
What are the tradeoffs?
We’ve seen that organizations that focus on Flow Success rather than strictly Containment are able to create a better customer experience. However, this sometimes means customers will be able to more easily reach a representative. Increasing “good containment” and reducing “bad containment” does not always correlate with an increase in overall Containment.
For example, an internet service provider saw high levels of Containment when customers were asking if they were in an outage. After conducting an analysis to identify areas where customers were not being told whether or not they were in an outage, we found that the existing authentication process was causing customers to abandon the bot. We recommended the company revisit their existing process.
By simplifying the sign-in process, the company ultimately made it easier for customers to reach digital reps and Containment decreased by 3%. However, significantly more customers were informed about the status of their outage, leading to a 17% increase in Flow Success. This organization accepted the tradeoff, allowing more customers to reach digital agents knowing that the large dropoff from the previous sign-in process contributed to increased call volume and therefore higher overall costs.
Although these types of changes may lead to lower Containment, they will ultimately drive higher organizational throughput. By enhancing the digital experience, customers will be more likely to choose digital channels for their future contacts. As the contact mix shifts towards digital, companies unlock additional benefits unique to the channel, such as increased concurrency, which enables them to handle more conversations with fewer representatives. This leads to larger cost savings than they would have achieved by preventing customers from reaching a digital rep.
What else should be considered?
Measuring Flow Success helps companies analyze and optimize their bot but it is not the only metric that matters. Companies may also want to consider the bot’s contribution to first contact resolution, call prevention, and customer satisfaction.
Creating the best bot experience requires companies to continuously evaluate and optimize performance. Those who focus on delivering the best customer experience in the bot rather than just lowering costs see long term benefits through increased customer satisfaction and higher digital adoption.
How do you know if ML-based features are really working?
It doesn’t take a rocket scientist (or a machine learning engineer) to know that customer service representatives need the right tools to do their job. If they don’t have them, both rep and customer satisfaction suffer.
As a Senior Customer Experience manager at ASAPP, I’ve spent the last several years partnering with some of the world’s largest consumer companies to improve their contact center operations and overall customer experience. As they adopt our technology, they’re eager to see the investment’s impact on operational performance. Measuring success typically starts with examining how agents interact with the system.
What happens when we empower agents with better technology? Do we create a positive feedback loop of happier employees and more satisfied customers? How can we track and learn from our best performers? Can usage be correlated with other important agent efficiency and success metrics? These are the types of questions my team seeks to answer with our customers as we evaluate a program’s augmentation rate.
Our suite of integrated automation features, including AutoSuggest, AutoComplete, and AutoPilot, uses machine learning to augment agent activity. The system recommends what reps should say or do during the course of an interaction with a customer. The machine learning models improve with usage—which in the contact center space can be millions of interactions per month. We work with our customers to measure the impact of these technologies on their operations and KPIs through our augmentation rate metric, which evaluates the percentage of messages sent by agents that were suggested by our algorithms.
A recent analysis found that each time one of our customers’ agents used a suggested response instead of typing freehand, they saved ~15 seconds. The time savings added up fast.
Jonathan Rossi
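As a rough illustration of how the metric and those time savings combine, consider a hypothetical month of chat traffic. The message counts below are made up for the example; only the roughly 15-second figure comes from the analysis above:

```python
# Hypothetical monthly volumes, for illustration only.
total_agent_messages = 1_000_000   # messages sent by agents this month
augmented_messages = 350_000       # of those, sent from a system suggestion

augmentation_rate = augmented_messages / total_agent_messages
print(f"Augmentation rate: {augmentation_rate:.0%}")            # 35%

seconds_saved_per_use = 15         # approximate savings per accepted suggestion
hours_saved = augmented_messages * seconds_saved_per_use / 3600
print(f"Estimated agent time saved: {hours_saved:,.0f} hours")  # ~1,458 hours
```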
Augmentation rate isn’t a common metric (yet). But it offers tremendous value as an indicator of how well the technology is being adopted, and therefore, the likelihood it will have an impact on performance outcomes.
From my experience, the top three things operators should know when utilizing this metric are:
- Iteration over time:
  - Augmentation rate offers guidance on:
    - How well the system is augmenting agent responses and learning through data;
    - How well reps are trained and coached to use the tools available to them inside our platform.
  - Both the system’s model and rep training can be calibrated and optimized to continually increase the effectiveness of these features.
- Workforce Management (WFM) Implications:
  - The top-level augmentation metric is helpful in measuring overall program health, but looking at usage across groups and individuals can also be extremely informative for supervisors and coaches when assessing agent and cohort performance.
  - We’ve found correlations between increased augmentation usage, AHT reduction, and improved CSAT for high-performing reps.
- Incentives matter:
  - If you incentivize a workforce on this metric alone, there can be adverse effects. We’ve seen reps attempt to “game the system” by always using a suggested message, then editing the response before sending. This actually increases conversation duration and decreases productivity compared to not using the features in the first place.
  - Augmentation should be one of multiple metrics that go into agent performance incentives (alongside others like CSAT, throughput, and resolution rate).
By studying augmentation rates at customer companies, we’ve been able to see exactly where agents get the most benefit from integrated automation and where pain points still exist. From that knowledge, ASAPP has begun building new capabilities to increase the impact ML can have on modern workforces. For example:
- Our product team is developing additional AutoPilot features (like AutoPilot Greetings) that will automate the beginning of conversations, so reps can focus on the “meat” of an interaction and better assist the customer.
- We know that both agents and customers prefer personalized conversations. Our research and product teams are tackling this problem in two ways. First, we incorporate custom responses into our platform, enabling reps to curate a repository of preferred messages to send to customers. This allows agents to use suggestions in their own voice. Second, as we become more flexible in leveraging customer-specific data throughout our platform, we’re embedding more personalized customer information directly into these suggestions.
Early feedback on these additions to our augmentation features has been overwhelmingly positive from both agents and operators. Like our machine learning models, we aim to iteratively improve our product capabilities over time through usage and impact analysis, working with our customers to radically increase rep satisfaction and efficiency—which ultimately benefits the customer experience.
How do you find automation workflows for your contact center?
Automating common tasks and enabling self-service issue resolution for customers is an essential part of any online customer service experience. These automated flows directly address a specific, well-scoped problem for the customer, getting them to resolution quicker and freeing up agents to handle more complex issues. But automation doesn’t have to be an all-or-nothing proposition. At ASAPP, we automate flows before, during, and after agent interactions, steadily reducing agent workload and growing the opportunity for self-service over time.
Discovering and prioritizing new flows and understanding what’s needed for successful automation, however, can be challenging. It is often a time consuming and labor intensive process. ASAPP has developed AI Native® approaches to surface these workflows to humans, and we’ve been awarded a patent, “Identifying Representative Conversations Using a State Model” for a powerful solution we developed to perform flow induction.
It’s difficult for a human to imagine all the possible conversation patterns that could be automated, and which ones are most important to automate. It’s important to consider things like how many users it would affect, how much agent time is being spent on the intent, whether the flow has a few well-defined paths or patterns, what value the intent brings to the business, and whether there are any overlaps between this intent and other conversations.
Rather than manually sifting through all the data, an analyst can leverage patterns identified by the model to more quickly deploy automated workflows and evaluate their potential with real usage data.
Michael Griffiths
We call the process of automatically discovering and distilling the conversational patterns—“workflows”, or “flows” for short—flow induction. We can condense a large collection of possible flows to a much smaller number of representative flows. These induced flows best capture interactions between customers and agents, and flag where automation can lend a helping hand. This facilitates faster and more comprehensive creation of automated flows, saving time and money.
Our patented approach for flow induction begins by representing each part of a conversation mathematically, capturing its state at the time. As a simple example, we would want the start of each conversation—where agents say “hello” or “how are you” or “welcome to X company”—to be similar, with approximately the same state representation. We can then trace the path the conversation takes as it progresses from start to finish. If the state is two-dimensional, you could draw the line that each conversation takes as its own “journey.” We then group similar paths and identify recurring patterns within and across conversations.
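To sketch the general idea (this is a toy illustration, not the patented implementation), imagine embedding each utterance into a state vector, resampling each conversation’s trajectory to a fixed length, and clustering the resulting paths. The encoder below is a trivial stand-in so the example runs end to end:

```python
import numpy as np
from sklearn.cluster import KMeans

def embed_utterance(text: str) -> np.ndarray:
    """Toy stand-in for a learned sentence encoder: hash words into a small vector."""
    vec = np.zeros(16)
    for token in text.lower().split():
        vec[hash(token) % 16] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def conversation_path(utterances: list[str], n_points: int = 8) -> np.ndarray:
    """Represent a conversation as a fixed-length path of state vectors,
    resampled so conversations of different lengths are comparable."""
    states = np.stack([embed_utterance(u) for u in utterances])
    idx = np.linspace(0, len(states) - 1, n_points).round().astype(int)
    return states[idx].flatten()

conversations = [
    ["hello welcome to x company", "i want to check my bill", "your balance is ...", "thanks"],
    ["hi how are you", "why is my bill so high", "let me explain the charges", "ok thank you"],
    ["hello", "my internet is down", "i see an outage in your area", "when will it be fixed"],
]
paths = np.stack([conversation_path(c) for c in conversations])

# Group similar paths; each cluster is a candidate recurring "flow" for an analyst to review.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(paths)
print(kmeans.labels_)  # e.g. the two billing conversations grouping apart from the outage one
```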
The process of identifying automation use cases is dramatically simplified with this representation. Instead of manually sifting through conversations, talking to experienced agents, or listening in on calls to do journey mapping, the analyst can dive into a pattern the model has identified and review its suitability for automation. Even better, because ASAPP is analyzing every customer interaction, we know how many customers are affected by the flows and what the outcomes (callback, sales conversion, etc.) are — making prioritization a breeze.
ASAPP deploys “flows” like this across our platform. By identifying the recurring work that agents are handling, an analyst can construct integrated flows for agents to serve in any part of a conversation. And over time, more and more flows can be sent directly to the customer so they can self-serve. Once deployed, every flow becomes part of a virtuous feedback loop, where usage informs how impactful the automation is for our customers and their customers. This process informs both new flow opportunities and refinements to existing flows.
AutoSummary's 3R Framework Raises the Bar for Agent Call Notes
Taking notes after a customer call is essential for ensuring that key details are recorded and ready for the next agent, yet it can be difficult to prioritize when agents have other tasks competing for their time. Could automated systems help bridge this gap while still delivering high-quality information? How should the data from customer interactions be organized so that it is useful and easily accessed in the future?
As we were developing AutoSummary, the ASAPP AI Service for automating call dispositioning, we asked our customers for input. ASAPP conducted customer surveys and discovered that agent notes needed to include Reason, Resolution, and Result for every conversation. This 3R Framework was key to success. Here’s a more detailed explanation of each, with a schematic example after the list:
- Reason – Agent notes need to focus on the reason for the customer interaction. This crucial bit of data, if accurately noted, immediately helps the next agent assisting the same customer with their issue. They’re able to dig into earlier details and resolve issues more quickly and efficiently while also impressing customers with their empathy and understanding of the situation.
- Resolution – It is essential to document the steps taken toward resolution if an agent needs to continue where another left off. When an agent clearly understands the problem and its context, it becomes much easier to follow a series of steps or flowcharts to resolve.
- Result – All interactions have a result that should be documented. This allows future customer service agents to see whether the problem was solved effectively, as well as any other important details.
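For illustration, a 3R-style note could be captured in a simple structured record like the hypothetical sketch below; the field names are ours, not the AutoSummary schema:

```python
from dataclasses import dataclass, field

@dataclass
class DispositionNote:
    """Hypothetical 3R-style call note; fields are illustrative only."""
    reason: str                 # why the customer reached out
    resolution: str             # steps taken (or still pending) toward fixing the issue
    result: str                 # outcome: resolved, escalated, follow-up scheduled, etc.
    steps: list[str] = field(default_factory=list)  # optional structured step log

note = DispositionNote(
    reason="Customer disputes a $20 charge on the latest bill",
    resolution="Reviewed line items; submitted a credit request for the duplicate charge",
    result="Resolved; credit will appear on the next statement",
)
print(note.reason)
```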
ASAPP designed AutoSummary to automate dispositioning using the 3R framework as a foundation. And, depending on the needs of the customer, AutoSummary can also provide additional information, like an analytics-ready structured representation of the steps taken during a call. We created AutoSummary with two goals in mind:
- Maintain a high bar for what’s included: A summary is, essentially, a brief explanation of the main points discussed during an interaction. Although summaries lengthen as conversations continue, we maintain a limit so that agents can read the note and become caught up in 10-20 seconds. We also eliminate any data that could be superfluous or inaccurate. Our strict standards guarantee a quality output while still being concise.
- Engineer for model improvement: While AutoSummary creates excellent summaries, a fundamental component of all ASAPP’s AI services is the power to rapidly learn from continuous usage. We designed a feedback system and work with our customers so that any changes agents make to the generated notes are fed back into our models. Thus, we’re constantly learning from what the agents do – and over time, as the model improves, we receive fewer modifications.
We’re always learning what our customers want and translating that into effective product design. For us, it’s been great to see how successful these summaries are in terms of business metrics such as customer satisfaction, brand loyalty, and agent retention. We strongly believe that good disposition notes for all customer interactions improve every metric mentioned above–and more!
On average, our customers who use AutoSummary save over a minute of call handling time per interaction, which saves them millions of dollars a year. Who wouldn’t want those kinds of results?
Utilizing Pre-trained Language Model for Speech Sentiment Analysis
The future of real-time speech sentiment analysis shows promise in offering new capabilities for organizations seeking to understand how customers feel about the quality of service received across customer service interactions. By understanding customer sentiment the moment customers express it, organizations are equipped with the intelligence to make nimble changes in service. To date, customer feedback surveys have fulfilled this purpose, but they come with some known limitations.
In addition to the low percentage of customers who fill out surveys, customer feedback surveys have a problem with bias: customers are more likely to respond to a survey after a notably positive or negative experience, heavily skewing results toward the extremes. With low response rates and biased results, it’s hard to argue that surveys provide a complete picture of the customer experience. Helping fill out this picture, future speech sentiment analysis capabilities offer another way for organizations to evaluate all of the interactions a customer has.
By collecting more information from every call (and not just a few polarized survey responses), speech sentiment could be a way to reduce bias and provide a more comprehensive measure of the customer experience. Future capabilities, which can measure real-time attitude and opinion regarding the service customers receive, can equip organizations with intelligence to make swift shifts in agent coaching or experience design. As more contact center agents work from home, access to live sentiment insight could be a great way for supervisors to support agents at a moment’s notice without needing to be in the same office.
Current methods in speech sentiment analysis are bringing us closer to realizing these real-time sentiment analysis capabilities, but several research hurdles remain in acquiring the right dataset to train these models. Medhat et al. (2014) illustrate how current NLP sentiment data comes in the form of written text reviews, but this is not the right kind of data for speech analysis of conversational recordings. Even when audio data is available, it often comes as limited scripted conversations repeated by a single actor, or as monologues, which is insufficient for sentiment analysis on natural conversations.
As we work to advance the state of the art in speech sentiment analysis, new ASAPP research presented at Interspeech 2021 is making progress in lowering these barriers.
The Conventional Approach
While ASAPP’s automatic speech recognition (ASR) system is a leader in speech-to-text performance, conventional methods of using cascading ASR and text-based natural language processing (NLP) sentiment analysis systems have several drawbacks.
Large language models trained on text-based examples for sentiment analysis show a large drop in accuracy when applied to transcribed speech. Why? We speak differently than how we write. Spoken language and written language lie in different domains, so the language model trained on written language (e.g. BERT was trained using BooksCorpus and Wikipedia) does not perform well on spoken language input.
Furthermore, abstract concepts such as sarcasm, disparagement, doubt, suspicion, yelling, or intonation add to the complexity of speech sentiment recognition, on top of the already challenging task of text-based sentiment analysis. Such systems also lose rich acoustic/prosodic information which is critical to understanding spoken language (such as changes in pitch, intensity, raspy voice, speed, etc.).
Speech annotation for training sentiment analysis models has been offered as a way to overcome this obstacle in controlled environments [Chen et al., 2020], but it is costly to collect. While publicly available text can be found virtually everywhere, from social media to English literature, acquiring conversational speech with the proper annotations is harder given limited open-source availability. And, unlike sentiment-annotated text, speech annotation requires more time, since annotators must listen to the audio.
ASAPP Research: Leveraging Pre-trained Language Model for Speech Sentiment Analysis
Leveraging pre-trained neural networks is a popular way to reduce the annotation resources needed for downstream tasks. In the field of NLP, great advances have been made through pre-training task-agnostic language models without any supervision, e.g., BERT. Similarly, in the study of Spoken Language Understanding (SLU), pre-training approaches were proposed in combination with ASR or acoustic classification modules to improve SLU performance under limited resources.
The aforementioned pre-training approaches only focus on how to pre-train the acoustic model effectively, with the assumption that if a model is pre-trained to recognize words or phonemes, the fine-tuning result of downstream tasks will be improved. However, they did not consider transferring information to the conversational domain from a language model that had already been trained on large amounts of written text.
We propose the use of powerful pre-trained language models to transfer more abstract knowledge from the written text-domain to speech sentiment analysis. Specifically, we leverage pre-trained and fine-tuned BERT models to generate pseudo labels to train a model for the end-to-end (E2E) speech sentiment analysis system in a semi-supervised way.
For the E2E sentiment analysis system, a pre-trained ASR encoder is needed to prevent overfitting and encode speech context efficiently. To transfer the knowledge from the text domain, we generated pseudo sentiment labels from either ASR transcripts or ground truth human transcripts. The pseudo labels can be used to pre-train the sentiment classifier in the semi-supervised training phase. In the fine-tuning phase, the sentiment classifier can be trained with any speech sentiment dataset we want to use. A speech sentiment dataset matched to the target domain would give the best results in this phase. We verified our proposed approach using the large-scale Switchboard sentiment dataset [Chen et al., 2020].
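The pseudo-labeling step can be sketched roughly as follows. This is a simplified illustration using an off-the-shelf Hugging Face text sentiment pipeline, not the exact models, labels, or data used in the paper:

```python
# Sketch of the pseudo-labeling idea, not the paper's exact pipeline.
from transformers import pipeline

# A BERT-style sentiment classifier trained/fine-tuned on written text.
text_sentiment = pipeline("sentiment-analysis")

# Unlabeled conversational transcripts (ASR output or human transcripts).
transcripts = [
    "yeah that totally fixed it thank you so much",
    "i've been on hold for an hour and nobody can help me",
]

# 1) Generate pseudo sentiment labels from the text-domain model.
pseudo_labels = [text_sentiment(t)[0]["label"] for t in transcripts]
print(list(zip(transcripts, pseudo_labels)))

# 2) The resulting (speech, pseudo_label) pairs would then pre-train the sentiment
#    classifier on top of a pre-trained ASR encoder, before fine-tuning on a smaller
#    set of human-annotated speech sentiment data.
```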
Transfer learning between spoken and written language domains was not actively addressed before. This work found that pseudo sentiment labels obtained from a pre-trained model trained in the written text-domain can transfer the general sentiment knowledge into the spoken language domain using a semi-supervised training framework. This means that we can train the network more efficiently with less human supervision.
Suwon Shon, PhD
Why this matters
Transfer learning between spoken and written language domains was not actively addressed before. This work found that pseudo sentiment labels obtained from a pre-trained model trained in the written text-domain can transfer the general sentiment knowledge into the spoken language domain using a semi-supervised training framework.
This means that we can train the network more efficiently with less human supervision. From the experiment in Figure 3, we can save about 65% (30h vs. 86h) of human sentiment annotation using our pseudo label-based semi-supervised training approach. Viewed another way, this also means that we can boost the performance of sentiment analysis when we use the same amount of sentiment-annotated training data. We observe that the best system showed about a 20% improvement in unweighted F1 score (57.63%) on the evaluation set compared to the baseline (48.16%).
Lastly, we observed that using ASR transcripts for pseudo labels gives a slight performance degradation, but still shows better performance than the baseline. This result allows us to use a huge amount of unlabeled speech in a semi-supervised training framework without any human supervision.
Multi-mode ASR: Increasing Robustness with Dynamic Future Contexts
Automatic Speech Recognition (ASR), as its name indicates, is a technology tasked with deriving text transcriptions from auditory speech. As such, ASR is the backbone that provides real-time transcriptions for downstream tasks. This includes critical machine learning (ML) and natural language processing (NLP) tools that help human agents reach optimal performance. Downstream ML/NLP examples include auto-suggest features for an agent based on what a customer is saying during a call, creating post-call summary notes from what was said, or intent classification, i.e. knowing what a customer is calling for to pair them with the most appropriate agent. Crucial to the success of these AI systems is the accuracy of speech transcriptions. Only by accurately detecting what a customer or agent is saying in real-time, can we have AI systems provide insights or automate tasks accordingly.
A key way to improve this accuracy is to provide more surrounding speech information to the ASR model. Rather than predicting what a speaker is saying based only on what has already been said, a model that also uses upcoming speech as future context is better able to detect the difference between someone who said “I’m going to the cinema today [to watch the new James Bond]” and someone who said “I’m going to the cinema to date… [James Bond].” When we predict words, using speech frames from future utterances gives more context. And by utilizing more context, some of the errors that emerge from relying on past context alone can be fixed.
The increased accuracy of this longer-context approach comes with a trade-off in latency and speed, since the model must wait for and compute future frames. Latency constraints vary depending upon services and applications. People usually train the best model for a given latency requirement, and an ASR model’s accuracy would be compromised if it were used under a latency condition different from the one used during training. Meeting various scenarios or service requirements with this approach thus means that several different models would have to be trained separately—making development and maintenance difficult, which is a scalability issue.
At ASAPP, we require the highest accuracy and lowest latency to achieve true real-time insights and automation. However, given our diverse product offerings with different latency requirements, we also need to address the scalability issue efficiently. To overcome this challenge, research accepted at Interspeech 2021 takes a new approach with an ASR model that dynamically adjusts its latency based on different constraints without compromising accuracy, which we refer to as Multi-mode ASR.
The ASAPP Research: A Multi-mode Transformer Transducer with Stochastic Future Context
Our work expands upon previous research on dual-mode ASR (Yu et al., 2020). A Transformer model has the same structure for both the full context model and the streaming model: the full context model uses unlimited future context and the streaming model uses limited future context (e.g., 0 or 1 future speech frame per neural layer, where a frame covers 10 ms of speech and we use 12 layers). The only difference is that self-attention controls how many future frames the model can access by masking the frames. Therefore, it is possible to operate the same model in full context and streaming mode. Additionally, we can use “knowledge distillation” when training the streaming mode. That is, we train the streaming mode not only on its original objective, but also to produce outputs that are similar to the ones produced by the full context mode. This way, we can further bridge the gap between streaming and full context modes. This method substantially reduces the accuracy drop and alignment delay of streaming ASR. We were directly motivated by this method and set out to extend it to multiple modes.
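The distillation idea can be sketched schematically as a combined loss. The snippet below uses cross-entropy as a stand-in for the transducer loss so the example stays self-contained; it is not the paper’s exact objective:

```python
import torch
import torch.nn.functional as F

def dual_mode_loss(full_logits, stream_logits, targets, distill_weight=1.0):
    """Streaming mode is trained on its own objective plus a distillation term
    that pulls its output distribution toward the full-context mode's outputs."""
    task_loss_full = F.cross_entropy(full_logits, targets)
    task_loss_stream = F.cross_entropy(stream_logits, targets)
    distill = F.kl_div(
        F.log_softmax(stream_logits, dim=-1),
        F.softmax(full_logits.detach(), dim=-1),  # teacher: full-context outputs
        reduction="batchmean",
    )
    return task_loss_full + task_loss_stream + distill_weight * distill

# Dummy batch: 4 frames, 10 output symbols.
full = torch.randn(4, 10)
stream = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
print(dual_mode_loss(full, stream, targets))
```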
Our multi-mode ASR is similar to dual-mode but it is broader and more general. We didn’t limit the streaming mode to a single configuration using only one future context size, but defined it as using a stochastic future context size instead. As described in Figure 1 below, dual-mode ASR is trained on a predefined pair consisting of the full context mode and the zero context (streaming) mode. In contrast, multi-mode ASR trains a model using multiple pairs of the full context mode and the streaming mode with a future context size of C where C is sampled from a stochastic distribution.
Since C is selected from a distribution for every single minibatch during training, a single model is trained on various future context conditions.
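A rough sketch of that sampling step, assuming the future context is enforced with a boolean self-attention mask (the exact masking scheme and sampling distribution in the paper may differ):

```python
import random
import torch

def future_context_mask(num_frames: int, context_size: int) -> torch.Tensor:
    """Boolean self-attention mask: frame i may attend to frames j <= i + context_size.
    context_size=0 is fully streaming; a very large value approaches full context."""
    idx = torch.arange(num_frames)
    return idx.unsqueeze(1) + context_size >= idx.unsqueeze(0)

# Multi-mode training: sample C from a distribution for every minibatch,
# so a single model sees many streaming and full-context conditions.
context_choices = [0, 1, 2, 4, 8, 10_000]  # 10_000 acts as "effectively full context"
for step in range(3):
    C = random.choice(context_choices)
    mask = future_context_mask(num_frames=6, context_size=C)
    print(f"step {step}: sampled future context C={C}")
    print(mask.int())
```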
We say that evaluation conditions are matched when the training context size and the inference context size are the same, and that they are mismatched otherwise. The results in Table 1 show that a streaming model only works well when it’s matched, i.e., trained and evaluated on past speech alone. Although the results for the dual-mode trained model are better than for the model trained alone (a result of the knowledge distillation), it also doesn’t work well in the mismatched condition. In contrast, our proposed multi-mode trained model operates reliably in multiple conditions, because the mismatched condition is eliminated by using a stochastic future context. Looking at the detailed results for each context condition, training with a stochastic future context also appears to bring regularization benefits to a single model.
Rather than developing and maintaining multiple ASR models that work under varying levels of time constraints or conditions, we’ve introduced a single multi-mode model that can dynamically adjust to various environments and scenarios.
Kwangyoun Kim
Why this matters
ASR is used in services with various environments and scenarios. To create downstream ML and NLP tasks that produce results within seconds and work well with human workflows, ASAPP’s ASR model must similarly operate in milliseconds based on the situation. Rather than developing and maintaining multiple ASR models that work under varying levels of time constraints or conditions, we’ve introduced a single multi-mode model that can dynamically adjust to various environments and scenarios.
By exposing a single model to various conditions, one model can change the amount of future context it uses to meet the latency requirements of a particular application. This makes it easier and more resource-efficient to cover all the different scenarios. Going further, if latency increases due to unpredictable service load, the configuration can be changed easily on the fly, significantly increasing usability with minimal accuracy degradation. Algorithms for responding to multiple scenarios usually suffer sub-optimal performance compared to a model optimized for one condition, but multi-mode ASR shows that it can cover multiple conditions without such problems.
What’s next for us at ASAPP
The paper about this study will be presented at Interspeech 2021 (Wed, Sep 1st, 11:00 ~ 13:00, GMT +2). The method and detailed results are described in that paper. We believe that this research topic is one of the promising directions to effectively support various applications, services, and customers. Research is also underway to extend this method to train a general model by combining it with pre-training methods. We will continue to focus on research on scalability as an important factor in terms of model training and deployment.
Introducing CLIP: A Dataset to Improve Continuity of Patient Care with Unsupervised NLP
Continuity of care is crucial to ensuring positive health outcomes for patients, especially in the transition from acute hospital care to out-patient primary care. However, information sharing across these settings is often imperfect.
Hospital discharge notes alone easily top thousands of words and are structured with billing and compliance in mind, rather than the reader, making poring over these documents for important pending actions especially difficult. Compounding this issue, primary care physicians (PCPs) already are short on time—receiving dozens of emails, phone calls, imaging, and lab reports per day (Baron 2010). Lost in this sea of hospital notes and time constraints are important actions for improving patient care. This can cause errors and complications for both patients and primary care physicians.
Thus, in order to improve the continuity of patient care, we are releasing one of the largest annotated datasets for clinical NLP. Our dataset, which we call CLIP, for CLInical Follow-uP, makes the task of action item extraction tractable, by enabling us to train machine learning models to select the sentences in a document that contain action items.
By leveraging modern methods in unsupervised NLP, we can automatically highlight action items from hospital discharge notes and action items for primary care physicians–saving them time and reducing the risk that they miss critical information.
James Mullenbach
We view the automatic extraction of required follow-up action items from hospital discharge notes as a way to enable more efficient note review and performance for caregivers. In alignment with the ASAPP mission to augment human activity by advancing AI, this dataset and task provide an exciting test ground for unsupervised learning in NLP. By automatically surfacing relevant historical data to improve communication, this work represents another key way ASAPP is improving human augmentation with AI. In our ACL 2021-accepted paper, we demonstrate this with a new algorithm.
The CLIP Dataset
Our dataset is built upon MIMIC-III (Johnson et al., 2016), a large, de-identified, and open-access dataset from the Beth Israel Deaconess Medical Center in Boston, which is the foundation of much fruitful work in clinical machine learning and NLP. From this dataset, with the help of a team of physicians, we labeled each sentence in 718 full discharge summaries, specifying whether the sentence contained a follow-up action item. We also annotated 7 types to further classify action items by the type of action needed; for example, scheduling an appointment, following a new medication prescription, or reviewing pending laboratory results. This dataset, comprising over 100,000 annotated sentences, is one of the largest open-access annotated clinical NLP datasets to our knowledge, and we hope it can spur further research in this area.
How well does machine learning accomplish this task? In our paper we approach the task as sentence classification, individually labeling each sentence in a document with its followup types, or “No followup”. We evaluated several common machine learning baselines on the task, adding some tweaks to better suit it, such as including more than one sentence as input. We find that the best models, based on the popular transformer-based model BERT, provide a 30% improvement in F1 score relative to the linear model baseline. The best models achieve an F1 score around 0.87, close to the human performance benchmark of 0.93.
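A simplified version of that setup is sketched below, using Hugging Face transformers for multi-label sentence classification with a neighboring sentence as extra context. The label names are illustrative (the post names appointments, medications, and pending labs among the seven types), and the classifier head here is untrained:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative follow-up type names; the actual CLIP label set is defined in the paper.
LABELS = ["appointment", "medication", "lab", "procedure", "imaging", "case-specific", "other"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # a sentence can carry several types
)

# Pair the target sentence with a neighboring sentence as extra context.
sentence = "Please follow up with cardiology within two weeks."
context = "Patient was started on metoprolol during this admission."
inputs = tokenizer(sentence, context, return_tensors="pt", truncation=True)

with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]
predicted = [label for label, p in zip(LABELS, probs) if p > 0.5]  # arbitrary until fine-tuned
print(predicted)
```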
Model pre-training for healthcare applications
We found that an important factor in developing effective BERT-based models was pre-training them on appropriate data. Pre-training exposes models to large amounts of unlabeled data, and serves as a way for large neural network models to learn how to represent the general features of language, like proper word ordering and which words often appear in similar contexts. Models that were pre-trained only on generic data from books or the web may not have enough knowledge on how language is used specifically in healthcare settings. We found that BERT models pre-trained on MIMIC-III discharge notes outperformed the general-purpose BERT models.
For clinical data, we may want to take this focused pre-training idea a step further. Pre-training is often the most costly step of model development due to the large amount of data used. But can we reduce the amount of data needed by selecting data that is highly relevant to our end task? In healthcare settings, where data is private and computational resources are limited, this would make automating action item extraction more accessible. In our paper, we describe a method we call task-targeted pre-training (TTP) that builds pre-training datasets by selecting sentences that look the most like the sentences in our annotated data that do contain action items. We find that it’s possible, and maybe even advantageous, to select data for pre-training in this way, saving time and computational resources while maintaining model performance.
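A toy sketch of the selection idea behind TTP: score unlabeled sentences by how similar they are to labeled action-item sentences, and keep the most similar ones for pre-training. The TF-IDF similarity used below is a simplification of the approach in the paper:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

labeled_action_items = [
    "please follow up with cardiology within two weeks",
    "repeat basic metabolic panel in 3 days to recheck potassium",
]
unlabeled_corpus = [
    "patient was admitted with chest pain",
    "schedule outpatient mri of the lumbar spine after discharge",
    "the hospital course was uncomplicated",
    "pcp to recheck blood pressure and titrate lisinopril",
]

vectorizer = TfidfVectorizer().fit(labeled_action_items + unlabeled_corpus)
scores = cosine_similarity(
    vectorizer.transform(unlabeled_corpus),
    vectorizer.transform(labeled_action_items),
).max(axis=1)  # each unlabeled sentence's best match against the labeled action items

top_k = 2
selected = [unlabeled_corpus[i] for i in np.argsort(scores)[::-1][:top_k]]
print(selected)  # these sentences would form the task-targeted pre-training set
```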
Improving physician performance and reducing cognitive load
Ultimately, our end goal is to make physicians’ jobs easier by reducing the administrative burden of reading long hospital notes, and bring their time and focus back where it belongs: on the patient. Our methods can condense notes down to what a PCP really needs to know, reducing note size by at least 80% while keeping important action items readily available. This reduction in “information overload” can reduce physicians’ likelihood of missing important information (Singh et al., 2013), improving their accuracy and the well-being of their patients. Through a simple user interface, these models could enable a telemedicine professional to more quickly and effectively aid a patient that recently visited the hospital.
Read more and access the data
Our goal with open sourcing CLIP is to enable lots more future work in this area of summarizing clinical notes and reducing physician workload, with our approach serving as a first step. We anticipate that further efforts to incorporate the full document into model decisions, exploit sentence-level label dependencies, or inject domain knowledge will be fruitful. To learn more, visit our poster session at ACL occurring Monday, Aug. 2, 11:00 a.m.—1:00 p.m. ET.
Paper
CLIP Dataset
Code Repository
Citations
Richard J. Baron. 2010. What’s keeping us so busy in primary care? A snapshot from one practice. The New England Journal of Medicine, 362(17):1632–1636.
Hardeep Singh, Christiane Spitzmueller, Nancy J. Petersen, Mona K. Sawhney, and Dean F. Sittig. 2013. Information overload and missed test results in electronic health record-based settings. JAMA Internal Medicine, 173(8):702–704.
Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad M. Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data.
Why I joined ASAPP: Taking AI to new levels in enterprise solutions
I have spent the past 20 years working in natural language processing and machine learning. My first project involved automatically summarizing news for mobile phones. The system was sophisticated for its time, but it amounted to a number of brittle heuristics and rules. Fast forward two decades and techniques in natural language processing and machine learning have become so powerful that we use them every day—often without realizing it.
After finishing my studies, I spent the bulk of these 20 years at Google Research. I was amazed at how machine learning went from a promising tool to one that dominates almost every consumer service. At first, progress was slow. A classifier here or there in some peripheral system. Then progress came faster, and machine learning became a first-class citizen. Finally, end-to-end learning started to replace whole ecosystems that a mere 10 years before were largely based on graphs, simple statistics, and rules-based systems.
After working almost exclusively on consumer-facing technologies, I started shifting my interests towards enterprise. There were so many interesting challenges in this space: the complexity of needs, the heterogeneity of data, and often the lack of clean, large-scale training sets that are critical to machine learning and natural language processing. However, there were properties that made enterprise tractable. While the complexity of tasks was high, the set of tasks any specific enterprise engaged in was finite and manageable. The users of enterprise technology are often domain experts and can be trained. Most importantly, these consumers of enterprise technology were excited to interact with artificial intelligence in new ways—if it could deliver on its promise to improve the quality and efficiency of their efforts.
This led me to ASAPP.
I am firm in my belief that to take enterprise AI to the next level, a holistic approach is required. Companies must focus on challenges with systemic inefficiencies and develop solutions that combine domain expertise, machine learning, data science, and user experience (UX) in order to elevate the work of practitioners. The goal is to improve and augment sub-tasks that computers can solve with high precision in order to enable experts to spend more time on more complex tasks. The core mission of ASAPP is exactly in line with this, specifically directed towards customer service, sales, and support.
To take enterprise AI—and customer experience—to the next level a holistic approach is required.
Ryan McDonald, PhD
The customer experience is ripe for AI to elevate to the next level. Everyone has experienced bad customer service, but also amazing customer service. How do we understand the choices that the best agents make? How do we recognize opportunities where AI can automate routine and monotonous tasks? Can AI help automate non-deterministic tasks? How can AI improve the agent experience, leading to less burnout, lower turnover, and higher job satisfaction? This is in an industry that employs three million people in the United States alone but suffers from an average of 40 percent attrition—one of the highest rates of any industry.
ASAPP is focusing its efforts singularly on the Customer Experience and there are enough challenges here to last a lifetime. But, ASAPP also recognizes that this is the first major step on a longer journey. This is evident in the amazing research group that ASAPP has put together. They are not just AI in name, but also in practice. Our research group consists of machine learning and language technology leaders, many of whom publish multiple times a year. We also have some of the best advisors in the industry from universities like Cornell and MIT. This excites me about ASAPP. It is the perfect combination of challenges and commitment to advanced research that is needed in order to significantly move the needle in customer experience. I’m excited for our team and this journey.
To realize Forrester’s vision of conversational intelligence, a human focus is needed.
For the CX industry, success has always relied on an ability to deliver high-quality customer interactions at scale. The availability of omnichannel opened up new, convenient avenues for customers to engage with organizations, yet it also increased the volume of interactions needing resolution. Thanks to modern advances in AI research, conversational and speech intelligence is having a renaissance moment, improving CX revenue and efficiency at this rising scale.
As proof of this trend, Forrester Research released their new Q2 2021 report, “Now Tech: Conversation Intelligence” which names ASAPP among the leading conversation intelligence providers. The report guides forward-looking organizations to harness conversational intelligence in three key areas:
- Delivering CX insights at scale: solutions which help organizations understand the voice of the customer and the agent at every interaction.
- Improving CX behavior at scale: solutions which monitor and guide agents on what to say, actions to take, or areas to coach an agent.
- Accelerating revenue: solutions which give sales teams the insights they need to drive a greater volume of better leads and to ensure they are acted upon.
In looking at these areas, it’s no surprise that organizations like American Airlines and DISH are turning to ASAPP for real-time insights that empower customer service and sales agents to achieve peak performance. At ASAPP, we believe that intelligence is best deployed at the point where it matters: in real time, where the interactions occur. It’s why we’re committed to advancing true AI that is redefining automation in contact center operations, tripling throughput, increasing digital adoption, and lowering operational costs.
Real-time insights from a continuously learning system improve a company’s ability to deliver highly personalized customer experiences—and substantially improve efficiency at the same time.
Macario Namie
This real-time conversational intelligence replaces yesterday’s rules-based systems by capitalizing on the insights of your agents and customers. A rules-based system, whether it feeds chatbots or humans, only captures a fraction of the available knowledge and doesn’t take advantage of the lessons learned from today’s data pools. Rigid rules-based systems aren’t flexible or generalizable enough for diverse customer needs, and no rules-based system will deliver the customized, real-time intelligence that equips agents with what to say in the moment.
It’s time for us to harness conversational intelligence that applies the knowledge of agents at scale. CX leaders who utilize a combination of conversational intelligence and automation understand how this leads to better voice and digital experiences that increase Customer Satisfaction (CSAT) and Net Promoter Scores (NPS). It’s why organizations that deploy ASAPP see an exponential improvement in performance that delivers measurable results in less than 60 days.
That’s all to say that we’re proud to see further recognition of ASAPP’s value in conversational intelligence. The Forrester Research report builds on our distinction as a “Cool Vendor” by Gartner. How are you thinking of using conversation intelligence at your organization?
Read the full report by Forrester Research for more details.
See the press release here.