Blog
Preventing hallucinations in generative AI agent: Strategies to ensure responses are safely grounded
The term “hallucination” has become both a buzzword and a significant concern. Unlike traditional IT systems, Generative AI can produce a wide range of outputs based on its inputs, often leading to unexpected and sometimes incorrect responses. This unpredictability is what makes Generative AI both powerful and risky. In this blog post, we will explore what hallucinations are, why they occur, and how to ensure that AI responses are safely grounded to prevent these errors.
What is a Hallucination?
Definition and Types
In the context of Generative AI, a hallucination refers to an output that is not grounded in the input data or the knowledge base the AI is supposed to rely on. Hallucinations can be broadly categorized into two types:
- Harmless Hallucinations: These are errors that do not significantly impact the user experience or the integrity of the information provided. For example, an AI might generate a slightly incorrect but inconsequential detail in a story.
- Harmful Hallucinations: These are errors that can mislead users, compromise brand safety, or result in incorrect actions. For instance, an AI providing incorrect medical advice or financial information falls into this category.
Two Axes of Hallucinations
To better understand hallucinations, we can consider two axes. Justification - whether the AI had information indicating that its statement was true. And truthfulness - whether the statement was actually true.
Based on these axes, we can classify hallucinations into four categories:
- Justified and True: The ideal scenario where the AI’s response is both correct and based on available information.
- Justified but False: The AI’s response is based on outdated or incorrect information. This can be fixed by improving the information available. For example, if an API response is ambiguous, it can be updated so that it is clear what it is referring to.
- Unjustified but True: The AI’s response is correct but not based on the information it was given. For example, the AI telling the customer that they should arrive at an airport 2 hours before a flight departure. If this was not grounded in, say, a knowledge base article, then this information is technically a hallucination even if it is true.
- Unjustified and False: The worst-case scenario where the AI’s response is both incorrect and not based on any available information. These are harmful hallucinations that could require an organization to reach out to the customer to fix the mistake.
Why Do Hallucinations Occur?
Hallucinations in generative AI occur due to several reasons. Generative models are inherently stochastic, meaning they can produce different outputs for the same input. Additionally, the large output space of these models increases the likelihood of errors, as they are capable of generating a wide range of responses. AI systems that rely on incomplete or outdated data are also prone to making incorrect statements. Finally, the complexity of instructions can result in misinterpretation, which may cause the model to generate unjustified responses.
Hallucination prevention and management
We typically think about four pillars when it comes to preventing and managing hallucinations:
- Preventing hallucinations from occurring in the first place
- Catching hallucinations before they reach the customer
- Tracking hallucinations post-production
- Improving the system
1. Preventing Hallucinations: Ensuring Responses are Properly Grounded
One of the most effective ways to prevent hallucinations is to ensure that AI responses are grounded in reliable data. This can be achieved through:
- Explicit Instructions: Providing clear and unambiguous instructions can help the AI interpret and respond accurately.
- API Responses: Integrating real-time data from APIs ensures that the AI has access to the most current information.
- Knowledge Base (KB) Articles: Relying on a well-maintained knowledge base can provide a solid foundation for AI responses.
2. Catching hallucinations: Approaches to Catch Hallucinations in Production
To minimize the risk of hallucinations, several safety mechanisms can be implemented:
- Input Safety Mechanisms: Detecting and filtering out abusive or out-of-scope requests before they reach the AI system can prevent inappropriate responses.
- Output Safety Mechanisms: Assessing the proposed AI response for safety and accuracy before it is delivered to the user. This can involve:some text
- Deterministic Checks using rules-based systems to ensure that certain types of language or content are never sent to users, or
- Classification Models, which employ machine learning models to classify and filter out potentially harmful responses. For example, a model can classify whether the information contained in a proposed response has been grounded on information retrieved from the right data sources. If this model suspects that the information is not grounded, it can reprompt the AI system to try again with this feedback.
3. Tracking hallucinations: post-production
A separate post-production model can be used to classify AI responses as mistakes in more detail. While the “catching hallucination” model should balance effectiveness with latency, the post-production mistake monitoring model can be much larger, as latency is not a concern.
A well-defined hallucination taxonomy is crucial for systematically identifying, categorizing, and addressing errors in Generative AI systems. By having a well-defined error taxonomy system, users can aggregate reports that make it easy to identify, prioritize, and resolve issues quickly.
The following categories help identify the type of error, its source of misinformation, and the impact.
- Error Category -broad classification of types of the generative AI system errors.
- Error Type - specific nature or cause of an AI system error.
- Information Source - origin of data used by the AI system.
- System Source - component responsible for generating or processing AI output.
- Customer Impact Severity - level of negative effect on the customer.
4. Improving the system
Continuous improvement is crucial for managing and reducing hallucinations in AI systems. This involves several key practices. Regular updates ensure that the AI system is regularly updated with the latest data and information. Implementing feedback loops allows for the reporting and analysis of errors, which helps improve the system over time. Regular training and retraining of the AI model are essential to enable it to adapt to new data and scenarios. Finally, human oversight involving contact center supervisors to review and correct AI responses, especially in high-stakes situations, is critical.
Conclusion
By understanding the nature of hallucinations and implementing robust mechanisms to prevent, catch, and manage them, organizations can harness the power of Generative AI while minimizing risks. Just as human agents in contact centers are managed and coached to improve performance, Generative AI systems can also be continually refined to ensure they deliver accurate and reliable responses. By focusing on grounding responses in reliable data, employing safety mechanisms, and fostering continuous improvement, we can ensure that AI responses are safely grounded and free from harmful hallucinations.
Want more? Download the practical guide to safety with generative AI agents in the contact center
Some brands are already realizing substantial value with generative AI agents. And some have experienced very public failures.
What’s the difference? Understanding how to deploy genAI agents safely.
Need a crash course on genAI safety? Get the ultimate guide here. You’ll discover:
- What causes hallucinations and other AI errors
- Steps you can take to manage risk with genAI agents
- The safety measures you should expect from you AI vendor
- How to realize value and protect your business
Building enterprise-grade AI solutions for CX in a post-LLM world
The advent of commercially available cloud-hosted large language models (LLMs) such as OpenAI’s GPT-4o and Anthropic’s Claude family of models has completely changed how we solve business problems using machine learning. Large pre-trained models produce state-of-the-art results on many typical natural language processing (NLP) tasks such as question answering or summarization out-of-the-box. Agentic AI systems built on top of these LLMs incorporate tool usage and allow LLMs to perform goal-directed tasks and manipulate data autonomously, for example by using APIs to connect to the company knowledge base, or querying databases.
The extension of these powerful capabilities to audio and video has already started, and will endow these models with the ability to make inferences or generate multi-modal output (produce responses not only in text, but also audio, image, and video) based on complex, multi-modal input. As massive as these models are, the cost and the time it takes to produce outputs is decreasing rapidly, and this trend is expected to continue.
While the barrier to entry is now much lower, applying AI across an enterprise is no straightforward task. It’s necessary to consider hidden costs and risks involved in managing such solutions for the long-term. In this article, we will explore the challenges and considerations involved in building your own enterprise-grade AI solutions using LLMs, and look at a contact center example.
The first step of a long journey
LLMs have served to raise the level of abstraction at which AI practitioners can create solutions. Tremendously so. Problems that took immense time, skill, experience and effort to solve using machine learning (ML) are now trivial to solve on a laptop. This makes it tempting to think of the authoring of a prompt as being equivalent to the development of a feature, replacing the prior complex and time-consuming development process.
However, showing that something can be done isn’t quite the same as a production-worthy, scalable solution to a business problem.
For one, how do you know you have solved the problem? It is common practice to create a high-quality dataset for training and evaluating the model. While the creation of such a benchmark takes time and effort, it is standard practice. Since LLMs come pretrained, such a dataset is no longer needed for training. Impressive looking results can be obtained with little effort - but without anchoring results to data, the actual performance is unknown.
Creating the evaluation methodology for a generative model is much harder because the potential output space is tremendous. In addition, evaluation of output is much more subjective. These problems can be addressed, for example, using techniques such as “LLM as a judge”, but the evaluation becomes a demanding task - as the evaluator itself becomes a component to build, tune and maintain.
The need to continuously monitor quality requires the creation of complex data pipelines to sample model outputs from a production system, sending these to a specialized evaluation system, and tracking the scores. The method for sampling data - such as the data distribution desired, the adjustment of the sampling frequency so that it is appropriate (sampling and measuring quality often on a large amount of data may give more confidence in the quality, but would be expensive), considerations around protecting data in transit and retaining it based on policies all add to the complexity. In addition, the instructions for scoring, interpretation of the scores, type of LLM used for scoring, type of input data and so on can all change often.
A fantastic-looking prototype or demo running on a laptop is very promising, but it is the first step in a long journey that allows you to be assured of the quality of the models’ outputs in a production system.
In effect, the real work of preparing and annotating data to create a reliable evaluation takes place after the development of the initial prompt that shows promising results.
- Nirmal Mukhi, VP AI Engineering
Safety considerations
The generality and power of LLMs is a double-edged sword. Ultimately, they are models that predict and generate the text that, based on the huge amount of pre-trained data they have been exposed to, best matches the context and instructions they are given - they will “do their best” to produce an answer. But they don’t have any in-built notion of confidence or accuracy.
They can hallucinate, that is, generate outputs that make completely baseless claims not supported by the provided context. Some hallucinations could be harmless, while others could have detrimental consequences and can risk damaging the brand.
Hallucinations can be minimized with clever system design and by following the best practices in prompting (instructions to the LLM). But the chances of hallucinating are always non-zero; it is a fundamental limitation of how LLMs work. Hence, it becomes imperative to monitor them, and continuously attempt to improve the hallucination prevention approach, whatever that may be.
Beyond hallucinations, since LLMs can do so many things, getting them to “stay on script” when solving a problem can be challenging. An LLM being used to answer questions about insurance policies could also answer questions about Newton’s laws - and constraining it to a domain is akin to teaching an elephant to balance on a tightrope. It’s trying to limit an immensely powerful tool to narrow its focus to one problem, such as the customer query that they need to resolve right in front of them.
One solution to these problems is to fine-tune an LLM so that it is “better behaved” at solving a specific problem. The process of fine-tuning involves collecting high-quality data that allows the model to be trained to follow specific behaviors, following techniques such as few-shot learning, or Reinforcement Learning with Human Feedback (RLHF). Doing so however takes specialized skills: even assembling an appropriate dataset can be challenging; ideally, fine-tuning would be a periodic if not continuous process - and this is difficult to achieve. Thus, managing a large number of fine-tuned models and keeping them up to date will require specialized skills, expertise, and resources.
Inherent Limitations
While LLMs can do amazing things, they cannot do everything. Their reasoning and math capabilities can fail. Even a task like text matching can be hard to get right with an LLM (or at any rate can be much more easily and reliably solved with a regular expression matcher). So they aren’t always the right tool to use, even for NLP problems. These issues are inherent to how LLMs work. They are probabilistic systems that are optimized for generating a sequence of words, and are thus ill-suited for tasks that are completely deterministic.
Sophisticated prompting techniques can be designed to work around some of these limitations. Another approach is to attempt to solve a multi-step problem that involves complex reasoning using agentic orchestration. A different approach would allow for a model pipeline where tasks suited to being solved using LLMs are routed to LLMs, while tasks suited to being solved via a deterministic system or regular expression matcher are routed to other models. Identifying and designing for these situations is a requirement to support a diversity of use cases.
Managing and running production systems
While vendors like OpenAI and Anthropic have made LLMs relatively cheap to run, they still use complex serving architecture and hardware. Many such LLM hosting platforms are still in beta, and typical service level agreements (SLAs), when supported, promise far below 99.99% availability (rarely guaranteed). This risk can be managed by adding fallbacks and other mechanisms, and represents additional effort to build a production system.
And, in the end, building on LLMs is all software development and needs to follow standard software development processes. An LLM or even just a prompt is an artifact that has to be authored, evaluated, versioned, and released carefully just like any software artifact or ML model. You need observability, the ability to conduct experiments, auditability and so on. While this is all fairly standard, the discipline of MLOps introduces an additional layer of complexity because of the need to continuously monitor and tune for safety (including security concerns like prompt injection) and hallucinations. Additional resources need to be made available to handle such tasks.
A contact center example
Consider the problem of offering conversation summarization for all your conversations with customers, across voice and digital channels. The NLP problem to be solved here, that of abstractive summarization on multi-participant conversations with a potentially large number of turns, was difficult to solve as recently as 2021. Now, it is trivial to address using LLMs - writing a suitable prompt for an LLM to produce a summary that superficially looks high quality is easy.
But how good would the generated summaries be at scale as an enterprise solution? It is possible that the LLM might hallucinate and generate summaries that could include completely fictitious information that’s not easily discerned from the conversation itself. But would that happen 1% of the time, or 0.001% of the time…and how harmful would those hallucinations be? Adding the actual number of interactions into consideration, a 1% rate could mean 1 customer interaction out of 100, but as you scale up the interactions you could suddenly be faced with 1000 customers out of 100,000 interactions. Evaluating the quality of the prompt, and optimizing it, would require the creation of an evaluation dataset. Detecting hallucinations and classifying them into different categories so that they can be mitigated would take extra effort - and preventing or at least minimizing the chances that they occur, even more so.
Summarization could be solved using just about any good LLM - but which one would provide the best balance between cost and quality? That’s not a simple question to answer - it would require the availability of an evaluation dataset, followed by a lot of trials with different models and applications of prompt optimization.
A one-size-fits-all summary output rarely meets the needs of an enterprise. For example, the line of business responsible for handling customer care may want the summary to explicitly include details on any promised follow-up in the case of a complaint. The line of business responsible for sales may want to include information about promotions offered or whether the customer purchased the product or service, and so on. Serving these requirements would mean managing intricate configurations down to having different prompts to serve different use cases, or ideally fine-tuned LLMs that better serve the specific needs of these businesses. These requirements may change often, and so would need for versioning, change management and careful rollouts.
Summary quality would need to be monitored, and as changes in technology (such as improvements in models, inference servers or hardware) occur, things would need to be updated. Consider the availability of a new LLM that is launched with a lot of buzz in the industry. It would have to be evaluated to determine its effectiveness at summarization of the sort you are doing - which would mean updating the various prompts underlying the system, and checking this model’s output against the evaluation dataset, itself compiled from samples of data from a variety of use cases. Let’s say that it appears to produce higher quality summaries at a lower price on the evaluation dataset, and a decision is taken to roll it into production. This would have to be monitored and the expected boost in performance and reduction in price verified. In case something does go wrong (say it is cheaper and better…but takes unacceptably long and customers complain about the latency of producing summaries), it would need to be rolled back.
What about feedback loops? Perhaps summaries could be rated, or they could be edited. Edited summaries or highly rated ones could be used to fine tune a model to improve performance or lower cost by moving to a smaller, fine-tuned model.
This is not an exhaustive list of considerations - and this example is only about summarization, which is a ubiquitous, commodity capability in the LLM world. More complex use cases requiring agentic orchestration with far more sophisticated prompting techniques require more thought and effort to deploy responsibly.
Conclusions
Pre-trained foundational large language models have changed the paradigm of how ML solutions are built. But, as always, there’s no free lunch. Enterprises attempting to build from scratch using LLMs have to account for the total lifetime cost of maintaining such a solution from a business standpoint.
There is a point early in the deployment of an LLM-based solution where things look great - hallucinations haven’t been noticed, edge cases haven’t been encountered, and the value being derived is significant. However, this can lead to a false confidence, and under-investing at this stage is a dangerous fallacy. Without sufficient continued investment, the risk of having this solution in production without the necessary fallbacks, safeguards and flexibility will be an ever-present non-linear risk. Going beyond prototyping is therefore harder than it might seem.
An apt analogy is architecting a data center. Purchasing commodity hardware, and using open source software for running servers, managing clusters, setting up virtualization, and so on are all possible. But putting that package together and maintaining it for the long haul is enough of a burden that enterprises would prefer to use public cloud providers in most cases.
Similarly, when choosing whether to build AI solutions or partner with vendors who have the experience deploying enterprise solutions, organizations should be clear-eyed about the choices they are making, understand the long-term costs and the tradeoff associated with them.
Want more? Explore how to identify the ideal use cases for a generative AI agent
Figuring out how to get started with a generative AI agent might seem like a daunting task. But it doesn’t have to be. Ensuring a successful deployment depends on choosing the best use cases to target first. We make it simple with a reliable, data-driven approach to identify interaction types with high automation potential.
Download our ebook on identifying use cases to get the details on how our proven process works. You’ll learn how we identify which interactions have the best automation potential, and how you can get started without overhauling your knowledge base.
Getting through the last mile of generative AI agent development
There’s been a huge explosion in large language models (LLMs) over the past two years. It can be hard to keep up – much less figure out where the real value is happening. The performance of these LLMs is impressive, but how do they deliver consistent and reliable business results?
OpenAI demonstrated that very, very large models, trained on very large amounts of data, can be surprisingly useful. Since then, there’s been a lot of innovation in the commercial and open-source spaces; it seems like every other day there’s a new model that beats others on public benchmarks.
These days, most of the innovation in LLMs isn’t even really coming from the language part. It’s coming in three different places:
- Tool use: That’s the ability of the LLM to basically call functions. There’s a good argument to be made that tool use defines intelligence in humans and animals, so it’s pretty important.
- Task-specific innovations: This refers to innovations that deliver considerable improvements in some kind of very narrow domain, such as answering binary questions or summarizing research data.
- Multimodality: This is the ability of models to incorporate images, audio, and other kinds of non-text data as inputs.
New capabilities – and challenges
Two really exciting things emerge out of these innovations. First, it’s much easier to create prototypes of new solutions – instead of needing to collect data to make a Machine Learning (ML) model, you can just write a specification, or in other words, a prompt. Many solution providers are jumping on that quick prototyping process to roll out new features, which simply “wrap” a LLM and a prompt. Because the process is so fast and inexpensive, these so-called “feature factories” can create a bunch of new features and then see what sticks.
Second, making LLMs useful in real time relies on the LLM not using its “intrinsic knowledge” – that is, what it learned during training. It is more valuable instead to feed it contextually relevant data, and then have it produce a response based on that data. This is commonly called retrieval augmented generation, or RAG. So as a result, there are many companies making it easier to put your data inside the LLM – connecting it to search engines, databases, and more.
The thing about these rapidly developed capabilities is that they always place the burden of making the technology work on you and your organization. You need to evaluate if the LLM-based feature works for your business. You need to determine if the RAG-type solution solves a problem you have. And you need to figure out how to drive adoption. How do you then evaluate the performance of those things? And how many edge cases do you have to test to make sure it is dependable?
This “last mile” in the AI solution development and deployment process costs time and resources. It also requires specific expertise to get it right.
Getting over the finish line with a generative AI agent
High-quality LLMs are widely available. Their performance is dramatically improved by RAG. And it’s easier than ever to spin up new prototypes. That still doesn’t make it easy to develop a customer-facing generative AI solution that delivers reliable value. Enabling all of this new technology - namely, LLMs capable of using contextually relevant tools at their disposal - requires expertise to make sure that it works, and that it doesn’t cause more problems than it solves.
It takes a deep understanding of the performance of each system, how customers interact with the system, and how to reliably configure and customize the solution for your business. This expertise is where the real value comes from.
Plenty of solution providers can stand up an AI agent that uses the best LLMs enhanced with RAG to answer customers’ questions. But not all of them cover that last mile to make everything work for contact center purposes, and to work well, such that you can confidently deploy it to your customers, without worrying about your AI agent mishandling queries and causing customer frustration.
Generative AI services provided by the major public cloud providers can offer foundational capabilities. And feature factories churn out a lot of products. But neither one gets you across the finish line with a generative AI agent you can trust. Developing solutions that add value requires investing in input safety, output safety, appropriate language, task adherence, solution effectiveness, continued monitoring and refining, and more. And it takes significant technical sophistication to optimize the system to work for real consumers.
That should narrow your list of viable vendors for generative AI agents to those that don’t abandon you in the last mile.
Want more? Explore how we help you identify the ideal use cases for a generative AI agent
Figuring out how to get started with a generative AI agent might seem like a daunting task. But it doesn’t have to be. Ensuring a successful deployment depends on choosing the best use cases to target first. We make it simple with a reliable, data-driven approach to identify interaction types with high automation potential.
Download our ebook on identifying use cases to get the details on how our proven process works. You’ll learn how we identify which interactions have the best automation potential, and how you can get started without overhauling your knowledge base.
A new era of unprecedented capacity in the contact center
Ever heard the phrase, "Customer service is broken?"
It's melodramatic, right? —something a Southern lawyer might declaim with a raised finger. Regardless, there’s some truth to it, and the reason is a deadly combination of interaction volume and staffing issues. Too many inbound interactions, too few people to handle them. The demands of scale do, in fact, break customer service.
This challenge of scaling up is a natural phenomenon. You find it everywhere, from customer service to pizza parlors.
Too much appetite, too little dough
If you want to scale a pizza, you have to stretch the dough, but you can't stretch it infinitely. There’s a limit. Stretch it too far, and it breaks.
Customer service isn't exactly physical, but physical beings deliver it— the kind who have bad days, sickness, and fatigue. When you stretch physical things too far (like balloons, hamstrings, or contact center agents), they break. In contact centers, broken agents lead to broken customer service.
Contact centers are currently stretched pretty thin. Sixty-three percent of them face staffing shortages. Why are they struggling? Some cite rising churn rates year after year. Others note shrinking agent workforces in North America and Europe. While workers flee agent jobs for coding classes, pastry school, and duck farming, customer request volumes are up. In 2022, McKinsey reported that 61% of customer care leaders claimed a growth in total calls.
To put it in pizza terms (because why not?), your agent dough ball is shrinking even as your customers' insatiable pizza appetite expands.
What’s a contact center to do? There are two predominant strategies right now:
- reduce request volumes (shrink the appetite)
- stretch your contact center’s service capacity (expand the dough)
Contact centers seem intent on squeezing more out of their digital self-service capabilities in an attempt to contain interactions and shrink agent queues. At the same time, they’re feverishly investing in technology to expand capacity with performance management, process automation and real-time agent support.
But even with both of these strategies working at full force, contact centers are struggling to keep up. Interaction volume continues to increase, while agent turnover carries on unabated. Too much appetite. Not enough dough to go around.
How do we make more dough?
Here’s the harsh reality – interaction volume isn’t going to slow down. Customers will always need support and service, and traditional self-service options can’t handle the scope and complexity of their needs. We’ll never reduce the appetite for customer service enough to solve the problem.
We need more dough. And that means we need to understand the recipe for customer service and find a way to scale it. The recipe is no secret. It’s what your best agents do every day:
- Listen to the customer
- Understand their needs
- Propose helpful solutions
- Take action to resolve the issue
The real question is, how do we scale the recipe up when staffing is already a serious challenge?
Scaling up the recipe for customer service
We need to scale up capacity in the contact center without scaling up the workforce. Until recently, that idea was little more than a pipe dream. But the emergence of generative AI agents has created new opportunities to solve the long-running problem of agent attrition and its impact on CX.
Generative AI agents are a perfect match for the task. Like your best human agents, they can and should listen, understand, propose solutions, and take action to resolve customers’ issues. When you combine these foundational capabilities into a generative AI agent to automate customer interactions, you expand your contact center’s capacity – without having to hire additional agents.
Here’s how generative AI tools can and should help you scale up the recipe for customer service:
- Generative AI should listen to the customer
Great customer service starts with listening. Your best agents engage in active listening to ensure that they take in every word the customer is saying. Transcription solutions powered by generative AI should do the same. The most advanced solutions combine speed and exceptional accuracy to capture conversations in the moment, even in challenging acoustic environments.
- Generative AI should understand the customer’s needs
Your best agents figure out what the customer wants by listening and interpreting what the customer says. An insights and summarization solution powered by generative AI can also determine customer intent, needs, and sentiment. The best ones don’t wait until after the conversation to generate the summary and related data. They do it in real time.
- Generative AI should propose helpful solutions
With effective listening and understanding capabilities in place, generative AI can provide real-time contextual guidance for agents. Throughout a customer interaction, agents perform a wide range of tasks – listening to the customer, interpreting their needs, accessing relevant information, problem-solving, and crafting responses that move the interaction toward resolution. It’s a lot to juggle. Generative AI that proposes helpful solutions at the right time can ease both the cognitive load and manual typing burden on agents, allowing them to focus more effectively on the customer.
- Generative AI should take action to resolve customers’ issues
This is where generative AI combines all of its capabilities to improve customer service. It can integrate the ingredients of customer care—listening, understanding, and proposing—to safely and autonomously act on complex customer interactions. More than a conversational bot, it can resolve customers’ issues by proposing and executing the right actions, and autonomously determining which backend systems to use to retrieve information and securely perform issue-resolving tasks.
Service with a stretch: Expanding your ball of dough
Many contact centers are already using generative AI to listen, understand, and propose. But it’s generative AI’s ability to take action based on those other qualities that dramatically stretches contact center capacity (without breaking your agents).
A growing number of brands have already rolled out fully capable generative AI agents that handle Tier 1 customer interactions autonomously from start to finish. That does more than augment your agents’ capabilities or increase efficiency in your workflows. It expands your frontline team without the endless drain of recruiting, onboarding, and training caused by high agent turnover.
A single generative AI agent can handle multiple interactions at the same time. And when paired with a human agent who provides direct oversight, a generative AI agent can achieve one-to-many concurrency even with voice interactions. So when inbound volume spikes, your generative AI agent scales up to handle it.
More dough. More capacity. All without stretching your employees to the breaking point. For contact center leaders, that really might be as good as pizza for life.
Want more? Read our eBook on the impact of agent churn
Agent churn rates are historically high, and the problem persists no matter what we throw at it — greater schedule flexibility, gamified performance dashboards, and even higher pay.
Instead of incremental changes to timeworn tools, what if we could bypass the problem altogether?
Download the Agent Churn: Go Through It or Around It? eBook to learn why traditional strategies for agent retention aren't working, and how generative AI enables a radical new paradigm.
Not all contact center automation is the same
At ASAPP we develop AI models to improve agent performance in the contact center. Many of these models directly assist contact center agents by automating parts of their workflow. For example, the automated responses generated by AutoCompose - part of our ASAPPMessaging platform - suggest to an agent what to say at a given point during a customer conversation. Agents often use our suggestions by clicking and sending them.
How we measure performance matters
While usage of the suggestions is a great indicator of whether the agents like the features, we’re even more interested in the impact the automation has on performance metrics like agent handle time, concurrency, and throughput. These metrics are ultimately how we measure agent performance when evaluating the impact of a product like the AutoCompose capabilities of ASAPPMessaging, but these metrics can be affected by things beyond AutoCompose usage, like changes in customer intents or poorly-planned workforce management.
To isolate the impact of AutoCompose usage on agent efficiency, we prefer to measure the specific performance gains from each individual usage of AutoCompose. We do this by measuring the impact of automated responses on agent response time, because response time is more invariant to intent shifts and organizational effects than handle time, concurrency and throughput.
By doing this, we can further analyze:
- The types of agent utterances that are most often automated
- The impact of the automated responses when different types of messages are used (in terms of time savings)
Altogether, this enables us to be data-driven about how we improve models and develop new features to have maximum impact.
Going beyond greeting and closing messages
When we train AI models to automate responses for agents, the models look for patterns in the data that can predict what to say next based on past conversation language. So the easiest things for models to learn well are the types of messages that occur often and without much variation across different types of conversations, e.g. greetings and closings. Agents typically greet and end a conversation with a customer the same way, perhaps with some specificity based on the customer’s intent.
Most AI-driven automated response products will correctly suggest greeting and closing messages at the correct time in the conversation. This typically accounts for the first 10-20% of automated response usage rates. But when we evaluate the impact of automating those types of messages, we see that it’s minimal.
To understand this, let’s look at how we measure impact. We compare agents’ response times when using automated responses against their response times when not using automated responses. The difference in time is the impact—it’s the time savings we can credit to the automation.
Without automation, agents are not manually typing greeting and closing messages for every conversation. Rather they’re copying and pasting from notepad or word documents containing their favorite messages. Agents are effective at this because they do it several times per conversation. They know exactly where their favorite messages are located, and they can quickly copy and paste them into their chat window. Each greeting or closing message might take an agent 2 seconds. When we automate those types of messages, all we are actually automating is the 2-second copy/paste. So when we see automation rates of 10-20%, we are likely only seeing a minimal impact on agent performance.
The impact lies in automating the middle of the conversation.
If automating the beginnings and endings of conversations is not that impactful, what is?
Automating the middle of the conversation is where response times are naturally slowest and where automation can yield the most agent performance impact.
- Heather Reed, Product Manager, ASAPP
The agent may not know exactly what to say next, requiring time to think or look up the right answers. It’s unlikely that the agent has a script readily available for copying or pasting. If they do, they are not nearly as efficient as they are with their frequently used greetings and closings.
Where it was easy for AI models to learn the beginnings and endings of conversations, because they most often occur the same way, the exact opposite is true of the middle parts of conversations. Often, this is where the most diversity in dialog occurs. Agents handle a variety of customer problems, and they solve them in a variety of ways. This results in extremely varied language throughout the middle parts of conversations, making it hard for AI models to predict what to say at the right time.
ASAPP’s research delivers the biggest improvements
Whole interaction models are exactly what the research team at ASAPP specializes in developing. And it’s the reason that AutoCompose is so effective. If we look at AutoCompose usage rates throughout a conversation, we see that while there is a higher usage at the beginnings and endings of conversations, AutoCompose still automates over half of agent responses in between.
The low response times in the middle of conversations are where we see the biggest improvements in agent response time. It’s also where the biggest opportunities for improvements are realized.
Whole interaction automation shows widespread improvements
ASAPP’s current automated response rate is about 80%. It has taken a lot of model improvements, new features, and user-tested designs to get there. But now agents effectively use our automated responses in the messaging platform to reduce handle times by 25%, enabling an additional 15% of agent concurrency, for a combined improvement in throughput of 53%. The AI model continues to get better with use, improving suggestions, and becoming more useful to agents and customers.
Want more? Learn to choose the right digital platform for your contact center
Customers expect a rich digital experience that’s efficient, convenient and easy – and most importantly, resolves their issues. To deliver the kind of experience customers expect and to reap the business benefits of digital CX, you have to get beyond the basics.
Download the eBook to learn how to choose a digital platform that will equip you to create digital experiences that satisfy your customers, boost efficiency in your contact center and lower your costs.
Ensuring generative AI agent success with human-in-the-loop
A bridge to the future
Talk of Generative AI transforming contact centers is everywhere. From boardrooms of the world’s largest enterprises to middle management and frontline agents, there’s a near-universal consensus: the impact will be massive. Yet, the path from today’s reality to that future vision is full of challenges.
The key to bringing customer-facing Gen AI to production today lies in a human-in-the-loop workflow, where AI agents consult human advisors whenever needed—just as frontline agents might seek guidance from a tier 2 agent or supervisor. This approach both unlocks benefit today while providing a self-learning mechanism driving greater automation in the future.
Challenges today
We've spilled our fair share of digital ink on how Generative AI marks a step change in automation and how it will revolutionize the contact center. But let's take a moment to acknowledge the key challenges that must be overcome to realize this vision.
- Safety: The stochastic nature of AI means it can "hallucinate"—producing incorrect or misleading outputs.
- Authority: Not every customer interaction is ready for full automation; certain decisions still require human judgment.
- Customer request: Regardless of how advanced an AI agent is, some customers will insist on speaking with a human agent.
- Access: Many systems used by agents today to resolve customer issues lack APIs and are thus inaccessible to Gen AI agents.
Safety: Constant vigilance
AI's ability to generate human-like responses makes it powerful, but this capability comes with risks. Gen AI systems are inherently probabilistic, meaning they can occasionally hallucinate—providing incorrect or misleading information. These mistakes, if unchecked, can erode customer trust at best and damage a brand at worst.
When evaluating a Gen AI system, the critical question to ask is not “will the system hallucinate?” It will. The critical question is, “will hallucinated output get sent to my customers?”
With GenerativeAgent every outgoing message is evaluated by an output safety module that, among other things, checks for hallucinations. If a possible hallucination is detected, the message is assigned to a human-in-the-loop advisor for review. This safeguard prevents the dissemination of erroneous information while allowing the virtual agent to maintain control of the interaction.
In systems without this kind of human oversight, mistakes could lead to escalations or, worse, unresolved customer issues, or worse yet, damaged customer relationships or brand. By integrating a human-in-the-loop workflow, businesses can mitigate the risks inherent to this technology while maximizing the business benefits it offers.
Authority: Responsible delegation
Deploying customer-facing Gen AI in a contact center doesn’t mean relinquishing all decision-making power to the AI. There are situations where human judgment is crucial, particularly when the stakes are high for both the business and the customer.
Take, for example, a customer requesting a payment extension or a final review of their loan application. These are pivotal decisions that involve weighing factors like risk, compliance, and customer loyalty. While a Gen AI agent can automate much of the interaction leading up to these moments, the final decision still requires human judgment.
With a human-in-the-loop framework, this is no longer a limitation. The AI agent can seamlessly consult an advisor for approval while maintaining control of the customer interaction, much like a Tier 1 agent consulting a supervisor. This approach provides flexibility—automating what can be automated while reserving human intervention for critical decisions. For the business, this boosts automation rates, reduces escalations, and empowers advisors to focus on high-impact judgments where their expertise is essential.
Customer request: Last ditch containment
No matter how advanced Gen AI virtual agents become, some customers will still ask to speak with a live agent—at least for now. Of all the challenges outlined, this one is the most intractable, driven by customer habits and expectations rather than technology or process.
Human-in-the-loop offers a smart alternative. Rather than instantly escalating to a live agent when requested, the Gen AI agent could inform the customer of the wait time and propose a review by a human advisor in the meantime. If the advisor can help the Gen AI agent resolve the issue quickly, the escalation can be avoided entirely—delivering a faster resolution without disrupting the flow of the interaction.
While not every customer will accept this deflection, it can significantly reduce this type of escalation and address one of the toughest barriers to full automation.
Access: Ready for today, better tomorrow
Many systems that agents rely on today lack APIs, putting a large volume of customer issues beyond the reach of automation.
Rather than dismissing these cases, a human-in-the-loop advisor can bridge this gap—handling tasks on behalf of the Gen AI agent in systems that lack APIs.
The value extends beyond expanding automation. It highlights areas where developing APIs or streamlining access to legacy systems could deliver significant ROI. What was once hard to justify becomes clear: introducing new APIs directly reduces the advisor hours spent compensating for their absence.
Conclusion
Generative AI is poised to reshape the contact center, but unlocking its full potential requires a tightly integrated human-in-the-loop workflow. This framework ensures that customer-facing Gen AI is ready for production today while providing a self-learning mechanism driving to greater automation gains tomorrow.
While AI technology continues to advance, the role of human advisors is not diminishing—it’s evolving. By embracing this collaboration, businesses can strike the right balance between automation and human judgment, leading to more efficient operations and better customer experiences.
In future posts on human-in-the-loop, we’ll dive deeper into how to enhance collaboration between AI and human agents, and explore the evolving role of humans in AI-driven contact centers.
Want more? Discover how GenerativeAgent ends bad customer service - for good
The modern contact center is 40 years old. That's decades of bad customer service. Customer happiness is at a 17-year low, agents hate their jobs, and companies are investing billions to run their contact centers. It's time for a change.
Watch the GenerativeAgent launch video to see how it handles complex customer queries with human-in-the-loop escalation when necessary, all while maintaining seamless interactions with the customer.
AI security and AI safety: Navigating the landscape for trustworthy generative AI
AI security & AI safety
In the rapidly evolving landscape of generative AI, the terms "security" and "safety" often crop up. While they might sound synonymous, they represent two distinct aspects of AI that demand attention for a comprehensive and trustworthy AI system. Let's dive into these concepts and explore how they shape the development and deployment of generative AI, using real-world examples from contact centers to make sense of these crucial elements. To start, here is a quick overview video on AI security and AI safety:
AI security: The shield against malicious threats
When we think about AI security, it's crucial to differentiate between novel AI-specific risks and security risks that are common across all types of applications, not just AI.
The reality is that over 90% of AI security efforts are dedicated to addressing critical basics and foundational security controls. These include data protection, encryption, data retention, PII redaction, authorization, and secure APIs. It’s important to understand that while novel AI-specific threats like prompt injection - where a malicious actor manipulates input to retrieve unauthorized data or inject system commands - do exist, they represent a smaller portion of the overall security landscape.
Let's consider a contact center chatbot powered by AI. A user might attempt to embed harmful scripts within their query, aiming to manipulate the AI into disclosing sensitive customer information, like social security numbers or credit card details. While this novel threat is significant, the primary defense lies in robust foundational security measures. These include input validation, strong data protection, employing encryption for sensitive information, and implementing strict authorization and data access controls.
Secure API access is another essential cornerstone. Ensuring that all API endpoints are authenticated and authorized prevents unauthorized access and data breaches. In addition to these basics, implementing multiple layers of defense helps mitigate novel threats. Input safety mechanisms can detect and block exploit attempts, preventing abuse like prompt leaks and code injections. Advanced Web Application Firewalls (WAFs) also play a vital role in defending against injection attacks, similar to defending against common application threats like SQL injection.
Continuous monitoring and logging of all interactions with the AI system is very important in detecting any suspicious activities. For example, an alert system can flag unusual API access patterns or data requests by an AI system, enabling rapid response to potential threats. Furthermore, a solid incident response plan is indispensable. It allows the security team to swiftly contain and mitigate the impact of any security events or breaches.
So while novel AI-specific risks do pose a threat, the lion’s share of AI security focuses on foundational security measures that are universal across all applications. By getting the basics right we build a robust shield around our AI systems, ensuring they remain resilient against both traditional and emerging threats.
AI safety: The guardrails for ethical and reliable AI
While AI security acts as a shield, AI safety functions like guardrails, ensuring the AI operates ethically and reliably. This involves measures to prevent unintended harm, ensure fairness, and adhere to ethical guidelines.
Imagine a scenario where an AI Agent in a contact center is tasked with prioritizing customer support tickets. Without proper safety measures, the AI could inadvertently favor tickets from specific types of customers, perhaps due to biased training data that inadvertently emphasizes certain demographics or issues. This could result in longer wait times and dissatisfaction for overlooked customers. To combat this, organizations should implement bias mitigation techniques, such as diverse training datasets. Regular audits and red teaming are essential to identify and rectify any inherent biases, promoting fair and just AI outputs. Establishing and adhering to ethical guidelines further ensures that the AI does not produce unfair or misleading prioritization decisions.
An important aspect of AI safety is addressing AI hallucinations, where the AI generates outputs that aren't grounded in reality or intended context. This can result in the AI fabricating information or providing incorrect responses. For instance, a customer service AI Agent might confidently present incorrect policy details if it isn't properly trained and grounded. Output safety layers and content filters play a crucial role here, monitoring outputs to catch and block any harmful or inappropriate content.
Implementing a human-in-the-loop process adds another layer of protection. Human operators can be called on to intervene when necessary, ensuring critical decisions are accurate and ethical. For example, contact center human agents can be the final step of authorization before performing a critical task, or providing additional insight when the AI system produces incorrect output or does not have enough information to support a user.
The intersection of security and safety
Though AI security and AI safety address different aspects of AI operation, they often overlap. A breach in AI security can lead to safety concerns if malicious actors manage to manipulate the AI's outputs. Conversely, inadequate safety measures can expose the system to security threats by allowing the AI to make incorrect or dangerous decisions.
Consider a scenario where a breach allows unauthorized access to the contact center’s AI system. The attackers could manipulate the AI to route calls improperly, causing delays and customer frustration. Conversely, if the AI's safety protocols are weak, it might inaccurately redirect emergency calls to non-critical queues, posing serious risks. Therefore, a balanced approach that addresses both security and safety is essential for developing a trustworthy generative AI solution.
Balanced approach for trustworthy AI
Understanding the distinction between AI security and AI safety is pivotal for building robust AI systems. Security measures protect the AI system from external threats, ensuring the integrity, confidentiality, and availability of data. Meanwhile, safety measures ensure that the AI operates ethically, producing accurate outputs.
By focusing on both security and safety, organizations can mitigate risks, enhance user trust, and responsibly unlock the full potential of generative AI. This dual focus ensures not only the operational integrity of AI systems but also their ethical and fair use, paving the road for a future where AI technologies are secure, reliable, and trustworthy.
Want more? Learn what to watch out for before trusting an AI vendor with your data and your brand
Alisher Yuldashev, ASAPP expert in AI safety and trust, will walk you through the key considerations for eliminating unreliable providers and identifying the trustworthy partners. You’ll discover how a trustworthy provider should manage your data (including usage in ML training), have AI-specific safety measures, manage model development with robust quality assurance processes, and much more.
Three reasons you're not seeing the value you were promised from your digital chat platform.
Are you not getting the value that you were hoping for from your digital chat platform?
If so, this blog and our recent on-demand webinar are for you.
We recently had a great webinar featuring Melissa Price, VP, CX Digital Self-Service at Altice USA and the experts at ASAPP. In it, we talked about what is holding people back from seeing the value they want out of their existing digital chat platform, what they should look for in a new one, and how they can make that migration as painless as possible.
The conversation is ripe with solutions for some of the widespread pain points CX leaders are experiencing with their digital chat platforms.
Here are some key takeaways from the webinar:
3 Underlying Challenges Holding Platforms Back
In the webinar, we discussed three critical issues that often prevent customer engagement platforms from succeeding:
- Fragmented Customer Journeys: Disconnected touchpoints lead to inconsistent customer experiences. Automation on many platforms is not customer-friendly or efficient, often leading to frustration rather than satisfaction.
- Lack of Adaptability: Many platforms suffer from a lack of ongoing innovation and adaptability. Without continuous improvement loops and the support to keep up with evolving needs, platforms become outdated quickly and fail to deliver the optimized experiences necessary for customer retention and satisfaction. This static approach that many platforms offer contrasts sharply with the dynamic nature of customer service needs.
- Limited Effectiveness of Bolted-on AI: Many legacy chat providers rely on bolted-on AI solutions rather than being AI-native, resulting in subpar suggested responses, ultimately failing to deliver the additional efficiencies, predictive insights, and automation that AI should provide.
How Modern Chat Platforms, Like ASAPP, Address These Problems
Moder, AI-Native® digital chat platforms, like ASAPPMessaging tackles these challenges head-on:
- Seamless customer experience Integration: ASAPP Messaging unifies customer interactions, ensuring a cohesive journey.
- Continuously improving AI-Enhanced Workflows: By automating routine tasks, AI frees agents to focus on complex issues, enhancing both efficiency and satisfaction. The AI continually learns and adapts with use, leading to smarter workflows and ongoing improvements.
- Advanced AI Capabilities: The platform’s AI not only predicts customer needs but also provides real-time support to agents, enhancing decision-making.
What to Look for in a Digital Chat Platform
Choosing the right digital chat platform can significantly impact your operational efficiency and customer satisfaction. Here are some key factors to consider:
- Driving Digital Adoption:
- Encourage Digital Shift: Prioritize a platform that promotes digital engagement over traditional voice channels, which are typically more expensive.
- Cost Efficiency: Digital interactions are far more cost-effective than voice calls.
- Reliability: Ensure the platform has a strong track record with minimal outages.
- Customer Behavior Insights: Research the factors that drive customers to choose digital channels over voice to tailor your strategy accordingly.
- User Experience for Agents and Customers:
- Customer-Focused Design: A platform with excellent user experience (UX) will make customers more aware of and inclined to use chat services.
- Agent Empowerment: Good UX for agents is essential to drive adoption and enhance agent productivity through automation.
- Service-Oriented: The technology should be built to serve both agents and customers, focusing on empowerment rather than control.
- AI-Native and Continuous Innovation:
- Harnessing AI: Choose a platform that is AI-native and continuously innovates, especially given the current surge in AI advancements.
- Custom Insights with GenAI: Platforms that leverage Generative AI can provide customized insights and maintain a competitive edge.
- Experimentation and Adaptability: Opt for providers who constantly experiment with both internal and leading external AI models, rather than relying on standard models driven by financial incentives.
Real Results with Altice
Melissa Price, VP of Customer Experience for Digital Self Serve at Altice, shared compelling results achieved through the implementation of ASAPP’s digital chat platform. She highlighted results, including:
- Increased Efficiency: Significant reduction in handling times and increased resolution rates.
- Enhanced Customer Satisfaction: Improved experiences leading to higher customer satisfaction scores.
- Value of AI-Native Solutions: The AI-native approach of ASAPP was crucial in achieving these outcomes, providing deep insights and automation that traditional platforms couldn’t match.
Switching Platforms
There are short term challenges of switching platforms. But, the long-term benefits—including enhanced operational efficiencies, better customer experiences, and staying ahead of technological advancements—far outweigh the initial transition hurdles.
How to make Migration as painless as possible?
To ensure a smooth migration to a new digital chat platform, start by focusing on thorough planning, selecting the right platform with excellent UX and reliability, and leveraging AI for continuous innovation. Engage stakeholders, provide comprehensive training, and implement in phases to minimize disruptions. For a more comprehensive guide, watch the on-demand recording.
Watch the Webinar On-Demand
Curious about how Altice and ASAPP are transforming customer engagement? Watch our on-demand webinar to gain valuable insights into their groundbreaking strategies and real-world successes. It’s a must-watch if your current digital chat platform isn't meeting your expectations.
Best-of-Breed vs. Omnichannel/CCaaS for the Digital Contact Center
Choosing the right CX software can be difficult. Learn how to think about choosing between a best-of-breed digital channel solution vs. an omnichannel solution.
Choosing the right CX software can be difficult. In this blog series, our goal is to give you helpful frameworks through which to think about how to purchase the right CX software to help your contact center improve customer experience, augment agent productivity, and reduce costs. (For a thorough breakdown of how to choose the right Transcription and Summarization Solutions, read our buyer’s guide here.)
This blog post covers how to think about acquiring the right Digital Customer Interaction Solution (DCIS) for your contact center. Should you choose a best-of-breed solution, specializing in digital channels, or an omnichannel/CCaaS solution, which offers both voice and digital channels?
Let’s dig in.
The Problem with the “All-in-One Solution”
Omnichannel Solutions, also commonly referred to as Contact Center-as-a-Service Solutions, offer an appealing promise: meet all your CX software needs across channels from just one vendor. The only problem? It’s rare to find an omnichannel solution that offers excellent technology for both voice and digital channels.
Most omnichannel solutions were built to optimize for voice channels and tacked on digital support only as an afterthought. As a result, omnichannel solutions tend to deliver subpar digital experiences, such as poor agent UX not optimized for chat workflows, limited bots, and low chat agent usage, frustrating both customers and agents.
It’s difficult to encourage customers to transition to digital support when you don’t have the technology to make their digital experience as useful and delightful as possible.
Of course, the lack of a comprehensive digital solution has significant consequences for your contact center’s operations.
Digital interactions are 1.5x to 4x cheaper than voice interactions. Contact centers understandably want to shift their mix to increase digital support. But without a digital chat platform that can reliably satisfy customers, it can be difficult to change customer behavior and preferences.
In sum, omnichannel solutions present themselves as an all-around answer, but in reality, their digital support options are severely lacking. What they offer in perceived convenience, they often trade in effectiveness.
Below, we break down how to think about best-of-breed vs. omnichannel solutions:
Interested in Learning More About Building a Digital-First Contact Center?
A digital-first contact center is essential to running a modern, efficient, and customer-friendly customer support operation.
Request a demo of ASAPP here to learn how ASAPP customers are:
- Automating Tier 1 interactions
- Experiencing happier, more productive agents
- Improving customer satisfaction
- Increasing revenue with contact center insights
- Drastically lowering operational costs
Conclusion
It’s natural to desire a single platform that can handle voice and digital contacts in one place. But there’s good reason to be skeptical of omni-channel platforms, too: by definition, omni-channel platforms lack specialization. It’s the classic jack of all trades, master of none problem.
Worse, because voice is often more profitable to omni-channel vendors than digital, companies that offer omnichannel platforms often invest the bulk of their development resources in voice over digital, allowing their digital technology to lag behind.
As more customers look to digital to interact with companies, it’s more important than ever that you get digital right. Consider choosing a vendor who specializes in digital channels rather than a broader omni-channel approach.
Who is ASAPP?
ASAPP is the AI-native® software for contact centers.
We help customer service leaders unlock their full value by minimizing costs & inefficiencies, improving agent compliance & productivity, and surfacing actionable insights while helping you deliver a great customer experience.
Our customers are large enterprises who care deeply about leveraging AI to transform CX by delivering unprecedented cost savings and maximizing customer delight.
What do we make?
We make a full suite of AI-native solutions designed specifically for the needs and nuances of the CX industry.
Why we are the right partner?
We strive to be the best technology partner you have ever had.
ASAPP is not new to the AI or the CX space. We have been building AI-native products for the contact center since 2014 and building our own LLMs since 2018. We invest heavily in our products and our workforce to bring our customers the best solutions on the market and the subject matter experts to ensure those customers are getting the maximum benefit.
We offer white-glove service and insight into contact centers’ best practices across industries, and our consultative nature drives transformative results. ASAPP is laser-focused on business outcomes, data usability, and helping you realize your desired customer experience.
If you are interested in how best-of-breed digital software for your contact center can benefit your organization, please click below to schedule a consultation:
Speak to an Expert
Omnichannel solutions present themselves as an all-around answer but in reality their digital support options are severely lacking. What they offer in convenience, they often trade in effectiveness.