Blog
The hidden ways AI-driven speech transcription and analytics improve CX performance
I have spent virtually all of my professional career working with CX and IT leaders at Fortune 500 companies. These companies invest millions of dollars annually in speech analytics tools that have maxed out the benefits they can deliver for quality management.
Many companies transcribe only a small fraction of their calls and score an even smaller number of those. Often the recordings are batched and stored for later analysis. Sifting through that data, pulling out relevant information, and coaching agents after the fact is cumbersome, incomplete, and moves the needle only incrementally.
When I was leading technology operations at contact centers, a consistent theme when re-evaluating our speech analytics toolset was “There are no new insights into what customers are contacting us about. We get reports—but there’s nothing really actionable—so nothing changes.”
We should be asking more from our speech analytics and transcription technology.
Now we can. Advancements in artificial intelligence (AI) raise the bar for what we should expect from our sizable annual investment in this technology.

AI-driven transcription and analysis of every call helps you optimize performance—and gives you a wealth of customer insight.
- Chris Arnold, VP, Customer Experience Strategy, ASAPP
AI quietly powers transcription and speech analytics in real time and lets us use the results in hidden ways we may not have even realized were possible. Examples include:
- Empower agents with coaching support and tools to resolve issues faster
- Get voice of the customer (VoC) insight from 100% of calls
- Analyze customer sentiment in real-time and use machine learning to predict customer satisfaction scores as a call transpires
Supercharge your agents using real-time transcription
Real-time transcription can serve as fuel for your voice agents and accelerate CX performance. Since the majority of customers still contact companies by phone—and voice is the most costly channel—isn’t this where we should be focused?
While it is important to optimize your operations across both live-agent and self-serve digital channels, phone calls will continue. Let’s use AI to supercharge our agents and make high-volume voice queues perform as well as possible.
Real-time transcription paired with real-time AI-driven analysis makes it possible to prompt agents with suggested responses and actions based on machine learning. Additionally, real-time transcription enables automation of thousands of micro-processes and routine tasks, like call summaries.
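To make the mechanics concrete, here is a minimal sketch (not ASAPP's implementation) of how a streaming transcript could drive agent prompts: transcript segments arrive from the ASR stream, an intent model scores the conversation so far, and a suggested action is surfaced on the agent desktop. The `classify_intent` and `show_prompt` callables and the action table are hypothetical placeholders.

```python
# Hypothetical sketch of real-time agent augmentation driven by streaming
# transcription. classify_intent and show_prompt are stand-ins for a real
# intent model and the agent desktop; they are not ASAPP APIs.

SUGGESTED_ACTIONS = {
    "billing_dispute": "Offer to walk through the last three invoices with the customer.",
    "cancel_service": "Acknowledge the frustration and present retention options.",
    "technical_issue": "Open the troubleshooting checklist for the reported device.",
}

def augment_agent(segments, classify_intent, show_prompt):
    """Prompt the agent as transcript segments arrive from the ASR stream."""
    transcript = []
    for segment in segments:                        # text chunks from streaming ASR
        transcript.append(segment)
        intent, confidence = classify_intent(" ".join(transcript))
        if confidence > 0.8 and intent in SUGGESTED_ACTIONS:
            show_prompt(SUGGESTED_ACTIONS[intent])  # surface on the agent desktop
    return " ".join(transcript)                     # full transcript can feed the call summary
```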
One of the largest communications companies in North America uses AI to automate the dispositioning of notes at the end of agent calls and realized a 65%[1] reduction in handling time for that specific task. At ASAPP, we have seen large CX organizations that leverage this modernized approach to transcription at scale reduce their overall CX spend by 30%, which translates into hundreds of millions of dollars annually.
Fuel CX operations with voice of customer analysis for 100% of calls
In and of itself, transcription doesn’t make front-page news. Very often it’s an expensive component of contact center technology that isn’t providing a return on that investment. For instance, most companies transcribe only 10-20% of their calls due to cost, and as a result business decisions are made without data from 80-90% of customer interactions. That’s not even close to a complete representation of everything happening across the totality of their CX operations.
Today, it’s realistic to transcribe every word of every customer interaction. You can leverage AI to analyze those transcriptions and make real-time decisions that empower agents and improve customer experience in the moment. Highly accurate transcription, coupled with closed-loop machine learning, takes the customer experience to another level.
Predict CSAT/NPS with real-time customer sentiment analysis
Every CX leader strives to delight customers—and wants to know how they’re doing. Most use Customer Satisfaction (CSAT) or Net Promoter Scores (NPS) surveys to capture feedback. Yet average survey response rates are between 5% and 15%, depending on the industry. With machine learning, you can now use your transcriptions and speech analytics to predict the sentiment (CSAT or NPS) of every conversation. It’s the equivalent of having the silent 90% provide feedback for every interaction.
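As a rough illustration of the idea (a sketch, not ASAPP's model), you could train a text classifier on the minority of calls that actually returned a survey score, then apply it to every transcript. The library choice and toy data below are assumptions made for the example.

```python
# Minimal sketch: predict a CSAT label for every transcript using the small
# subset of calls that actually received a survey response as training data.
# Assumes scikit-learn; a production system would use far richer features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

surveyed_transcripts = [
    "thanks so much you fixed it right away",
    "this is the third time i have called and nothing is resolved",
]
surveyed_scores = ["satisfied", "dissatisfied"]   # labels from returned surveys

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(surveyed_transcripts, surveyed_scores)

# Score the large majority of calls that never returned a survey.
unsurveyed = ["my bill went up again and nobody can explain why"]
print(model.predict(unsurveyed))   # e.g. ['dissatisfied']
```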
Real-time analysis of transcription can discern intent and automatically categorize each customer’s reason for contacting your company. This gives you a deep understanding of exactly what customers are calling about—and how that compares over time. You can also apply real-time trend and anomaly detection to identify issues and quickly address them before they become catastrophic.
This real-time capture of the voice of the customer is massively valuable not just to contact center leaders, but to Product, Marketing, and Sales teams as well.
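For the trend and anomaly detection mentioned above, even a simple baseline conveys the idea: compare today's volume for each contact reason against its recent history and flag large deviations. This is an illustrative sketch, not ASAPP's method; the threshold, window, and data are made up.

```python
# Illustrative anomaly check on daily contact-reason counts: flag a reason
# when today's volume is far above its recent average (simple z-score).
from statistics import mean, stdev

def flag_anomalies(daily_counts, today, threshold=3.0):
    """daily_counts: {reason: [counts for prior days]}, today: {reason: count}."""
    alerts = []
    for reason, history in daily_counts.items():
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (today.get(reason, 0) - mu) / sigma > threshold:
            alerts.append(reason)
    return alerts

history = {"billing_dispute": [120, 130, 125, 118, 122], "outage_report": [15, 12, 18, 14, 16]}
print(flag_anomalies(history, {"billing_dispute": 128, "outage_report": 240}))
# ['outage_report'] — a spike worth investigating before it becomes catastrophic
```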
Conclusion: Let speech analytics lead the way
Artificial intelligence makes our transcription and speech analytics investments truly meaningful and allows us to make material improvements in CX operations.
If you don’t know the specific drivers behind the interaction metrics within your company, it’s hard to make anything other than incremental changes in your CX programs.
AI lets us analyze every detail from the tens of millions of interactions that occur every year. Not just the metrics—call duration, wait times, etc.—but the key drivers behind those metrics. What were the reasons for unexpectedly long handle times… were agents clicking around the knowledge management database trying to find answers? What about unbiased views of cancel rates… were they due to a product flaw or issues with customer service? Could better save approaches have been used? And what may have caused customer sentiment to shift during the interaction?
Imagine capturing 100% of every single customer interaction, whether voice or digital. Imagine having objective insight into drivers behind your contact center metrics. Imagine being able to do that in real-time.
You no longer have to only imagine. No more waiting around for partial transcripts and partial answers. No more manual, subjective scoring of a tiny sampling of your total interactions. The future is here, and we can find it in real-time, automated transcripts and speech analytics:
- Supercharge agents with real-time desktop intelligence
- Identify coaching needs in the moment—get it right for the customer the first time
- Predict CSAT and NPS on 100% of your interactions
- Gain real-time insights—at-a-glance understanding of why customers are calling
Real-time transcription, AI-driven analytics, and the ability to quickly act on insights can be your hidden weapon to accelerate transformational change in your contact center.
Collaboration in the digital age: Value at the intersection of people and machines
NLP/AI systems have yet to live up to their promise in customer service, in large part because the challenge has been defined as either full automation or failure to automate. These systems start from the outside, trying to ‘deflect’ as much traffic/call volume as they can and punting to live service reps when they fail. The result has been hundreds of millions of dollars spent on lowering the cost of customer contact—lots of ‘claimed success’ in terms of deflection rates—yet no real change in the cost of customer contact and no improvement in customer service. How could this be?
The very essence of ‘conversations’ cannot be replicated by a chatbot with a programmable set of rules: if the customer says this, the bot says that, and so on. That’s not how any but the most simplistic of conversations go. Conversations are inherently probabilistic—they involve turn-taking, which includes disambiguation, successive approximation, backing up and starting over, summarization, clarification, and so on.

The future of work will be built on an AI native platform that enables a powerful collaboration between people and machines.
Judith Spitz, PhD
The promise of conversational AI will be realized by a platform that has been designed—explicitly—to enable a collaborative conversation between service reps, AI-powered algorithms, and customers, where technology works in concert with an agent to:
- Automate parts of a conversation, and hand it over to an agent when needed
- Make suggestions to agents about what to say next
- Listen and learn from what your ‘best’ agents are not only saying to customers but ‘doing’ with your back-end systems
- Use machine learning to make all your agents as good as your best agent
- Enable the customer to gracefully transition a conversation between their channels or devices of choice without losing conversational integrity
The platform should allow your service reps themselves to demonstrate confidence in the auto-suggestions by selecting them with increasing frequency; the system can then use those ‘confidence levels’ to transition from a ‘suggested response’ to an automated response. Automating 50% of 100% of your call volume is a lot better than automating 100% of 10% of your call volume.
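A hypothetical sketch of that promotion logic: track how often agents accept each suggestion and allow automation only once acceptance is high enough over a sufficient sample. The class name and thresholds below are illustrative, not a real API.

```python
# Hypothetical sketch of promoting an auto-suggestion to full automation once
# agents have "voted" for it often enough. Thresholds are illustrative only.
from collections import defaultdict

class SuggestionTracker:
    def __init__(self, min_uses=200, min_acceptance=0.9):
        self.shown = defaultdict(int)       # times each suggestion was offered
        self.accepted = defaultdict(int)    # times agents actually used it
        self.min_uses = min_uses
        self.min_acceptance = min_acceptance

    def record(self, suggestion_id, was_accepted):
        self.shown[suggestion_id] += 1
        if was_accepted:
            self.accepted[suggestion_id] += 1

    def can_automate(self, suggestion_id):
        shown = self.shown[suggestion_id]
        if shown < self.min_uses:
            return False                    # not enough evidence yet
        return self.accepted[suggestion_id] / shown >= self.min_acceptance
```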
The key paradigm shift here is an AI platform that has been built natively—from the ground up—to enable and foster the kind of human-machine collaboration that will be ‘the future of work’—and NOT one that promotes a kind of ‘Frankenstein’ approach where AI components are bolted onto existing systems in the hope of transformational results.
The future of work will be built on an AI Native® platform that enables a powerful collaboration between people and machines. This is what ASAPP delivers.
Cutting through the complexity using AI
Many channels, one conversational thread. It's what consumers expect.
The consumer is the ultimate winner in the race for accuracy in speech recognition
There is broad interest in automatic speech recognition (ASR) for many uses. Thanks to the recent development of efficient training mechanisms and the availability of more computing power, deep neural networks have enabled ASR systems to perform astoundingly well across a range of application domains.
At ASAPP, our focus is on augmenting human performance with AI. Today we do that in large consumer contact centers, where our customers serve consumers over both voice and digital channels. ASR is the backbone that enables us to augment agents in real time throughout each customer interaction. We have built the highest-performing ASR system in the world, based on industry-standard benchmarks. We do this not only by leveraging technological advancements in deep learning, but also by applying our own innovation to analyze problems at different levels of detail.

At ASAPP we continuously push the limits of what’s possible by not only leveraging technological advances in deep learning, but by also innovating. We are always looking for new ways to analyze problems and explore practical solutions at different levels of detail.
Kyu Han, PhD
LibriSpeech, a speech corpus of 1,000 hours of transcribed audiobooks, has been the most widely used benchmark dataset for ASR research in both academia and industry since its introduction in 2015. Using this dataset, many prestigious research groups around the world—including ASAPP, Google Brain, and Facebook AI Research—have been testing their new ideas. And the past year has seen the most rapid advances yet in the race for better results on the LibriSpeech test set.
In early 2019, Google’s ASR system with a novel data augmentation method outperformed all previously existing ASR systems by a big margin, boasting a word error rate (WER) of 2.5% on the LibriSpeech test-clean set (shown in the figure below). Word error rate is the percentage of words an ASR system gets wrong, measured against a reference transcript of the given audio. Later the same year, ASAPP joined the race and gained immediate attention with a WER of 2.2%, beating the best-performing system at that time from Facebook. The lead, however, didn’t last long, as Google in 2020 announced a couple of new systems to reclaim the driver’s seat in the race, reaching a sub-2.0% WER for the first time. One week after Google’s announcement, ASAPP published a new paper highlighting a 1.75% WER (98.25% accuracy!) to regain the lead. ASAPP remains at the top of the leaderboard (as of September 2020).
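For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between the ASR hypothesis and the reference, divided by the number of reference words. A minimal implementation:

```python
# Word error rate (WER): edits needed to turn the ASR hypothesis into the
# reference transcript, divided by the reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("please reset my modem tonight", "please reset my model tonight"))  # 0.2
```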

The race will continue, and so will our innovation to make our customers the ultimate winners of this race. Accurate transcriptions feed directly into business benefit for our customer companies, as they enable the ASAPP platform to augment agents—providing real-time predictions of what to say and do to address consumers’ needs, drafting call summary notes, and automating numerous micro-processes. Plus, having insights from full and accurate transcriptions gives these companies a true voice-of-the-customer perspective to inform a range of business decisions.
At ASAPP, innovation is grounded in the strong research capability that enabled the milestones above. But our strength is not only in research; it is also in an agile engineering culture that makes rapid productization of research innovations possible. This is well exemplified by our recent launch of a multistream convolutional neural network (CNN) model in our production ASR systems.
Multistream CNN—in which input audio is processed at different resolutions for better robustness to noisy audio—is one of the main contributing factors to our successful research outcomes in the LibriSpeech race. Its structure consists of multiple streams of convolution layers, each configured with a unique filter resolution. The downside of this kind of model is extra processing time, which increases latency because many future speech frames must be processed during ASR decoding. Rather than leaving it as a high-performing but not-production-feasible research prototype, we made the multistream CNN model suitable for real-time ASR processing by dynamically assigning compute resources during decoding, while maintaining the same accuracy as the slower research-lab prototype. Our current production ASR systems take advantage of this optimized model, offering more reliable transcriptions even for the noisy audio signals found in agent-customer conversations in contact centers.
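To give a rough sense of the architecture, the sketch below (heavily simplified, not the production model) runs parallel 1-D convolution streams over the acoustic features, each with its own dilation rate standing in for a "filter resolution", and concatenates the stream outputs. All layer sizes are illustrative.

```python
# Simplified multistream CNN sketch in PyTorch: parallel 1-D convolution
# streams over acoustic features, each with a different dilation, merged
# before the output layer. Sizes are illustrative only.
import torch
import torch.nn as nn

class MultistreamCNN(nn.Module):
    def __init__(self, feat_dim=80, channels=256, dilations=(1, 3, 6)):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(feat_dim, channels, kernel_size=3, dilation=d, padding=d),
                nn.ReLU(),
                nn.Conv1d(channels, channels, kernel_size=3, dilation=d, padding=d),
                nn.ReLU(),
            )
            for d in dilations                      # one stream per resolution
        ])
        self.output = nn.Conv1d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, feats):                       # feats: (batch, feat_dim, time)
        merged = torch.cat([s(feats) for s in self.streams], dim=1)
        return self.output(merged)                  # per-frame features for the decoder

model = MultistreamCNN()
frames = torch.randn(1, 80, 200)                    # 200 frames of 80-dim features
print(model(frames).shape)                          # torch.Size([1, 256, 200])
```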
As illustrated in Stanley Kubrick’s 1968 movie 2001: A Space Odyssey, the human aspiration to create AI that understands the way we communicate has driven significant technological advancements in many areas. Deep learning has brought revolutionary changes to AI research, including ASR, which has taken greater leaps in the last decade than in the 30 years before it. The radical improvement in ASR accuracy, which should make consumers embrace voice recognition products more comfortably than at any time in history, is expected to open up a $30 billion market for ASR technology in the next few years.
As we enter an era in which our own odyssey toward human-level ASR may soon reach its destination, ASAPP, as a market leader, will continue to invest in rapid AI innovation, balancing cutting-edge research with fine-tuned productization to enhance customer experience in meaningful ways.
Our research work in this area was presented at the Ai4 conference.