The consumer is the ultimate winner in the race for accuracy in speech recognition
Interest in automatic speech recognition (ASR) spans a wide range of applications. Thanks to the recent development of efficient training methods and the availability of greater computing power, deep neural networks have enabled ASR systems to perform astoundingly well across a number of application domains.
At ASAPP, our focus is on augmenting human performance with AI. Today we do that in large consumer contact centers, where our customers serve consumers over both voice and digital channels. ASR is the backbone that enables us to augment agents in real time throughout each customer interaction. We have built the highest-performing ASR system in the world, as measured on industry-standard benchmarks.
At ASAPP we continuously push the limits of what’s possible by not only leveraging technological advances in deep learning, but by also innovating. We are always looking for new ways to analyze problems and explore practical solutions at different levels of detail.
Kyu Han, PhD
LibriSpeech, a speech corpus of 1,000 hours of transcribed audiobooks, has been the most widely used benchmark dataset for ASR research in both academia and industry since its introduction in 2015. Many prominent research groups around the world, including ASAPP, Google Brain, and Facebook AI Research, have been testing their new ideas against it. And never have advances been more rapid than in the past year's race for better results on the LibriSpeech test sets.
In early 2019, Google’s ASR system, using a novel data augmentation method, outperformed all previously existing ASR systems by a wide margin, boasting a word error rate (WER) of 2.5% on the LibriSpeech test-clean set (shown in the figure below). A word error rate is the percentage of words an ASR system gets wrong, measured against a reference transcript of the given audio. Later the same year, ASAPP joined the race and gained immediate attention with a WER of 2.2%, beating the best-performing system at that time, from Facebook. The lead didn’t last long, however, as Google announced a couple of new systems in 2020 to reclaim the driver’s seat, reaching a sub-2.0% WER for the first time. One week after Google’s announcement, ASAPP published a new paper highlighting a 1.75% WER (98.25% accuracy!) to regain the lead. ASAPP remains at the top of the leaderboard (as of September 2020).
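In practice, WER is computed as the word-level Levenshtein (edit) distance between the system's hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch in Python (the function name and example strings are illustrative, not from any benchmark toolkit):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance at the word level, one dynamic-programming row at a time.
    # dp[j] holds the edit distance between ref[:i] and hyp[:j].
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = dp[0]          # distance(ref[:i-1], hyp[:j-1]) for the inner loop
        dp[0] = i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,            # deletion of reference word r
                dp[j - 1] + 1,        # insertion of hypothesis word h
                prev + (r != h),      # substitution, or free match
            )
            prev = cur
    return dp[-1] / len(ref)

print(wer("the cat sat", "the hat sat"))   # one substitution out of three words
```

Note that because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.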
The race will continue, and so will our innovation to make our customers the ultimate winners of this race. Accurate transcription feeds directly into business benefit for our customer companies, enabling the ASAPP platform to augment agents: providing real-time predictions of what to say and do to address consumers' needs, drafting call summary notes, and automating numerous micro-processes. And full, accurate transcriptions give these companies a true voice-of-the-customer perspective to inform a range of business decisions.
At ASAPP, innovation is grounded in the strong research capability behind the milestones above. But our strength lies not only in research but also in an agile engineering culture that makes rapid productization of research innovations possible. A good example is our recent launch of a multistream convolutional neural network (CNN) model in our production ASR systems.
Multistream CNN, in which input audio is processed at different resolutions for better robustness to noisy audio, is one of the main contributing factors to our results in the LibriSpeech race. The model consists of multiple streams of convolution layers, each configured with a unique filter resolution. The downside of this kind of model is extra processing time: decoding must wait on many future speech frames, which drives up latency. Rather than leaving it as a high-performing but production-infeasible research prototype, we developed a multistream CNN model suitable for real-time ASR by dynamically assigning compute resources during decoding, while maintaining the same accuracy as the slower research prototype. Our current production ASR systems take advantage of this optimized model, offering more reliable transcriptions even for the noisy audio of agent-customer conversations in contact centers.
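As a rough illustration of the multistream idea (a toy sketch, not ASAPP's actual architecture), each "stream" below is a 1-D convolution over the same input frames with a different dilation rate, i.e., a different temporal resolution, and the stream outputs are combined before any subsequent layer. All kernels, rates, and shapes here are hypothetical:

```python
def conv1d(frames, kernel, dilation):
    """Dilated 1-D convolution over a sequence of scalar frames (valid padding)."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(kernel[k] * frames[t + k * dilation] for k in range(len(kernel)))
        for t in range(len(frames) - span)
    ]

def multistream(frames, kernels_by_rate):
    """Run one stream per dilation rate, then combine by element-wise sum."""
    streams = [conv1d(frames, k, rate) for rate, k in kernels_by_rate.items()]
    n = min(len(s) for s in streams)   # trim streams to a common length
    return [sum(s[t] for s in streams) for t in range(n)]

frames = [float(t % 5) for t in range(20)]                  # toy "audio" features
out = multistream(frames, {1: [0.5, 0.5], 2: [0.5, 0.5], 3: [0.5, 0.5]})
```

The latency trade-off mentioned above is visible even in this sketch: the stream with the largest dilation needs to see the furthest-ahead frame before it can emit an output for the current position.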
As dramatized in Stanley Kubrick’s 1968 film 2001: A Space Odyssey, the human aspiration to create AI that understands the way we communicate has driven significant technological advancement in many areas. Deep learning has brought revolutionary changes to AI research, including ASR, which has taken greater leaps in the last decade than it did in the preceding 30 years. The radical improvement in ASR accuracy, which should make consumers embrace voice recognition products more readily than at any time in history, is expected to open up a $30 billion market for ASR technology in the next few years.
As we enter an era in which our own odyssey toward human-level ASR may soon reach its destination, ASAPP, as a market leader, will continue to invest in rapid AI innovation, balancing cutting-edge research with fine-tuned productization to enhance customer experience in meaningful ways.
Our research work in this area was presented at the Ai4 conference.
The richest CX data on earth isn't being mined
In recent weeks, I’ve had the opportunity to discuss technical innovation with CIOs at several Fortune 500 organizations. I’m always fascinated to hear these leaders recount how their corporation’s technology and operating framework came to be. Inevitably, every CIO notes how much time and energy their teams have invested in their technology stack, working through many levels of complexity. And they share their frustration at how difficult it is to achieve anything more than incremental improvement year over year.
“They’ve been telling me the same thing for five years”
A common challenge faced by most senior leaders is the absence of new insight into what’s driving their CX outcomes. Terabytes of data flow through these businesses, yet legacy reporting and analytics methodologies yield scant new actionable insight. One CIO told me he no longer asks his teams to provide insights into customer contacts because, “I already know what they are going to tell me because they have been telling me the same thing for the last five years.” Antiquated methodologies can report historical performance, but they fall far short of providing the customer insights needed to deliver material value.
There’s gold in your data
Intelligent data processing translated into actionable insights can enable even the largest organizations to respond rapidly to ever-changing market dynamics. Most Fortune 500 companies are sitting on a mountain of data and extracting little more than information for historical scorecard reporting. This data is a valuable asset that could drive significant business efficiencies through the application of well-developed artificial intelligence.
Data from all your interactions with customers is a rich source of insights that can significantly improve performance across your organization.
Chris Arnold
ASAPP helps companies mine this gold
Leveraging native, self-learning artificial intelligence enables the largest companies in the world to gain new actionable insights into what’s driving their business outcomes. The ASAPP AI platform provides insight into contact drivers, customer intent and sentiment, trends, effectiveness of promotional offers, and opportunities for enhanced automation and self-service. Machine learning models deployed in both voice and digital environments offer real-time as well as historical insights that lead to double-digit OPEX savings, incremental revenue, and enhanced customer and agent experiences.
Turning insight into action
I work daily with F500 customers to modernize how they use real-time and historical data to reimagine their CX operations. By deploying the ASAPP AI Native® platform, they are immediately able to identify common topics raised by customers, understand root causes, and see what actions agents took to resolve customer issues.
Sales and marketing teams can use these insights to drive upsell and cross-sell initiatives, promotions to improve loyalty and retention, and enhanced personalization. The data science team can leverage agent insights to inform automation flows that reduce the effort customers spend getting answers to their questions. And the digital team can use them to update self-service content.
Increasing productivity
These insights also help contact center operations directly. Leveraging data-driven AI, one customer has more than doubled their productivity, with 2.2X increases in resolutions per hour while also realizing a 7 percentage point increase in CSAT. Rather than simply using data for historical reporting, this F500 company is using data to create a leapfrog moment at a very challenging time globally.
Believing there are no new data-driven insights within the CX environment is a costly mistake. Continuing to leverage legacy methodologies will lead to outputs that offer little to address new business challenges. Actionable insights are there, often hiding just below the surface. Identifying and taking action on these insights across the entire organization will result in employees working on critical areas of opportunity rather than old assumptions.