Blog

Concurrency
Digital Engagement
Measuring Success
R&D Innovations
Articles

Why AHT isn’t the right measure in an asynchronous and multi-channel world

by Heather Reed
Jun 4
2 mins

Operations teams have been using agent handle time (AHT) to measure agent efficiency, manage workforce, and plan operation budgets for decades. However, customers have been increasingly demonstrating they’d prefer to communicate asynchronously—meaning they can interact with agents when it is convenient for them, taking a pause in the conversation and seamlessly resuming minutes or hours later, as they do when multitasking, handling interruptions, and messaging with family and friends.

In this new asynchronous environment, AHT is an inappropriate measure of how long it takes agents to handle a customer’s issue: it overstates the amount of time an agent spends working with a customer. Rather, we consider agent throughput as a better measure of agent efficiency. Throughput is the number of issues an agent handles over some period of time (e.g. 10 issues per hour) and is a better metric for operations planning.

One common strategy for increasing throughput is to merely give agents more issues to handle at once, which we call concurrency. However, attempts to increase throughput by simply increasing an agent’s concurrency without giving them better tools to handle multiple issues at once are short-sighted. Issues that escalate to agents are complex and require significant cognitive load, as “easier” issues have typically already been automated.

Therefore, naively increasing agent concurrency without cognitive load consideration often results in adverse effects on agent throughput, frustrated customers who want faster response times, and agents who burn out quickly.

The ASAPP solution to this is to use an AI-powered flexible concurrency model. A machine learning model measures and forecasts the cognitive demand on agents and dynamically increases concurrency in an effective way. This model considers several factors including customer behaviors, the complexities of issues, and expected work required to resolve the issue to determine an agent’s concurrency capacity at a given point in time.

We’re able to increase throughput by reducing demands on the agent’s time and cognitive load, resulting in agents more efficiently handling conversations, while elevating the customer experience.

Measuring throughput

In equation form, throughput is the inverse of agent handle time (AHT) multiplied by the number of issues an agent can handle concurrently.

Throughput = (1 / AHT) × (number of issues handled concurrently)

For example, if it takes an agent half an hour on average to handle an issue, and she handles two issues concurrently, then her throughput is 4 issues per hour.
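The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the function name `throughput_per_hour` and its parameter names are illustrative and do not come from any ASAPP product.

```python
def throughput_per_hour(aht_hours: float, concurrency: float) -> float:
    """Issues handled per hour: (1 / AHT) multiplied by concurrency."""
    if aht_hours <= 0:
        raise ValueError("AHT must be a positive number of hours")
    return concurrency / aht_hours


# Half an hour per issue, two issues handled at once.
example = throughput_per_hour(aht_hours=0.5, concurrency=2)
print(example)  # 4.0
```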

The equation shows two obvious ways to increase throughput:

  1. Reduce the time it takes to handle each individual issue (reduce the AHT); and
  2. Increase the number of issues an agent can concurrently handle.

At ASAPP, we think about these two approaches to increasing throughput, particularly as customers move to adopt more asynchronous communication.

AHT as a metric is only applicable when the agent handles one contact at a time—and it’s completed end-to-end in one session. It doesn’t take into account concurrent digital interactions, nor asynchronous interactions.

Heather Reed, PhD

Reducing AHT

The first piece of the throughput-maximization problem entails identifying, quantifying, and reducing the time and effort required for agents to perform the tasks to solve a customer issue.
We think of the total work performed by an agent as a function of both the cognitive load (CL) and the time required to perform a task. This definition is analogous to the definition of work in physics, where Work = (Load applied to an object) × (Distance to move the object).

The agents’ cognitive load during a conversation (visualized by the height of the black curve and the intensity of the green bar) is affected by:

  • crafting messages to the customer;
  • looking up external information for the customer;
  • performing work on behalf of the customer;
  • context switching among multiple customers; etc.

The total work performed is the area under the curve, which can be reduced by decreasing the effort (CL) and time to perform tasks. We can compute the average across the interaction—a flat line—and in a synchronous environment, that can be very accurate.

The cognitive load varies throughout the duration of the issue, as shown by the height of the curve and the intensity of the green color. The total work performed is the multiplication of the cognitive load and the time to perform the task
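To make the "area under the curve" idea concrete, here is a small sketch that approximates total work with the trapezoidal rule. The sample times, the load values, and the unit of cognitive load are all hypothetical, chosen only for illustration. They are not measurements from the article.

```python
def total_work(times, loads):
    """Approximate the area under a cognitive-load curve (trapezoidal rule)."""
    if len(times) != len(loads) or len(times) < 2:
        raise ValueError("need at least two matching (time, load) samples")
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        area += 0.5 * (loads[i] + loads[i - 1]) * dt
    return area


def average_load(times, loads):
    """The 'flat line' equivalent: total work divided by elapsed time."""
    return total_work(times, loads) / (times[-1] - times[0])


# Hypothetical samples: minutes into the issue, and load on an arbitrary scale.
minutes = [0, 2, 4, 6]
load = [1, 3, 3, 1]
print(total_work(minutes, load))  # 14.0
```

Reducing either the height of the curve (load) or its width (time) shrinks the area, which is the reduction in work the text describes.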

ASAPP automation and agent augmentation features are designed to reduce both handling time and the agent’s cognitive load, meaning the amount of energy it takes to solve a customer’s problem or upsell a prospect. For example, Autosuggest provides message recommendations that contain relevant customer information, saving agents the time and effort of looking up information about customers (e.g. their bill amount) as well as the time spent physically crafting the message.

For synchronous conversations, that means each call is less tiring. For asynchronous conversations, that means agents can handle an increasing number of issues without corresponding increases in stress.

In some cases, we can completely eliminate the cognitive load from a part of a conversation. Our auto-pilot feature automates entire portions of the interaction (for example, collecting a customer’s device information), freeing up the agent’s attention.

Augmentation and automation features reduce time and CL to perform tasks during an issue

Using multiple augmentation features during an issue reduces both the overall AHT and the total work performed.

When the customer is asynchronous, the majority of the agent’s time would be spent waiting for the customer to respond. This is not an effective use of the agent’s time, which brings us to the second piece of the throughput-maximization problem.

Increasing concurrency

We can improve agent throughput by increasing concurrency. Unfortunately, this is more complex than simply increasing the number of issues assigned to an agent at once. Issues that escalate to agents are complex and emotive, as customers typically get basic needs met through self-service or automation. If an agent’s concurrency is increased without forecasting workload, then increasing concurrency will actually have an adverse effect on the AHT of individual issues.

If increasing concurrency results in increased AHT, then the impact on overall throughput can be negative. What’s more, customers can become frustrated at the lack of response from the agent and bounce to other support channels or, worse, consider switching providers, and agents may feel overwhelmed and risk burning out or churning.
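A quick way to see when extra concurrency backfires is to compute the break-even AHT, meaning the handle time above which the higher concurrency yields lower throughput than before. The sketch below follows directly from the throughput formula. The function name and the numbers are illustrative assumptions only.

```python
def break_even_aht(aht_hours, old_concurrency, new_concurrency):
    """AHT at the new concurrency that leaves throughput unchanged."""
    return aht_hours * new_concurrency / old_concurrency


# Hypothetical agent: 0.5 h per issue at 2 concurrent issues (4 issues/hour).
limit = break_even_aht(0.5, old_concurrency=2, new_concurrency=3)
print(limit)  # 0.75

# If the extra load pushes AHT to 0.9 h, throughput falls below the original 4.
print(3 / 0.9 < 2 / 0.5)  # True
```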

Flexible concurrency

We can alleviate this problem with flexible concurrency: an AI-driven approach to this problem. A machine learning model keeps track of the work the agent is doing, and dynamically increases an agent’s concurrency to keep the cognitive load manageable.

Combined with ASAPP augmentation features, our flexible concurrency model can safely increase an agent’s concurrency, enabling higher throughput and increased agent efficiency.

A visual comparison of agent throughput without (top) and with (bottom) ASAPP augmentation and flexible concurrency AI models. With ASAPP, the agent is able to handle several more customer issues concurrently because work required to resolve each issue is reduced.

In summary

As customers increasingly prefer to interact asynchronously, AHT becomes less appropriate for operations planning. Throughput (the number of issues within a time period) is a better metric to measure agent efficiency and manage workforce and operations budgets. ASAPP AI-driven agent augmentation paired with a flexible concurrency model enables our customers to safely increase agent throughput while maintaining manageable agent workload—and still deliver an exceptional customer experience.

AI Native®
Customer Experience
Articles
Why ASAPP

Gartner Recognizes ASAPP for Continuous Intelligence in CX

by Macario Namie
Jun 2
2 mins

Every year Gartner scans the horizons for companies who offer technology or services that are innovative, impactful, or intriguing. Gartner analysts might ask themselves: What’s something that customers could not do before? What technical innovation is focused on producing business impact? Or what new technology or service appears to be addressing systemic challenges?

This year’s Gartner report naming ASAPP as a “Cool Vendor” affirms our efforts at the intersection of artificial intelligence (AI) and customer experience (CX). We entered into this $600 billion industry because we wanted to create real change—building machine learning products that augment and automate the world’s workflows—and address the most costly and painful parts of CX that are largely ignored today.

Despite billions of dollars spent on technology designed to keep customers away from speaking with agents—starting with IVRs a few decades ago and most recently, chatbots—the human agent is still there. And in record numbers. Most large B2C organizations have actually increased their agent population over the last several years. And it is these human agents, the ones who represent your brand to millions of customers, who have been most ignored by innovators.

As ASAPP followers know well, this is why we exist. Embracing automation as an augmentor of human agents, not as a replacement for them, dramatically elevates the entire performance of sales and service contact centers. Real-time continuous intelligence techniques are used to tell every agent the right thing to say and do, live during an interaction. The company benefits from radical increases in organizational productivity, while the customers get exactly what they want: the right answer in the fastest possible time.

We’re proud of the academic recognition ASAPP Research achieves for advancing the state of the art of automatic speech recognition (ASR), NLP, and Task-Oriented Dialogue. However, it’s the business results of this applied research that keep ASAPP moving forward. We celebrate this Gartner recognition with our customers like American Airlines, Dish and JetBlue, who are seeing the business results of AI in their customer service.

So what makes a company applying artificial intelligence for customer experience a “Cool Vendor?” Well, check out the Gartner report. However, I would say it’s our exclusive focus on human performance within CX. Learn more by reading this year’s Gartner Cool Vendor report.

GARTNER DOES NOT ENDORSE ANY VENDOR, PRODUCT OR SERVICE DEPICTED IN ITS RESEARCH PUBLICATIONS, AND DOES NOT ADVISE TECHNOLOGY USERS TO SELECT ONLY THOSE VENDORS WITH THE HIGHEST RATINGS OR OTHER DESIGNATION. GARTNER RESEARCH PUBLICATIONS CONSIST OF THE OPINIONS OF GARTNER’S RESEARCH ORGANIZATION AND SHOULD NOT BE CONSTRUED AS STATEMENTS OF FACT. GARTNER DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, WITH RESPECT TO THIS RESEARCH, INCLUDING ANY WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Measuring Success
R&D Innovations
Transcription
Articles

Task-oriented dialogue systems could be better. Here’s a new dataset to help.

by Derek Chen
May 26
2 mins

Dialogue State Tracking has run its course. Here’s why Action State Tracking and Cascading Dialogue Success is next.

For call center applications, dialogue state tracking (DST) has traditionally served as a way to determine what the user wants at that point in the dialogue. However, in actual industry use cases, the work of a call center agent is more complex than simply recognizing user intents.

In real-world environments, agents are typically tasked with strenuous multitasking. Tasks often include reviewing knowledge base articles, evaluating guidelines on what can be said, examining dialogue history with a customer, and inspecting customer account details, all at once. In fact, according to ASAPP internal research, call center phone agents spend approximately 82 percent of their total time looking at customer data, step-by-step guides, or knowledge base articles. Yet none of these aspects are accounted for in classical DST benchmarks. A more realistic environment would employ a dual constraint where the agent needs to obey customer requests while considering company policies when taking actions.

That’s why, in order to improve the state of the art of task-oriented dialogue systems for customer service applications, we’re establishing a new Action-Based Conversations Dataset (ABCD). ABCD is a fully-labeled dataset with over 10k human-to-human dialogues containing 55 distinct user intents requiring unique sequences of actions constrained by company policies to achieve task success.

The major difference between ABCD and other datasets is that it asks the agent to adhere to a set of policies that call center agents often face, while simultaneously dealing with customer requests. With this dataset, we propose two new tasks: Action State Tracking (AST)—which keeps track of the state of the dialogue when we know that an action has taken place during that turn; and Cascading Dialogue Success (CDS)—a measure for the model’s ability to understand actions in context as a whole, which includes the context from other utterances.

Dataset Characteristics

Unlike other large open-domain dialogue datasets often built for more general chatbot entertainment purposes, ABCD focuses deeper on increasing the count and diversity of actions and text within the domain of customer service. Dataset participants were additionally incentivized through financial bonuses when properly adhering to policy guidelines in handling customer requests, mimicking customer service environments and realistic agent behavior.

The training process to annotate the dataset, for example, at times felt like training for a real call center role. “I feel like I’m back at my previous job as a customer care agent in a call center,” said one MTurk agent who was involved in the study. “Now I feel ready to work at or interview for a real customer service role,” said another.

New Benchmarks

The novel features in ABCD challenge the industry to measure performance across two new dialogue tasks: Action State Tracking & Cascading Dialogue Success.

Action State Tracking (AST)

AST improves upon DST metrics by detecting the pertinent intent from customer utterances while also taking into account constraints from agent guidelines. Suppose a customer is entitled to a discount which will be offered by issuing a [Promo Code]. The customer might request 30% off, but the guidelines stipulate only 15% is permitted, which would make “30” a reasonable, but ultimately flawed slot-value. To measure a model’s ability to comprehend such nuanced situations, we adopt overall accuracy as the evaluation metric for AST.

Cascading Dialogue Success (CDS)

Since the appropriate action often depends on the situation, we propose the CDS task to measure a model’s ability to understand actions in context. Whereas AST assumes an action occurs in the current turn, the task of CDS includes first predicting the type of turn and its subsequent details. The types of turns are utterances, actions, and endings. When the turn is an utterance, the detail is to respond with the best sentence chosen from a list of possible sentences. When the turn is an action, the detail is to choose the appropriate slots and values. Finally, when the turn is an ending, the model should know to end the conversation. This score is calculated on every turn, and the model is evaluated based on the percent of remaining steps correctly predicted, averaged across all available turns.
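The scoring rule in the last sentence can be sketched as follows. This is one reading of the description, not the reference implementation. It assumes that, from each starting turn, credit accrues only until the first incorrect prediction (the "cascading" part), and that each turn has already been judged simply correct or incorrect. The function name is hypothetical.

```python
def cascading_dialogue_success(step_correct):
    """Average, over every starting turn, of the share of remaining steps
    predicted correctly before the first mistake (assumed cascade rule)."""
    n = len(step_correct)
    if n == 0:
        raise ValueError("need at least one turn")
    total = 0.0
    for start in range(n):
        run = 0
        for ok in step_correct[start:]:
            if not ok:
                break  # assumed: no credit after the first failure
            run += 1
        total += run / (n - start)
    return total / n


# Hypothetical 4-turn dialogue where only the third prediction is wrong.
score = cascading_dialogue_success([True, True, False, True])
print(round(score, 4))  # 0.4583
```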

Why This Matters

For customer service and call center applications, it is time for both the research community and industry to do better. Models relying on DST as a measure of success have little indication of performance in real world scenarios, and discerning CX leaders should look to other indicators grounded in the conditions that actual call center agents face.

Rather than relying on general datasets which expand upon an obtuse array of knowledge base lookup actions, ABCD presents a corpus for building more in-depth task-oriented dialogue systems. The availability of this dataset and two new tasks creates new opportunities for researchers to explore better, more reliable, models for task-oriented dialogue systems.

We can’t wait to see what the community creates from this dataset. Our contribution to the field with this dataset is another major step to improving machine learning models in customer service.

Read the Complete Paper, & Access the Dataset

This work has been accepted at NAACL 2021. Meet the authors on June 8th, 20:00—20:50 EST, where this work will be presented as a part of “Session 9A-Oral: Dialogue and Interactive Systems.”

R&D Innovations
Transcription
Articles

Why a little increase in transcription accuracy is such a big deal

by Austin Meyer
May 24
2 mins

A lot has been written lately about the importance of accuracy in speech-to-text transcription. It’s the key to unlocking value from the phone calls between your agents and your customers. For technology evaluators, it’s increasingly difficult to cut through the accuracy rates being marketed by vendors—from the larger players like Google and Amazon to smaller niche providers. How do you determine the best transcription engine for your organization to unlock the value of transcription?

The reality is that there is no one speech transcription model to rule them all. How do we know? We tried them.

In our own testing some models performed extremely well in industry benchmarks. But then they failed to reproduce even close to the same results when put into production contact center environments.

Benchmarks like Librispeech use a standardized set of audio files which speech engineers optimize for on many different dimensions (vocabulary, audio type, accents, etc.). This is why we see word error rates (WERs) in the <2% range. These models are now outperforming the human ear (4% WER) on the same data, which is an incredible feat of engineering. Doing well on industry benchmarks is impressive, but what evaluators really need to know is how these models perform in their real-world environment.

What we’ve found in our own testing is that most off-the-shelf Automatic Speech Recognition (ASR) models struggle with different contact center telephony environments and the business specific terminology used within those conversations. Before ASAPP, many of our customers were able to get transcription live after months of integration and even utilized domain specific ASRs, but only saw accuracy rates in the area of 70%, nudging closer to 80% only in the most ideal conditions. That is certainly a notch above where it was 5 or 10 years ago, but most companies still don’t transcribe 100% of phone calls. Why? Because they don’t expect to get enough value to justify the cost.

So how much value is there in a higher real-world accuracy rate?

The words that are missed in the gap between 80% accuracy and 90% accuracy are often the ones that matter most. They’re the words that are specific to the business and are critical to unlocking value.

Austin Meyer

More than you might imagine. The words that are missed are often the most important ones: they are specific to the business and critical to unlocking value. These would be things like:

  • Company names (AT&T, Asurion, Airbnb)
  • Product and promotion names (Galaxy S12, MLB League Pass, Marriott Bonvoy Card)
  • People’s names, emails and addresses
  • Long numbers such as serial numbers and account numbers
  • Dollar amounts and dates

To illustrate this point, let’s look at a sample of 10,000 hours of transcribed audio from a typical contact center. There are roughly 30,000 unique words within those transcripts, yet the system only needs to recognize 241 of the most frequently used words to get 80% accuracy. Those are largely words like “the”, “you”, “to”, “what”, and so on.

To get to 90% accuracy, the system needs to correctly transcribe the next 324 most frequently used words, and even more for every additional percent. These are often words that are unique to your business—the words that really matter.
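The coverage calculation behind these figures can be sketched in a few lines. The example assumes that "accuracy" here means the share of all spoken tokens accounted for by the most frequent words, and it uses a tiny made-up token list rather than the 10,000-hour sample. The function name is illustrative.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target):
    """Number of most-frequent distinct words needed so that they account
    for at least `target` (0 to 1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(counts)


# Hypothetical transcript: common words dominate, business terms are rare.
tokens = ["the"] * 5 + ["you"] * 3 + ["galaxy", "bonvoy"]
print(words_needed_for_coverage(tokens, 0.8))  # 2
print(words_needed_for_coverage(tokens, 0.9))  # 3
```

Even in this toy case, the last few percentage points depend entirely on the rare, business-specific terms.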

Context also impacts accuracy and meaning. If someone says, “Which Galaxy is that?”, depending on the context, they could be talking about a Samsung phone or a collection of stars and planets. This context will often impact the spelling and capitalization of many important words.

Taking this even further, if someone says, “my Galaxy is broken”, but they don’t mention which model they have, anyone analyzing those transcripts to determine which phone models are problematic won’t know unless that transcript is tied to additional data about that customer. The effort of manually integrating transcripts with other datasets that contain important context dramatically increases the cost of getting value from transcription.

When accuracy doesn’t rise above 80% in production and critical context is missing, you get limited value from your data: nothing more than simple analytics like high-level topic/intent categorization, maybe tone, basic keywords, and questionable sentiment scores. That’s not enough to significantly impact the customer experience or the bottom line.

It’s no wonder companies can’t justify transcribing 100% of their calls despite the fact that many of them know there is rich data there.

The key to mining the rich data that’s available in your customer conversations, and to getting real value from transcribing every word of every call, is threefold:

  • Make sure you have an ASR model that’s custom tuned and continuously adapts to the lexicon of your business.
  • Connect your transcripts to as much contextual metadata as possible.
  • Have readily accessible tools to analyze data and act on insights in ways that create significant impact for your business—both immediately and long term.

ASAPP answers that call. When you work with us you’ll get a custom ASR model trained on your data to transcribe conversation in real time, and improve with every call. Our AI-driven platform will deliver immediate business value through an array of CX-focused capabilities, fed by customer conversations and relevant data from other systems. Plus, it provides a wealth of voice of the customer data that can be used across your business. When you see your model in action and the tremendous value you get with our platform, it makes a lot more sense to transcribe every word of every call. Interested? Send us a note at ask@asapp.com and we’ll be happy to show you how it’s done.

AI Native®
Concurrency
Customer Experience
Articles
CX & Contact Center Insights

An urgent case to support contact center agents with AI built for them

by Rachel Knaster
May 10
2 mins

A colleague describes customer service as the heartbeat of a company. I am yet to think of a better description. And in the midst of this global pandemic, that heartbeat is working at about 200 beats per minute.

Why are customer service agents under so much strain?

There are a variety of factors putting pressure on customer service organizations:

  • Volume of questions / calls / chats from customers is at an unprecedented high.
  • The responses to their questions are changing daily as this situation unfolds.
  • Many agents have been relocated to work from home.
  • Many agents are unable to get into work and cannot work from home, so total staffing is lower.
  • Customers are scared and frustrated (after long wait times). They need answers to their questions and more than ever, they want to hear those answers from a human.

Why isn’t anyone helping?

Unfortunately, the trend in this space over the last several years has been to “contain” or “deflect” customers from connecting with agents. While AI and ML have become familiar terms within contact centers, their primary use has been to engage bots aimed at preventing as many customers as possible from talking to agents.

How can you help your agents?

Our philosophy on AI and ML in this space is: Let’s use this powerful technology to augment the humans. Let’s allow conversations between customers and agents, learn from them, and use those learnings to drive better, more efficient interactions. This philosophy rings through our platform from our proprietary language models, to our intuitive UI/UX, to our ongoing engagement with agents through focus groups and roundtables to make sure what we are building is working for them.

Why focusing on agents is most important

  1. It drives the best results: Increased agent efficiency with increased customer AND agent satisfaction.
  2. Agents are the bottleneck right now.
  3. Your agents are on the front line — an important face of your brand to your customers.
  4. Better performing agents lead to happier customers.
  5. Agents provide the best feedback loop for what works and what doesn’t work.
  6. During this crazy time you can either let that heartbeat keep going up until it can no longer do what’s needed, or you can provide the necessary tools to make sure it can keep supporting the other organs / functions.

Are you doing everything you can to support your agents so they can serve your customers well?

R&D Innovations
Articles

Reduce wait times, increase CSAT scores. But, how?

by Denton Zhao
May 7
2 mins

Customer satisfaction (CSAT) scores are an indicator of customer loyalty and confidence. It is reasonable to assume that CSAT scores play an important factor in reducing customer churn, increasing recurring revenue, and increasing customer lifetime value.

We analyzed customer and agent chat interactions for factors that impact customer satisfaction (CSAT) scores. Negative CSAT scores are directly correlated with four main factors, three of them specifically around wait time:

  1. Customers are put in a queue to wait to speak to an agent.
  2. Customers wait for agents to respond.
  3. Customers need to be transferred to another agent after speaking with an initial agent.
  4. Customers using digital communications are timed out from their chat due to inactivity and must start over.

The results show: Wait time significantly impacts CSAT scores

We reviewed CSAT scores against the customer experience for more than 17,000 interactions. Because CSAT scores use a 5-point scale, scores between one and three were consolidated into the negative class, and the top two scores were consolidated into the positive class.
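The consolidation into two classes amounts to a simple threshold. The sketch below shows that mapping together with a negative-rate calculation. The function names and the sample scores are hypothetical and are not taken from the analysis.

```python
def csat_class(score: int) -> str:
    """Map a 1-5 CSAT score to the binary class used in the analysis."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("CSAT score must be an integer from 1 to 5")
    return "negative" if score <= 3 else "positive"


def negative_rate(scores):
    """Share of interactions whose score falls in the negative class."""
    labels = [csat_class(s) for s in scores]
    return labels.count("negative") / len(labels)


# Hypothetical scores for five interactions.
print(negative_rate([5, 4, 3, 1, 5]))  # 0.4
```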

As customers are enqueued for a longer period of time (x-axis), the favorable CSAT rate decreases.
Agents typically take 20 to 60 seconds (x-axis) to respond to a message from the customer. Favorable CSAT rates (y-axis) decrease as agents take longer to respond.
A simulation of CSAT scores was conducted by taking repeated samples of conversations with agent transfers (blue) and conversations without transfers (pink). Note the difference in the number of positive CSAT scores observed (y-axis).

Negative CSAT rates (scores between 1 and 3) only occur 20% of the time, but when a customer is timed out, the negative CSAT rate jumps up to 80%.

How ASAPP provides opportunities for higher CSAT scores

Directed automation features such as “agent autosuggest” and automated conversation summary notes reduce agent response times, and AI-driven knowledge base article retrieval models help agents streamline the troubleshooting process. This not only reduces the time customers currently wait for an agent response but also improves throughput, which reduces queue times as well.

It is important that customers get to an agent that can actually solve their problems. ASAPP intent classification sorts conversations into different types based on the initial set of utterances by the customer. This classification helps match each customer with an appropriate agent and reduces the need for multiple transfers.

A queue check-in feature checks whether a queued customer is still available before routing them to an agent. This keeps agents from spending time connecting to a customer who has already left the line.

As agents gain efficiency and communicate on asynchronous channels they’re able to handle multiple issues at once, further reducing enqueuement times. Small gains in efficiency on an individual conversation level add up to larger effects on throughput—for each agent and for the whole CX team.

R&D Innovations
Transcription
Articles

Addressing instabilities for few-sample BERT fine-tuning

by Felix Wu
Apr 29
2 mins

The costs of BERT Fine-Tuning on small datasets

Fine-tuning BERT or its variants has become one of the most popular and effective methods to tackle natural language processing tasks, especially those with limited data. BERT models have been downloaded more than 5.6 million times from Huggingface’s public server.

However, fine-tuning remains unstable, especially when using the large variant of BERT (BERTLarge) on small datasets, arguably the most impactful use of BERT-style models. Identical learning processes with different random seeds often result in significantly different and sometimes degenerate models following fine-tuning, even though only a few, seemingly insignificant aspects of the learning process are impacted by the random seed (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020). In layman’s terms: every time you train BERT for your task, you get different results. This means you need to train again and again to get a good system. This makes scientific comparison challenging (Dodge et al., 2020) and creates huge costs, which are potentially unnecessary.

While the variance comes from randomness, we hypothesize that the major cause of this instability lies in the optimization process.

Revisiting Few-sample BERT Fine-tuning

We conducted an extensive empirical analysis of BERT fine-tuning optimization behaviors on three aspects to identify the root cause of instability:

  1. The Optimization Algorithm: We found that omitting debiasing in the BERTAdam algorithm (Devlin et al., 2019) is the main cause of degenerate models.
  2. The Initialization: We found that re-initializing the top few layers of BERT stabilizes the fine-tuning procedure.
  3. The Number of Training Iterations: We found that the model still requires hundreds of updates to converge.

1. Optimization Algorithm

We observed that omitting debiasing in the BERTAdam algorithm (Devlin et al., 2019) is the leading cause of degenerate fine-tuning runs. In the pseudo-code of the Adam algorithm (Kingma & Ba, 2014), BERTAdam omits lines 9 and 10, which correct the biases in the first and second moment estimates.

Figure: Pseudo-code of the Adam algorithm. BERTAdam omits lines 9 and 10, the bias-correction steps.
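
As a rough illustration of the bias correction discussed above, here is a minimal Python sketch of a single Adam update for one scalar parameter. The function name `adam_step`, the `bias_correction` flag, and the default hyperparameter values are assumptions chosen for illustration, not the authors' code. Setting the flag to `False` skips the two debiasing lines, which is the behavior the article attributes to BERTAdam.

```python
import math


def adam_step(param, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
              eps=1e-6, bias_correction=True):
    """One Adam update for a single scalar parameter.

    t is the 1-based step count. Hyperparameter defaults are illustrative
    assumptions. With bias_correction=False the two debiasing lines are
    skipped, mimicking the BERTAdam behavior described in the article.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment estimate
    if bias_correction:
        m_hat = m / (1.0 - beta1 ** t)            # debias first moment
        v_hat = v / (1.0 - beta2 ** t)            # debias second moment
    else:
        m_hat, v_hat = m, v                       # no debiasing
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def first_step_size(bias_correction):
    """Size of the very first update for a unit gradient and unit learning rate."""
    new_param, _, _ = adam_step(0.0, 1.0, 0.0, 0.0, t=1, lr=1.0,
                                bias_correction=bias_correction)
    return abs(new_param)


ratio = first_step_size(False) / first_step_size(True)
print(round(ratio, 2))  # 3.16
```

With these illustrative settings, the first uncorrected step is roughly three times larger than the corrected one, because skipping the correction shrinks the second moment term in the denominator more than it shrinks the first moment term in the numerator. The gap narrows as the step count grows.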

Fine-tuning BERT with the original Adam (with bias correction) eradicates almost all degenerate model training outcomes and reduces the variance across multiple randomized trials. Here, we show the test performance distribution of 20 random trials with or without bias correction on four small datasets.

Figure: Test performance distribution of 20 random trials with and without bias correction on four small datasets.

Since the variance is significantly reduced, practitioners can easily get a decent model within only one to five trials instead of fine-tuning up to 20 models and picking the best one.

2. Initialization

We hypothesized that the top pre-trained layers of BERT are specific to the pre-training task and may not transfer to a dissimilar downstream task. We propose to re-initialize the top few layers of BERT to ease the fine-tuning procedure. We plot the training curves with and without re-initialization below, showing consistent improvement for models with re-initialized output layers.

Figure: Training curves with and without re-initialization.

The following figure shows the validation performance with different numbers of re-initialized layers. As we can see, re-initializing a single layer is already beneficial, while the best number of layers to re-initialize depends on the downstream task.

Figure: Validation performance with different numbers of re-initialized layers.
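
To make the re-initialization idea concrete, the sketch below uses a toy model in which each layer is just a flat list of weights, ordered bottom to top. The helper `reinit_top_layers` and the toy data are hypothetical. Drawing fresh weights from a normal distribution with standard deviation 0.02 mirrors the initializer range commonly used for BERT, but it is an assumption here, as is the choice of two layers.

```python
import random


def reinit_top_layers(layers, num_reinit, std=0.02, seed=0):
    """Return a copy of `layers` whose top `num_reinit` layers hold fresh
    random weights; the lower layers keep their pre-trained values.

    `layers` is ordered bottom to top; each layer is a flat list of floats.
    """
    rng = random.Random(seed)
    cut = max(len(layers) - num_reinit, 0)        # index of first layer to reset
    kept = [list(w) for w in layers[:cut]]        # pre-trained weights, copied
    fresh = [[rng.gauss(0.0, std) for _ in w]     # new random weights
             for w in layers[cut:]]
    return kept + fresh


# Toy "pre-trained" stack of 6 layers with 4 weights each (hypothetical values).
pretrained = [[float(i)] * 4 for i in range(1, 7)]
model = reinit_top_layers(pretrained, num_reinit=2)
```

Since the best number of layers depends on the downstream task, `num_reinit` would in practice be treated as a hyperparameter and chosen on validation data.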

3. Number of Training Iterations

We also studied the conventional 3-epoch fine-tuning setup of BERT. Through extensive experiments on various datasets, we observe that the widely adopted 3-epoch setup is insufficient for few-sample datasets. Even with few training examples, the model still requires hundreds of updates to converge.
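
A quick back-of-the-envelope calculation shows why three epochs can fall short on a small dataset. The dataset size (1,000 examples), batch size (32), and target of 500 updates below are illustrative assumptions, with 500 standing in for "hundreds of updates"; the function names are hypothetical.

```python
import math


def num_updates(num_examples, batch_size, epochs):
    """Total optimizer updates when each epoch visits every example once."""
    return epochs * math.ceil(num_examples / batch_size)


def epochs_needed(num_examples, batch_size, target_updates):
    """Smallest number of epochs that reaches at least `target_updates`."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return math.ceil(target_updates / steps_per_epoch)


print(num_updates(1000, 32, 3))      # 96
print(epochs_needed(1000, 32, 500))  # 16
```

Under these assumptions, the 3-epoch setup yields fewer than 100 updates, so a budget expressed in updates rather than epochs is the safer way to plan few-sample fine-tuning.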


Revisiting Existing Methods for Few-sample BERT Fine-tuning

Instability in BERT fine-tuning, especially in few-sample settings, has received significant attention recently, and several methods have been proposed to address it. We revisited these methods given our analysis of the fine-tuning process, focusing on the impact of using the debiased Adam instead of BERTAdam.

To illustrate, the following figure shows the mean test performance and standard deviation on four datasets. “Int. Task” stands for transferring via an intermediate task (MNLI), “LLRD” stands for layerwise learning rate decay, and “WD” stands for weight decay. Numbers that are statistically significantly better than the standard setting (left column) are in blue and underlined.

Figure: Mean test performance and standard deviation on four datasets.

We found that the standard fine-tuning procedure using bias-corrected Adam already has a fairly small variance, making these more complex techniques largely unnecessary. Moreover, re-initialization and training longer serve as simple yet hard-to-beat baselines that outperform previous methods, except “Int. Task” on RTE. The reason is that RTE is very similar to MNLI (the intermediate task).
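
For readers unfamiliar with layerwise learning rate decay (LLRD), one of the techniques revisited above, the sketch below shows the general idea in its common form: the top layer trains at the base learning rate and each layer beneath it uses a geometrically smaller rate. The function name and the example values are assumptions for illustration; the exact variant evaluated in the paper may differ.

```python
def layerwise_learning_rates(num_layers, base_lr, decay):
    """Learning rate per layer; index 0 is the bottom layer, the last is the top.

    The top layer uses base_lr, and every layer below it is scaled by one
    more factor of `decay` (expected to satisfy 0 < decay <= 1).
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Hypothetical 4-layer example with an illustrative decay factor.
rates = layerwise_learning_rates(num_layers=4, base_lr=2e-5, decay=0.5)
```

The intuition is that lower layers encode more general features and should change less during fine-tuning than layers close to the output.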

Why this work matters

This work carefully investigates the current, broadly adopted optimization practices in BERT fine-tuning. Our findings significantly stabilize BERT fine-tuning on small datasets. Stable training has multiple benefits. It reduces deployment costs and time, potentially making natural language processing applications more feasible and affordable for companies and individuals with limited computational resources.

Our findings are focused on few-sample training scenarios, which opens, or at least eases, the way for new applications at reduced data costs. The reduction in cost broadens the accessibility and reduces the energy footprint of BERT-based models. Applications that require frequent re-training are now easier and cheaper to deploy given the reduced training costs. This work also simplifies the scientific comparison between future fine-tuning methods by making training more stable, and therefore easier to reproduce.

Felix Wu, PhD

Read The Complete Paper:

This work has been accepted and will be published at ICLR 2021. Visit our poster during the virtual conference (Poster Session 2: May 3, 2021, 9 a.m. PDT and May 3, 2021, 11 a.m. PDT) to talk with the authors.

Automation
Customer Experience
Future of CX
Articles

Four AI-centered CX insights from Fortune 500 CXOs

by 
Michael Lawder
Article
Video
Apr 23
2 mins

In recent years, the rapid advance of new data and mobile applications has created a heightened threshold for customer expectations. ASAPP and a suite of forward-looking Chief Experience Officers (CXOs) who represent companies with over $450 billion in market value met to discuss how machine learning (ML), speech recognition, and natural language processing (NLP) are generating higher agent productivity, efficiency, and cost reductions.

Four key AI-centered insights arose from the CXOs who see 2021 as an opportunity to realize the promise of artificial intelligence to radically improve customer experience (CX).

1—Automation creates opportunity for emotive, human-driven service

On the surface, “automation” and “human-driven” seem like two opposing forces. A legacy approach considers automation solely to enable customer self-service, taking human agents out of the customer journey. While self-service will persist in specific applications, automating contact center agents’ repetitive tasks allows a focus on what matters most: providing excellent customer service and representing a brand positively.

AI is opening new avenues to create personalized experiences for customers. With an AI platform, agents know what the customer is facing at a given time, their history with the company, and their communication preferences. Automating administrative work such as note-taking and multitasking lets agents be stronger brand ambassadors, spending more mental energy on an emotive, high-touch response to customer needs.

2—AI-driven real-time insights are the next big opportunity for supervisors and coaches

Previously, an in-person presence at call centers afforded managers the ability to monitor and assist agents shoulder-to-shoulder. But in today’s digital workplace, managers have turned to less streamlined methods of using webcam and Slack to support agents. This approach has made it harder for managers to supervise and coach teams, and the introduction of new digital systems has added increasing complexity for front-line agents.

CXOs are beginning to see the promise of ML, NLP, and automatic speech recognition technologies to power live voice transcription. These AI technologies enable managers to supervise and support agents in real time, guiding agents at the moment they need assistance. After each customer engagement, ML-generated reports and summaries allow managers to digest previous interactions, understand where agents are facing challenges, and improve agent performance. With the AI-analyzed data, managers can adjust strategy and coaching in real time to nimbly respond to the business challenges they face.

In the near future, CXOs expect the confluence of ML, NLP, and automatic speech recognition technologies to provide insight for the next golden opportunity: determining caller intent to more rapidly detect what a caller needs, assess their emotional state, and have them automatically routed to the appropriate agent.

CXOs are excited by the opportunities AI presents. They expect this technology to help their organizations be much more productive and at the same time, differentiate themselves by providing exceptional customer experience.

Michael Lawder

3—Measure what matters for holistic data-driven decision making

Thanks to the advance of ML, businesses are able to scale pattern recognition and automation from their own data. In 2021, the businesses we speak to are going beyond “bean-counting” to unearth correlation-driven insights for strategic business decisions. Outliers and anecdotes are steadily coming together to illustrate, for example, that mobile device users are more willing to have synchronous conversations than desktop users, an insight that may affect routing processes. To detect these patterns, CX teams are looking to ensure that they have individuals with the knowledge to contextualize the data and to build systems to reliably measure it.

However, in the effort to become a digital-first business, building a comprehensive data lake remains a challenge. Businesses are still struggling to compile timely, quality data at a granularity that can be integrated with other data sets. The preservation and architecture of legacy systems have led to persistent data silos that make it hard for decision-makers to see the big picture in the customer journey. CX leaders should demand more from their IT teams and service providers to streamline this data so that businesses and teams are equipped to make changes.

And it’s not just technical IT teams who have a responsibility in building this data treasury. All employees have a role in ensuring that the business is flagging data for data-driven decision making. The first step is a cultural shift in mindset: viewing data as an important corporate asset.

4—Today’s AI and digital technology shouldn’t be used with yesterday’s paradigm

Many of the à la carte solutions found in today’s contact centers were built for a different time. In decades past, businesses relied on outsourcing to balance costs and scale service which often came at the cost of the customer experience. In the 2010s, IVRs and chatbots offered a way to triage workloads but rarely provided a stellar experience for customers. Today, many contact centers are left sustaining a costly myriad of legacy systems that were not designed for a cohesive customer experience. A real transformation to improve customer experiences requires a rethink of how the customer journey operates.

At ASAPP, we’re doing this by putting a focus on making people better with AI. This has meant changing everything we create, from the ground up, for vertically integrated AI and human productivity. We’re changing how we measure ourselves and how we interact with customers. For example, IVRs and legacy systems may deliver cost savings, but they may actually exacerbate customer frustration. An analogy I like to use for this new paradigm for CX is trying to make a train fly. Instead of spending significant resources inefficiently to make trains fly, at ASAPP, we’re building an airplane.

Chief Experience Officers are excited by a future driven by AI: making organizations highly productive and effective by augmenting human activity and automating the world’s workflows. I can’t wait to see what new insights we’ll unearth at our next meeting.

Want to see what makes people so excited to partner with us? Get in touch, or give me a tweet at @michael_lawder.
