R&D Innovations
Speech-to-text
Transcription
Articles

Why a little increase in transcription accuracy is such a big deal

by Austin Meyer
May 24 · 2 mins

A lot has been written lately about the importance of accuracy in speech-to-text transcription. It’s the key to unlocking value from the phone calls between your agents and your customers. For technology evaluators, it’s increasingly difficult to cut through the accuracy rates being marketed by vendors—from the larger players like Google and Amazon to smaller niche providers. How do you determine the best transcription engine for your organization to unlock the value of transcription?

The reality is that there is no one speech transcription model to rule them all. How do we know? We tried them.

In our own testing, some models performed extremely well on industry benchmarks, but then failed to reproduce even close to the same results when put into production contact center environments.

Benchmarks like LibriSpeech use a standardized set of audio files which speech engineers optimize for along many different dimensions (vocabulary, audio type, accents, etc.). This is why we see WERs in the <2% range. These models now outperform the human ear (4% WER) on the same data, which is an incredible feat of engineering. Doing well on industry benchmarks is impressive—but what evaluators really need to know is how these models perform in their real-world environment.
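For readers evaluating these numbers, word error rate (WER) is simply word-level edit distance: substitutions, deletions, and insertions divided by the number of reference words. Here is a minimal, illustrative sketch of the standard computation (not any vendor's implementation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard edit-distance dynamic program."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("my galaxy is broken", "my galaxy his broken"))  # 0.25
```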

What we’ve found in our own testing is that most off-the-shelf Automatic Speech Recognition (ASR) models struggle with different contact center telephony environments and the business-specific terminology used within those conversations. Before ASAPP, many of our customers were able to get transcription live after months of integration and even utilized domain-specific ASRs, but only saw accuracy rates of around 70%, nudging closer to 80% only under the most ideal conditions. That is certainly a notch above where it was 5 or 10 years ago, but most companies still don’t transcribe 100% of phone calls. Why? Because they don’t expect to get enough value to justify the cost.

So how much value is there in a higher real-world accuracy rate?

“The words that are missed in the gap between 80% accuracy and 90% accuracy are often the ones that matter most. They’re the words that are specific to the business and are critical to unlocking value.”

Austin Meyer

More than you might imagine. The words that are missed are often the most important ones—specific to the business and critical to unlocking value. These would be things like:

  • Company names (AT&T, Asurion, Airbnb)
  • Product and promotion names (Galaxy S12, MLB League Pass, Marriott Bonvoy Card)
  • People’s names, emails and addresses
  • Long numbers such as serial numbers and account numbers
  • Dollar amounts and dates

To illustrate this point, let’s look at a sample of 10,000 hours of transcribed audio from a typical contact center. There are roughly 30,000 unique words within those transcripts, yet the system only needs to recognize 241 of the most frequently used words to get 80% accuracy. Those are largely words like “the”, “you”, “to”, “what”, and so on.

To get to 90% accuracy, the system needs to correctly transcribe the next 324 most frequently used words, and even more for every additional percent. These are often words that are unique to your business—the words that really matter.
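To make that math concrete, here is a rough sketch, with hypothetical data, of how you could measure this kind of coverage on your own transcripts: count word frequencies and see how many of the top words it takes to cover 80% versus 90% of everything that was said.

```python
from collections import Counter

def words_needed_for_coverage(transcripts, target=0.80):
    """Count how many of the most frequent words are needed to cover
    `target` of all word occurrences in a corpus of transcripts."""
    counts = Counter(word for text in transcripts for word in text.lower().split())
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)

# Hypothetical usage: compare the 80% and 90% coverage thresholds.
# transcripts = load_your_transcripts()  # e.g. a list of call transcript strings
# print(words_needed_for_coverage(transcripts, 0.80))
# print(words_needed_for_coverage(transcripts, 0.90))
```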


Context also impacts accuracy and meaning. If someone says, “Which Galaxy is that?”, depending on the context, they could be talking about a Samsung phone or a collection of stars and planets. This context will often impact the spelling and capitalization of many important words.

Taking this even further, if someone says, “my Galaxy is broken”, but they don’t mention which model they have, anyone analyzing those transcripts to determine which phone models are problematic won’t know unless that transcript is tied to additional data about that customer. The effort of manually integrating transcripts with other datasets that contain important context dramatically increases the cost of getting value from transcription.

When accuracy doesn’t rise above 80% in production and critical context is missing, you get limited value from your data: nothing more than simple analytics like high-level topic/intent categorization, maybe tone, basic keywords, and questionable sentiment scores. That’s not enough to significantly impact the customer experience or the bottom line.

It’s no wonder companies can’t justify transcribing 100% of their calls despite the fact that many of them know there is rich data there.

The key to mining the rich data that’s available in your customer conversations, and to getting real value from transcribing every word of every call, is threefold:

  • Make sure you have an ASR model that’s custom tuned and continuously adapts to the lexicon of your business.
  • Connect your transcripts to as much contextual metadata as possible.
  • Have readily accessible tools to analyze data and act on insights in ways that create significant impact for your business—both immediately and long term.

ASAPP answers that call. When you work with us you’ll get a custom ASR model trained on your data to transcribe conversations in real time and improve with every call. Our AI-driven platform will deliver immediate business value through an array of CX-focused capabilities, fed by customer conversations and relevant data from other systems. Plus, it provides a wealth of voice-of-the-customer data that can be used across your business. When you see your model in action and the tremendous value you get with our platform, it makes a lot more sense to transcribe every word of every call. Interested? Send us a note at ask@asapp.com and we’ll be happy to show you how it’s done.

Agents
AI Native®
Concurrency
Customer Experience
Articles

An urgent case to support contact center agents with AI built for them

by Rachel Knaster
May 10 · 2 mins

A colleague describes customer service as the heartbeat of a company. I have yet to think of a better description. And in the midst of this global pandemic, that heartbeat is working at about 200 beats per minute.

Why are customer service agents under so much strain?

There are a variety of factors putting pressure on customer service organizations:

  • Volume of questions / calls / chats from customers is at an unprecedented high.
  • The responses to their questions are changing daily as this situation unfolds.
  • Many agents have been relocated to work from home.
  • Many agents are unable to get into work and cannot work from home, so total staffing is lower.
  • Customers are scared and frustrated (after long wait times). They need answers to their questions and more than ever, they want to hear those answers from a human.

“During this crazy time you can either let that heartbeat keep going up until it can no longer do what’s needed, or you can provide the necessary tools to make sure it can keep supporting the other organs / functions.”

Rachel Knaster

Why isn’t anyone helping?

Unfortunately, the trend in this space over the last several years has been to “contain” or “deflect” customers from connecting with agents. While AI and ML have become familiar terms within contact centers, their primary use has been to deploy bots aimed at preventing as many customers as possible from talking to agents.

How can you help your agents?

Our philosophy on AI and ML in this space is: Let’s use this powerful technology to augment the humans. Let’s allow conversations between customers and agents, learn from them, and use those learnings to drive better, more efficient interactions. This philosophy rings through our platform from our proprietary language models, to our intuitive UI/UX, to our ongoing engagement with agents through focus groups and roundtables to make sure what we are building is working for them.

Why focusing on agents is most important

  1. It drives the best results: Increased agent efficiency with increased customer AND agent satisfaction.
  2. Agents are the bottleneck right now.
  3. Your agents are on the front line — an important face of your brand to your customers.
  4. Better performing agents lead to happier customers.
  5. Agents provide the best feedback loop for what works and what doesn’t work.
  6. During this crazy time you can either let that heartbeat keep going up until it can no longer do what’s needed, or you can provide the necessary tools to make sure it can keep supporting the other organs / functions.

Are you doing everything you can to support your agents so they can serve your customers well?

R&D Innovations
Articles

Reduce wait times, increase CSAT scores. But, how?

by Denton Zhao
May 7 · 2 mins

Customer satisfaction (CSAT) scores are an indicator of customer loyalty and confidence. It is reasonable to assume that CSAT scores play an important role in reducing customer churn, increasing recurring revenue, and increasing customer lifetime value.

We analyzed customer and agent chat interactions for factors that impact customer satisfaction (CSAT) scores. Negative CSAT scores are directly correlated with four main factors, three of them specifically around wait time:

  1. Customers are put in a queue to wait to speak to an agent.
  2. Customers wait for agents to respond.
  3. Customers need to be transferred to another agent after speaking with an initial agent.
  4. Customers using digital communications are timed out of their chat due to inactivity and must start over.

The results show: Wait time significantly impacts CSAT scores

We reviewed CSAT scores against the customer experience for more than 17,000 interactions. CSAT is scored on a 5-point scale; scores between one and three were consolidated into the negative class, while the top two scores were consolidated into the positive class.
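As a minimal sketch of that consolidation (the file and column names here are hypothetical), the binarization and a wait-time comparison might look like this:

```python
import pandas as pd

# Hypothetical columns: 'csat' (1-5) and 'queue_seconds' per interaction.
df = pd.read_csv("interactions.csv")

# Collapse the 5-point scale: 1-3 -> negative class, 4-5 -> positive class.
df["csat_positive"] = df["csat"] >= 4

# Favorable CSAT rate by how long the customer waited in queue.
df["queue_bucket"] = pd.cut(df["queue_seconds"], bins=[0, 30, 60, 120, 300, 3600])
print(df.groupby("queue_bucket", observed=True)["csat_positive"].mean())
```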

ASAPP—As customers are enqueued for a longer period of time (x-axis), the favorable CSAT rate decreases.
ASAPP—Agents typically take 20 to 60 seconds (x-axis) to respond to a message from the customer. Favorable CSAT rates (y-axis) decrease as agents take longer to respond.
ASAPP—A simulation of CSAT scores was conducted by taking repeated samples of conversations with agent transfers (blue) and conversations without transfers (pink). Note the difference in the number of positive CSAT scores observed (y-axis).

Negative CSAT rates (scores between 1 and 3) only occur 20% of the time, but when a customer is timed out, the negative CSAT rate jumps up to 80%.


How ASAPP provides opportunities for higher CSAT scores

Directed automation features such as “agent autosuggest” and automated conversation summary notes reduce agent response times. And, AI-driven knowledge base article retrieval models help agents streamline the troubleshooting process. This has the benefit of reducing current customer wait times for agent response, but also improves throughput, reducing queue times as well.

It is important that customers get to an agent that can actually solve their problems. ASAPP intent classification sorts conversations into different types based on the initial set of utterances by the customer. This classification helps match each customer with an appropriate agent and reduces the need for multiple transfers.

A queue check-in feature checks whether a queued customer is still available before routing them to an agent. This eliminates agents spending time trying to connect with customers who have already left the line.

As agents gain efficiency and communicate on asynchronous channels they’re able to handle multiple issues at once, further reducing queue times. Small gains in efficiency on an individual conversation level add up to larger effects on throughput—for each agent and for the whole CX team.

AI Research
R&D Innovations
Speech-to-text
Transcription
Articles

Addressing instabilities for few-sample BERT fine-tuning

by Felix Wu
Apr 29 · 2 mins

The costs of BERT fine-tuning on small datasets

Fine-tuning BERT or its variants has become one of the most popular and effective methods to tackle natural language processing tasks, especially those with limited data. BERT models have been downloaded more than 5.6 million times from Hugging Face’s public server.

However, fine-tuning remains unstable, especially when using the large variant of BERT (BERT-Large) on small datasets, arguably the most impactful use of BERT-style models. Identical learning processes with different random seeds often result in significantly different and sometimes degenerate models following fine-tuning, even though only a few, seemingly insignificant aspects of the learning process are affected by the random seed (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020). In layman’s terms: every time you train BERT for your task, you get different results. This means you need to train again and again to get a good system. This makes scientific comparison challenging (Dodge et al., 2020) and creates huge, potentially unnecessary costs.

While the variance comes from randomness, we hypothesize that the major cause of this instability lies in the optimization process.

Revisiting Few-sample BERT Fine-tuning

We conducted an extensive empirical analysis of BERT fine-tuning optimization behaviors on three aspects to identify the root cause of instability:

  1. The optimization algorithm: We found that omitting debiasing in the BERTAdam algorithm (Devlin et al., 2019) is the main cause of degenerate models.
  2. The initialization: We found that re-initializing the top few layers of BERT stabilizes the fine-tuning procedure.
  3. The number of training iterations: We found that the model still requires hundreds of updates to converge.

1. Optimization Algorithm

We observed that omitting debiasing in the BERTAdam algorithm (Devlin et al., 2019) is the leading cause of degenerate fine-tuning runs. The following is pseudo-code for the Adam algorithm (Kingma & Ba, 2014). BERTAdam omits lines 9 and 10, which are used to correct the biases in the first and second moment estimates.

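Since the pseudo-code figure isn't reproduced here, the following illustrative sketch shows a single Adam update with the two bias-correction steps marked; omitting them is what BERTAdam does.

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              bias_correction=True):
    """One Adam update (Kingma & Ba, 2014). The two bias-correction lines are
    the ones BERTAdam omits; pass bias_correction=False to mimic that."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2     # second moment
    m_hat, v_hat = state["m"], state["v"]
    if bias_correction:
        m_hat = state["m"] / (1 - beta1 ** state["t"])            # debias first moment
        v_hat = state["v"] / (1 - beta2 ** state["t"])            # debias second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

state = {"t": 0, "m": np.zeros(3), "v": np.zeros(3)}
theta = np.zeros(3)
theta = adam_step(theta, grad=np.ones(3), state=state)
```

In practice, any Adam implementation that keeps bias correction enabled (for example, the defaults in torch.optim.Adam or AdamW) has the same effect.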

Fine-tuning BERT with the original Adam (with bias correction) eradicates almost all degenerate model training outcomes and reduces the variance across multiple randomized trials. Here, we show the test performance distribution of 20 random trials with or without bias correction on four small datasets.


Since the variance is significantly reduced, practitioners can easily get a decent model within only one to five trials instead of fine-tuning up to 20 models and picking the best one.

2. Initialization

We hypothesized that the top pre-trained layers of BERT are specific to the pre-training task and may not transfer to a dissimilar downstream task. We propose to re-initialize the top few layers of BERT to ease the fine-tuning procedure. We plot the training curves with and without re-initialization below, showing consistent improvement for models with re-initialized output layers.
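As an illustration, a minimal sketch of this re-initialization with the Hugging Face transformers library might look like the following; the helper name and the choice of three layers are ours, and `_init_weights` is an internal method of the pretrained model classes.

```python
from transformers import BertForSequenceClassification

def reinit_top_layers(model, num_layers=3):
    """Re-initialize the top `num_layers` encoder layers (and the pooler)
    before fine-tuning, as described above."""
    for layer in model.bert.encoder.layer[-num_layers:]:
        for module in layer.modules():
            # Reuse the model's own init routine so the fresh weights follow
            # the same initialization scheme as pre-training.
            model._init_weights(module)
    if model.bert.pooler is not None:
        model._init_weights(model.bert.pooler.dense)
    return model

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model = reinit_top_layers(model, num_layers=3)
```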


The following figure shows the validation performance with different numbers of re-initialized layers. As we can see, re-initializing even a single layer is already beneficial, while the best number of layers to re-initialize depends on the downstream task.


3. Number of Training Iterations

We also studied the conventional 3-epoch fine-tuning setup of BERT. Through extensive experiments on various datasets, we observe that the widely adopted 3-epoch setup is insufficient for few-sample datasets. Even with few training examples, the model still requires hundreds of updates to converge.
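A small sketch of what this implies in practice, with purely illustrative numbers: on a few-sample dataset, three epochs can amount to only a few dozen optimizer updates, so it helps to budget a minimum number of steps instead.

```python
# Illustrative numbers only: a "few-sample" dataset and a typical batch size.
num_examples, batch_size = 1000, 32
updates_per_epoch = num_examples // batch_size          # ~31 updates
updates_in_3_epochs = 3 * updates_per_epoch             # ~93 updates

# Train for a fixed budget of steps rather than a fixed number of epochs.
min_updates = 500
num_train_steps = max(updates_in_3_epochs, min_updates)
print(num_train_steps)                                  # 500
```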

Revisiting Existing Methods for Few-sample BERT Fine-tuning

Instability in BERT fine-tuning, especially in few-sample settings, has been receiving significant attention recently. We revisited the methods proposed to address it in light of our analysis of the fine-tuning process, focusing on the impact of using the debiased Adam instead of BERTAdam.

To illustrate, the following figure shows the mean test performance and standard deviation on four datasets. “Int. Task” stands for transferring via an intermediate task (MNLI), “LLRD” stands for layerwise learning rate decay, “WD” stands for weight decay. Numbers that are statistically significantly better than the standard setting (left column) are in blue and underlined.


We found that the standard fine-tuning procedure using bias-corrected Adam already has a fairly small variance, making these more complex techniques largely unnecessary. Moreover, re-initialization and training longer can serve as simple yet hard-to-beat baselines that outperform previous methods, with the exception of “Int. Task” on RTE. The reason is that RTE is very similar to MNLI (the intermediate task).

Why this work matters

This work carefully investigates the current, broadly adopted optimization practices in BERT fine-tuning. Our findings significantly stabilize BERT fine-tuning on small datasets. Stable training has multiple benefits. It reduces deployment costs and time, potentially making natural language processing applications more feasible and affordable for companies and individuals with limited computational resources.

Our findings are focused on few-sample training scenarios, which opens, or at least eases, the way for new applications at reduced data cost. The reduction in cost broadens accessibility and reduces the energy footprint of BERT-based models. Applications that require frequent re-training are now easier and cheaper to deploy given the reduced training costs. This work also simplifies the scientific comparison between future fine-tuning methods by making training more stable, and therefore easier to reproduce.

“Stable training has multiple benefits. It reduces deployment costs and time, potentially making natural language processing applications more feasible and affordable for companies and individuals with limited computational resources.”

Felix Wu, PhD

Read the complete paper:

This work has been accepted and will be published at ICLR 2021. Visit our poster during the virtual conference—Poster Session 2: May 3, 2021, 9 a.m. PDT & May 3, 2021, 11 a.m. PDT—to talk with the authors.

AI Research
Automation
R&D Innovations
Speech-to-text
Transcription

Filling in the missing pieces for automation

by Yoav Artzi
Apr 23 · 2 mins

Natural language classification is widely adopted in many applications, such as user intent prediction, recommendation systems, and information retrieval. At ASAPP, natural language classification is a core technique behind our automation capabilities.

Conventional classification takes a single-step user query and returns a prediction. However, natural language input from consumers can be underspecified and ambiguous, for example when they are not experts in the domain.

“Natural language input can be hard to classify, but it’s critical to get classification right for accurate automation. Our research goes beyond conventional methods, and builds systems that interact with users to give them the best outcome.”

Yoav Artzi

For example, in an FAQ suggestion application, a user may issue the query “travel out of country”. The classifier will likely find multiple relevant FAQ candidates, as seen in the figure below. In such scenarios, a conventional classifier will just return one of the predictions, even if it is uncertain and the prediction may not be ideal.


We solve this challenge by collecting missing information from users to reduce ambiguity and improve the model prediction performance. My colleagues Lili Yu, Howard Chen, Sida Wang, Tao Lei and I described our approach in our ACL 2020 paper.

We take a low-overhead approach, and add limited interaction to intent classification. Our goal is two-fold:

  1. study the effect of interaction on the system performance, and
  2. avoid the cost and complexities of interactive data collection.

We add simple interactions on top of natural language intent classification, with minimal development overhead, through clarification questions that collect missing information. Those questions can be binary or multi-choice. For example, the question “Do you have an online account?” is binary, with “yes” or “no” as answers. And the question “What is your phone operating system?” is multi-choice, with “iOS”, “Android”, or “Windows” as answers. Given a question, the user responds to the system by selecting one answer from the set. At each turn, the system determines whether to ask an informative question, or to return the best prediction to the consumer.


The illustration above shows a running example of interactive classification in the FAQ suggestion domain. The consumer interacts with the system to find an intent from a list of possibilities. The interaction starts with the consumer’s initial query, “travel out of country”. As our system finds multiple good possible responses, highlighted on the right, it decides to ask a clarification question, “Do you need to activate global roaming service?” When the user responds with ‘yes’ it helps the system narrow down the best response candidate. After two rounds of interaction, a single good response is identified. Our system concludes the interaction by suggesting the FAQ document to the user. This is one full interaction, with the consumer’s initial query, system questions, consumer responses, and the system’s final response.

We select clarification questions to maximize the interaction efficiency, using an information gain criterion. Intuitively, we select the question whose answer provides the most information about the intent label. After receiving the consumer’s answer, we iteratively update our beliefs over intent labels using Bayes’ rule. Moreover, we balance the potential increase in accuracy against the cost of asking additional questions with a learned policy controller that decides whether to ask another question or return the final prediction.
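To make the selection criterion concrete, here is a minimal sketch (not our production code; the probability tables are hypothetical inputs) of scoring one clarification question by expected information gain, with the Bayes’ rule update inside:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def information_gain(p_label, p_answer_given_label):
    """Expected reduction in entropy over intent labels from asking one question.
    p_label[i]: current belief over intents.
    p_answer_given_label[a, i]: probability of answer a given intent i."""
    p_answer = p_answer_given_label @ p_label                    # marginal P(answer)
    gain = entropy(p_label)
    for a, pa in enumerate(p_answer):
        if pa > 0:
            posterior = p_answer_given_label[a] * p_label / pa   # Bayes' rule update
            gain -= pa * entropy(posterior)
    return gain

# Toy example: 3 intents, one binary (yes/no) question.
p_label = np.array([0.5, 0.3, 0.2])
p_answer_given_label = np.array([[0.9, 0.2, 0.1],   # P(yes | intent)
                                 [0.1, 0.8, 0.9]])  # P(no  | intent)
print(information_gain(p_label, p_answer_given_label))
```

The system would ask the highest-gain question, update its beliefs with the observed answer, and let the policy controller decide whether another question is worth its cost.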

We designed two non-interactive data collection tasks to train different model components. This allows us to crowdsource the data at large scale and build a robust system at low cost. Our modeling approach leverages natural language encoding, and enables us to handle unseen intents and unseen clarification questions, further alleviating the need for expensive annotations and improving the scalability of our model.


Our work demonstrates the power of adding user interaction in two tasks: FAQ suggestion and bird identification. The FAQ task provides a trouble-shooting FAQ suggestion given a user query in a virtual assistant application. The bird identification task helps identify bird species from a descriptive text query about bird appearance. When real users interact with our system, given at most five turns of interaction, our approach improves the accuracy of a no-interaction baseline by over 100% on both tasks for simulated evaluation and over 90% for human evaluation. Even a single clarification question provides significant accuracy improvements, 40% for FAQ suggestion and 65% for bird identification in our simulated analysis.

This work allows us to quickly build an interactive classification system that improves the customer experience by offering significantly more accurate predictions. It highlights how research and product complement each other at ASAPP: challenging product problems inspire interesting research ideas, and original research solutions improve product performance. Together with other researchers, Lili Yu is organizing the first workshop on interactive learning for natural language processing at the upcoming ACL 2021 conference to further discuss the methods, evaluation, and scenarios of interactive learning.

Automation
Customer Experience
Future of CX
Articles

Four AI-centered CX insights from Fortune 500 CXOs

by Michael Lawder
Apr 23 · 2 mins

In recent years, the rapid advance of new data and mobile applications has raised the bar for customer expectations. ASAPP and a group of forward-looking Chief Experience Officers (CXOs), representing companies with over $450 billion in market value, met to discuss how machine learning (ML), speech recognition, and natural language processing (NLP) are generating higher agent productivity, efficiency, and cost reductions.

Four key AI-centered insights arose from the CXOs who see 2021 as an opportunity to realize the promise of artificial intelligence to radically improve customer experience (CX).

1—Automation creates opportunity for emotive, human-driven service

On the surface, “automation” and “human-driven” seem like two opposing forces. A legacy approach considers automation solely to enable customer self-service, taking human agents out of the customer journey. While self-service will persist in specific applications, automating contact center agents’ repetitive tasks allows a focus on what matters most: providing excellent customer service and representing a brand positively.

AI is opening new avenues to create personalized experiences for customers. With an AI platform, agents know what the customer is facing at a given time, their history with the company, and their communication preferences. Automating administrative tasks like note-taking and multitasking frees agents to be stronger brand ambassadors, with more mental energy to spend on providing an emotive, high-touch response to customer needs.

2—AI-driven real-time insights are the next big opportunity for supervisors and coaches

Previously, an in-person presence at call centers afforded managers the ability to monitor and assist agents shoulder-to-shoulder. But in today’s digital workplace, managers have turned to less streamlined methods, such as webcams and Slack, to support agents. This approach has made it harder for managers to supervise and coach teams, and the introduction of new digital systems has added complexity for front-line agents.

CXOs are beginning to see the promise of ML, NLP, and automatic speech recognition technologies to power live voice transcription. These AI technologies enable managers to supervise and support agents in real time, guiding agents at the moment they need assistance. After each customer engagement, ML-generated reports and summaries allow managers to digest previous interactions, understand where agents are facing challenges, and improve agent performance. With AI-analyzed data, managers can adjust strategy and coaching in real time to nimbly respond to the business challenges they face.

In the near future, CXOs expect the confluence of ML, NLP, and automatic speech recognition technologies to provide insight for the next golden opportunity: determining caller intent to more rapidly detect what a caller needs, assess their emotional state, and have them automatically routed to the appropriate agent.

“CXOs are excited by the opportunities AI presents. They expect this technology to help their organizations be much more productive and at the same time, differentiate themselves by providing exceptional customer experience.”

Michael Lawder

3—Measure what matters for holistic data-driven decision making

Thanks to the advance of ML, businesses are able to scale pattern recognition and automation from their own data. In 2021, the businesses we speak to are going beyond “bean-counting” to unearth correlation-driven insights for strategic business decisions. Outliers and anecdotes are steadily coming together to illustrate, for example, that mobile device users are more willing to have synchronous conversations than desktop users—an insight which may affect routing processes. To detect these patterns, CX teams are looking to ensure that they have individuals with the knowledge to contextualize the data and to build systems to reliably measure it.

However, in the effort to become a digital-first business, building a comprehensive data lake remains a challenge. Businesses are still struggling to compile timely, quality data at a granularity that can be integrated with other data sets. The preservation of legacy systems and their architecture has led to continued data silos that make it hard for decision-makers to see the big picture in the customer journey. CX leaders should demand more from their IT teams and service providers to streamline this data and arm businesses and teams to make changes.

And it’s not just technical IT teams who have a responsibility in building this data treasury. All employees have a role in ensuring that the business is flagging data for data-driven decision making. The first step is making a cultural mindset shift to view data as an important corporate asset.

4—Today’s AI and digital technology shouldn’t be used with yesterday’s paradigm

Many of the à la carte solutions found in today’s contact centers were built for a different time. In decades past, businesses relied on outsourcing to balance costs and scale service which often came at the cost of the customer experience. In the 2010s, IVRs and chatbots offered a way to triage workloads but rarely provided a stellar experience for customers. Today, many contact centers are left sustaining a costly myriad of legacy systems that were not designed for a cohesive customer experience. A real transformation to improve customer experiences requires a rethink of how the customer journey operates.

At ASAPP, we’re doing this by putting the focus on making people better with AI. That has meant rethinking everything we create, from the ground up, for vertically integrated AI and human productivity. We’re changing how we measure ourselves and how we interact with customers. For example, IVRs and legacy systems may deliver cost savings, but they may actually exacerbate customer frustration. An analogy I like to use for this new paradigm: improving CX with legacy systems is like trying to make a train fly. Instead of spending significant and inefficient resources to make trains fly, at ASAPP we’re building an airplane.

Chief Experience Officers are excited by a future driven by AI: making organizations highly productive and effective by augmenting human activity and automating the world’s workflows. I can’t wait to see what new insights we’ll unearth at our next meeting.

Want to see what makes people so excited to partner with us? Get in touch, or give me a tweet at @michael_lawder.

AI Research
Machine Learning
R&D Innovations
Articles

From network compression to DenseNets

by Kilian Weinberger
Apr 7 · 2 mins

The history of artificial neural networks started in 1961 with the invention of the “Multi-Layer Perceptron” (MLP) by Frank Rosenblatt at Cornell University. Sixty years later, neural networks are everywhere: from self-driving cars and internet search engines, to chatbots and automated speech recognition systems.

The DenseNet architecture connects each layer directly with all subsequent layers (of the same size).

Shallow Networks

When Rosenblatt introduced his MLP he was limited by the computational capabilities of his time. The architecture was fairly simple: The neural network had an input layer, followed by a single hidden layer, which fed into the output neuron. In 1989 Kurt Hornik and colleagues from Vienna University of Technology proved that this architecture is a universal approximator, which loosely means that it can learn any function that is sufficiently smooth—provided the hidden layer has enough neurons and the network is trained on enough data. To this day, Hornik’s result is an important milestone in the history of machine learning, but it had some unintended consequences. As multiple layers were computationally expensive to train, and Hornik’s theorem proved that one could learn everything with just a single hidden layer, the community was hesitant to explore deep neural networks.

Deep Networks

Everything changed as cheap GPUs started to proliferate in the market. Suddenly matrix multiplications became fast, shrinking the additional overhead of deeper layers. The community soon discovered that multiple hidden layers allow a neural network to learn complicated concepts with surprisingly little data. By feeding the first hidden layer’s output into the second, the neural network could “reuse” concepts it learned early on in different ways. One way to think about this is that the first layer learns to recognize low-level features (e.g. edges, or round shapes in images), whereas the last layer learns high-level abstractions arising from combinations of these low-level features (e.g. “cat”, “dog”). Because the low-level concepts are shared across many examples, the networks can be far more data-efficient than a single-hidden-layer architecture.

Network Compression

One puzzling aspect of deep neural networks is the sheer number of parameters they learn. It is puzzling because one would expect an algorithm with so many parameters to simply overfit, essentially memorizing the training data without the ability to generalize well. In practice, however, that is not what one observes. In fact, quite the opposite: neural networks excel at generalization across many tasks.

In 2015 my students and I started wondering why that was the case. One hypothesis was that neural networks had millions of parameters but did not utilize them efficiently. In other words, their effective number of parameters could be smaller than their enormous architecture may suggest. To test this hypothesis we came up with an interesting experiment: if it is true that a neural network does not use all those parameters, we should be able to compress it into a much smaller size. Multilayer perceptrons store their parameters in matrices, and so we came up with a way to compress these weight matrices into a small vector, using the “hashing trick.” In our 2015 ICML paper Compressing Neural Networks with the Hashing Trick we showed that neural networks can be compressed to a fraction of their size without any noticeable loss in accuracy. In a fascinating follow-up publication, Song Han et al. showed in 2016 that if this practice is combined with clever compression algorithms one can reduce the size of neural networks even further. That work won the ICLR 2016 best paper award and started a network compression craze in the community.
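As a rough illustration of the idea (a simplification, not the paper’s exact implementation), the sketch below backs a “virtual” weight matrix with a much smaller shared parameter vector by hashing each entry’s index into a bucket:

```python
import numpy as np

def hashed_linear(x, w_small, in_dim, out_dim, seed=0):
    """Apply a 'virtual' out_dim x in_dim weight matrix whose entries all live
    in the much smaller shared vector w_small (the hashing trick)."""
    rng = np.random.default_rng(seed)   # stands in for a deterministic hash of (i, j)
    bucket = rng.integers(0, w_small.size, size=(out_dim, in_dim))
    sign = rng.choice([-1.0, 1.0], size=(out_dim, in_dim))
    W_virtual = sign * w_small[bucket]  # reconstruct the full matrix view on the fly
    return x @ W_virtual.T

# 256 shared parameters back a 128 x 64 (8,192-entry) weight matrix.
w_small = np.random.randn(256) * 0.05
y = hashed_linear(np.random.randn(10, 64), w_small, in_dim=64, out_dim=128)
print(y.shape)  # (10, 128)
```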

“In a nutshell, we forced the network to store similar concepts across neighboring layers by randomly dropping entire layers during the training process. With this method, we could show that by increasing the redundancy we were able to train networks with over 1000 layers and still improve generalization error.”

Kilian Weinberger, PhD

Stochastic Depth

Neural network compression has many intriguing applications, ranging from automatic speech recognition on mobile phones to embedded devices. However, the research community was still wondering about the phenomenon of parameter redundancy within neural networks. The success of network compression seemed to suggest that many parameters are redundant, so we were wondering if we could utilize this redundancy to our advantage. The hypothesis was that if redundancy is indeed beneficial to learning deep networks, maybe controlling it would allow us to learn even deeper neural networks. In our 2016 ECCV paper Deep Networks with Stochastic Depth, we came up with a mechanism to increase the redundancy in neural networks. In a nutshell, we forced the network to store similar concepts across neighboring layers by randomly dropping entire layers during the training process. With this method, we could show that by increasing the redundancy we were able to train networks with over 1000 layers and still improve generalization error.
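A minimal PyTorch sketch of that layer-dropping mechanism, simplified with a constant survival probability, might look like this:

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Wrap a residual block so that, during training, the whole block is
    skipped with probability 1 - p_survive; at test time its output is
    scaled by p_survive instead."""
    def __init__(self, block: nn.Module, p_survive: float = 0.8):
        super().__init__()
        self.block = block
        self.p_survive = p_survive

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p_survive:
                return x + self.block(x)   # block survives this forward pass
            return x                       # entire layer dropped
        return x + self.p_survive * self.block(x)
```

In the paper the survival probability decays linearly with depth; a constant value keeps the sketch short.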

DenseNets

The success of stochastic depth was scientifically intriguing, but as a method, it was a strange algorithm. In some sense, we created extremely deep neural networks (with over 1000 layers) and then made them so ineffective that the network as a whole didn’t overfit. Somehow this seemed like the wrong approach. We started wondering if we could create an architecture that had similarly strong generalization properties but wasn’t as inefficient.

On the popular CIFAR-10 data set, DenseNets with 0.8M parameters outperform ResNet-110 with 1.7M parameters.

One hypothesis for why the increase in redundancy helped so much was that by forcing layers throughout the network to extract similar features, the early low-level features remained available even to later layers. Maybe they were still useful when higher-level features were extracted. We therefore started experimenting with additional skip connections that would connect every layer to every subsequent layer. The idea was that in this way each layer has access to all the previously extracted features (a rough sketch follows the list below), which has three interesting advantages:

  1. It allows all layers to use all previously extracted features.
  2. The gradient flows directly from the loss function to every layer in the network.
  3. We can substantially reduce the number of parameters in the network.
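Here is a rough PyTorch sketch of that connectivity pattern: a simplified dense block that omits the bottleneck and transition layers of the full DenseNet.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all previous feature maps and
    contributes its own `growth_rate` new channels to the shared state."""
    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all earlier features
            features.append(out)
        return torch.cat(features, dim=1)
```

Because each layer contributes only growth_rate new channels to the shared state, the block stays parameter efficient even as it grows deeper.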

Our initial results with this architecture were very exciting. We could create much smaller networks than the previous state-of-the-art, ResNets, and even outperform stochastic depth. We refer to this architecture as DenseNets, and the corresponding publication was honored with the 2017 CVPR best paper award.

A comparison of the DenseNet and ResNet architecture on CIFAR-10. The DenseNet is more accurate and parameter efficient.

If previous networks could be interpreted as extracting a “state” that is modified and passed on from one layer to the next, DenseNets changed this setup so that each layer has access to all the “knowledge” extracted from all previous layers and adds its own output to this collective state. Instead of copying features from one layer to the next, over and over, the network can use its limited capacity to learn new features. Consequently, DenseNets are far more parameter efficient than previous networks and result in significantly more accurate predictions. For example, on the popular CIFAR-10 benchmark dataset, they almost halved the error rate of ResNets. Most impressively, out-of-the-box, they achieved new record performance on the three most prominent image classification data sets of the time: CIFAR-10, CIFAR-100, and ImageNet.

(Left) The chaotic loss landscape of a ResNet-50 with skip connections removed. (Right) The convex loss landscape of a DenseNet-121 (with skip connections). [Figure from Li et al., “Visualizing the Loss Landscape of Neural Nets” (2018)]

There may be other benefits from the additional skip connections. In 2017, Li et al. examined the loss surface around the local minima that neural networks converge to. They found that as networks became deeper, these surfaces became highly non-convex and chaotic, making it harder to find a local minimum that generalizes beyond the training data. Skip connections smooth out these surfaces, aiding the optimization process. The exact reasons are still a topic of open research.
