Felix Wu

Felix Wu, PhD is a Research Scientist at ASAPP. He received his Ph.D. in Computer Science from Cornell University under the supervision of Prof. Kilian Q. Weinberger and his B.S. in Computer Science and Information Engineering from National Taiwan University. His research interests include Machine Learning and its applications, such as Natural Language Processing and Computer Vision. Recently, he has been focusing on designing efficient neural models.

Articles

Wav2vec could be more efficient, so we created our own pre-trained ASR Model for better Conversational AI.

by Felix Wu · Feb 3 · 2 mins

In recent years, research efforts in natural language processing and computer vision have worked to improve the efficiency of pre-trained models, in order to avoid the financial and environmental costs of training and fine-tuning them. Speech, however, has seen far fewer such efforts. And beyond cheaper training of pre-trained models, efficiency gains in speech can also be traded for better performance at similar inference times.

Today, Wav2vec 2.0 (W2V2) is arguably the most popular approach to self-supervised training in speech. It has received a great deal of attention, with follow-up work applying pre-trained W2V2 models to various downstream applications including speech-to-text translation (Wang et al., 2021) and named entity recognition (Shon et al., 2021). Yet we hypothesize that several sub-optimal design choices in the model architecture make it relatively inefficient. To test this hypothesis, we conducted a series of experiments on different components of the W2V2 architecture and exposed the performance-efficiency tradeoff of the W2V2 design space: higher performance (lower word error rate in ASR) requires a larger pre-trained model and comes at the cost of lower efficiency (slower inference). Can we achieve a better tradeoff, i.e., similar performance with higher inference speed?

What do we propose instead? A more efficient pre-trained model that also achieves better performance through its efficiency gains.

Squeezed and Efficient Wav2vec (SEW)

Based on these observations, we propose SEW (Squeezed and Efficient Wav2vec) and SEW-D (SEW with Disentangled attention), which achieve a much better performance-efficiency tradeoff: with a 1.9x inference speedup, our smaller SEW-D-mid achieves a 13.5% WERR (word error rate reduction) compared to W2V2-base on academic datasets. Our larger SEW-D-base+ model performs close to W2V2-large while operating at the same speed as W2V2-base, and it takes only 1/4 of the training epochs to outperform W2V2-base, which significantly reduces the pre-training cost.

SEW differs from the conventional W2V2 model in three major ways:

  1. We introduce a compact waveform feature extractor that allocates computation more evenly across layers. This makes the model faster without sacrificing performance.
  2. We propose a "squeeze context network" that downsamples the audio sequence, reducing computation and memory usage. This allows us to use a larger model without sacrificing inference speed (see the sketch after this list).
  3. We introduce MLP predictor heads during pre-training that improve performance without any overhead in downstream applications, since they are discarded after pre-training.
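
To make the squeeze idea concrete, here is a minimal PyTorch sketch of a downsample-transformer-upsample block. The pooling and upsampling operators, dimensions, and layer counts are illustrative assumptions, not the released SEW architecture.

```python
import torch
import torch.nn as nn

class SqueezedContextNetwork(nn.Module):
    """Toy downsample -> transformer -> upsample block: the quadratic-cost
    attention layers only see a shortened sequence."""

    def __init__(self, dim=512, num_layers=4, squeeze_factor=2):
        super().__init__()
        # Halve the frame rate before the expensive transformer layers.
        self.pool = nn.AvgPool1d(kernel_size=squeeze_factor, stride=squeeze_factor)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Restore the original frame rate for frame-level outputs and losses.
        self.upsample = nn.Upsample(scale_factor=squeeze_factor, mode="nearest")

    def forward(self, x):                                       # x: (batch, time, dim)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)        # (batch, time/2, dim)
        x = self.encoder(x)                                     # attention over half the frames
        x = self.upsample(x.transpose(1, 2)).transpose(1, 2)    # back to (batch, time, dim)
        return x

feats = torch.randn(1, 500, 512)                # roughly 10 s of audio at a 20 ms frame rate
print(SqueezedContextNetwork()(feats).shape)    # torch.Size([1, 500, 512])
```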

SEW-D further replaces standard self-attention with the disentangled self-attention proposed in DeBERTa (He et al., 2020), which achieves better performance with half the number of parameters and a significant reduction in both inference time and memory footprint.

Why it matters

These pre-trained models open the door to cost savings and/or performance gains for a range of downstream models in automatic speech recognition, speaker identification, intent classification, emotion recognition, sentiment analysis, and named entity recognition. The speedup of a pre-trained model transfers directly to the downstream models: because the pre-trained model is smaller and faster, the fine-tuned downstream model is also smaller and faster. These efficiency gains reduce not only training and fine-tuning time but also the latency actually observed in products. Conversational AI systems using the SEW pre-trained models will be better able to detect what consumers are saying, who is saying it, and how they feel, while providing faster response times.

“The SEW speech models by ASAPP are faster and require less memory, without sacrificing recognition quality,” explains Anton Lozhkov, Machine Learning Engineer at Hugging Face. “The architecture improvements proposed by the team are very easy to apply to other existing Wav2Vec-based models – essentially granting performance gains for free in applications such as automatic speech recognition, speaker identification, intent classification, and emotion recognition.”

Want to use the pre-trained models from ASAPP? See our paper and open-source code for more details. Our pre-trained models are also available in Hugging Face's transformers library and model hub. The paper has been accepted and will appear at ICASSP 2022; please feel free to reach out to the authors during the poster session at the conference.
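
For example, a fine-tuned SEW-D checkpoint can be loaded from the Hugging Face hub and used for greedy CTC transcription as sketched below. The checkpoint name is one example; browse the asapp organization on the hub for the full list of SEW and SEW-D models.

```python
import torch
from datasets import load_dataset
from transformers import SEWDForCTC, Wav2Vec2Processor

# Checkpoint name is illustrative; see https://huggingface.co/asapp for others.
model_id = "asapp/sew-d-tiny-100k-ft-ls100h"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = SEWDForCTC.from_pretrained(model_id)

# A tiny LibriSpeech sample (16 kHz mono) commonly used in transformers examples.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```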

Addressing instabilities for few-sample BERT fine-tuning

by Felix Wu · Apr 29 · 2 mins

The costs of BERT fine-tuning on small datasets

Fine-tuning BERT or its variants has become one of the most popular and effective methods for tackling natural language processing tasks, especially those with limited data. BERT models have been downloaded more than 5.6 million times from Hugging Face's public server.

However, fine-tuning remains unstable, especially when using the large variant of BERT (BERT-Large) on small datasets, arguably the most impactful use of BERT-style models. Identical learning processes with different random seeds often result in significantly different, and sometimes degenerate, models after fine-tuning, even though only a few seemingly insignificant aspects of the learning process are affected by the random seed (Phang et al., 2018; Lee et al., 2020; Dodge et al., 2020). In layman's terms: every time you train BERT for your task, you get different results, so you need to train again and again to get a good system. This makes scientific comparison challenging (Dodge et al., 2020) and creates large, potentially unnecessary costs.

While the variance comes from randomness, we hypothesize that the major cause of this instability lies in the optimization process.

Revisiting Few-sample BERT Fine-tuning

We conducted an extensive empirical analysis of BERT fine-tuning optimization behavior across three aspects to identify the root cause of the instability:

  1. The optimization algorithm: we found that omitting debiasing in the BERTAdam algorithm (Devlin et al., 2019) is the main cause of degenerate models.
  2. The initialization: we found that re-initializing the top few layers of BERT stabilizes the fine-tuning procedure.
  3. The number of training iterations: we found that the model still requires hundreds of updates to converge.

1. Optimization Algorithm

We observed that omitting debiasing in the BERTAdam algorithm (Devlin et al., 2019) is the leading cause of degenerate fine-tuning runs. The following is pseudo-code for the Adam algorithm (Kingma & Ba, 2014); BERTAdam omits lines 9 and 10, which correct the biases in the first and second moment estimates.

Figure: Pseudo-code of the Adam algorithm. BERTAdam omits lines 9 and 10, which correct the biases in the first and second moment estimates.
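
For readers who prefer code to pseudo-code, here is a minimal NumPy sketch of a single Adam update; the two bias-correction lines marked in the comments are the ones BERTAdam drops. The function name and signature are ours, for illustration only.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2014) at step t >= 1."""
    m = beta1 * m + (1 - beta1) * grad           # update biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # update biased second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction (omitted by BERTAdam)
    v_hat = v / (1 - beta2 ** t)                 # bias correction (omitted by BERTAdam)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Early in training, when t is small and m, v are still close to their zero initialization, the correction scales updates down, acting like an implicit warmup; omitting it makes the first updates larger and the short fine-tuning run less stable. In the transformers library, the legacy AdamW optimizer exposes this choice as correct_bias (BERTAdam behavior corresponds to correct_bias=False), while torch.optim.AdamW always applies the correction.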

Fine-tuning BERT with the original Adam (with bias correction) eradicates almost all degenerate model training outcomes and reduces the variance across multiple randomized trials. Here, we show the test performance distribution of 20 random trials with or without bias correction on four small datasets.

Figure: Test performance distribution of 20 random trials with and without bias correction on four small datasets.

Since the variance is significantly reduced, practitioners can easily get a decent model within only one to five trials instead of fine-tuning up to 20 models and picking the best one.

2. Initialization

We hypothesized that the top pre-trained layers of BERT are specific to the pre-training task and may not transfer to a dissimilar downstream task. We propose to re-initialize the top few layers of BERT to ease the fine-tuning procedure. We plot the training curves with and without re-initialization below, showing consistent improvement for models with re-initialized output layers.

Figure: Training curves with and without re-initialization, showing consistent improvement for models with re-initialized output layers.

The following figure shows the validation performance with different numbers of re-initialized layers. As we can see, re-initializing a single layer is already beneficial, while the best number of layers to re-initialize depends on the downstream task.

Figure: Validation performance with different numbers of re-initialized layers. Re-initializing a single layer is already beneficial, while the best number of layers to re-initialize depends on the downstream task.
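
As a rough illustration of the re-initialization step, the sketch below resets the top few encoder layers (and the pooler) of a Hugging Face BERT classifier to BERT's standard normal/zero initialization, with the standard deviation taken from config.initializer_range. The helper name and the choice of which modules to reset are our own illustration, not the paper's released code.

```python
import torch.nn as nn
from transformers import BertForSequenceClassification

def reinit_top_bert_layers(model, num_layers):
    """Re-initialize the top `num_layers` encoder layers and the pooler in place."""
    std = model.config.initializer_range                  # 0.02 for standard BERT configs
    blocks = list(model.bert.encoder.layer[-num_layers:])
    if model.bert.pooler is not None:
        blocks.append(model.bert.pooler)
    for block in blocks:
        for module in block.modules():
            if isinstance(module, nn.Linear):
                nn.init.normal_(module.weight, mean=0.0, std=std)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
            elif isinstance(module, nn.LayerNorm):
                nn.init.ones_(module.weight)
                nn.init.zeros_(module.bias)
    return model

model = BertForSequenceClassification.from_pretrained("bert-large-uncased", num_labels=2)
reinit_top_bert_layers(model, num_layers=5)   # the best depth is task-dependent
```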

3. Number of Training Iterations

We also studied the conventional 3-epoch fine-tuning setup of BERT. Through extensive experiments on various datasets, we observe that the widely adopted 3-epoch setup is insufficient for few-sample datasets: even with few training examples, the model still requires hundreds of updates to converge.

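In practice, this suggests budgeting a fixed number of optimization steps rather than a fixed three epochs. Below is a hedged sketch using Hugging Face's TrainingArguments; the hyperparameter values are illustrative, not the paper's.

```python
from transformers import TrainingArguments

# With roughly 1,000 training examples and a batch size of 16, three epochs is
# fewer than 200 updates; a fixed step budget ensures the model trains long enough.
args = TrainingArguments(
    output_dir="bert-few-sample",
    max_steps=1_000,                   # overrides num_train_epochs when set
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    seed=42,
)
```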

Revisiting Existing Methods for Few-sample BERT Fine-tuning

Instability in BERT fine-tuning, especially in few-sample settings, has been receiving significant attention recently. We revisited these methods given our analysis of the fine-tuning process, focusing on the impact of using the debiased Adam instead of BERTAdam.

To illustrate, the following figure shows the mean test performance and standard deviation on four datasets. "Int. Task" stands for transferring via an intermediate task (MNLI), "LLRD" stands for layerwise learning rate decay (sketched below), and "WD" stands for weight decay. Numbers that are statistically significantly better than the standard setting (left column) are shown in blue and underlined.

Figure: Mean test performance and standard deviation on four datasets under the different fine-tuning methods.
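
For reference, layerwise learning rate decay assigns geometrically smaller learning rates to lower layers. A minimal sketch of the idea, using parameter groups for torch.optim.AdamW, is shown below; the helper, decay factor, and learning rate are illustrative assumptions, not a prescribed recipe.

```python
import torch
from transformers import BertForSequenceClassification

def layerwise_lr_groups(model, base_lr=2e-5, decay=0.95):
    """Parameter groups with geometrically decayed LRs for lower BERT layers (LLRD)."""
    num_layers = model.config.num_hidden_layers
    head = list(model.classifier.parameters())
    if model.bert.pooler is not None:
        head += list(model.bert.pooler.parameters())
    groups = [{"params": head, "lr": base_lr}]
    for i, layer in enumerate(model.bert.encoder.layer):        # layer 0 is the lowest
        groups.append({"params": layer.parameters(),
                       "lr": base_lr * decay ** (num_layers - i)})
    groups.append({"params": model.bert.embeddings.parameters(),
                   "lr": base_lr * decay ** (num_layers + 1)})
    return groups

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(layerwise_lr_groups(model), lr=2e-5, weight_decay=0.01)
```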

We found that the standard fine-tuning procedure using bias-corrected Adam already has fairly small variance, making these more complex techniques largely unnecessary. Moreover, re-initialization and longer training serve as simple yet hard-to-beat baselines that outperform the previous methods, with the exception of "Int. Task" on RTE; the reason is that RTE is very similar to MNLI (the intermediate task).

Why this work matters

This work carefully investigates the current, broadly adopted optimization practices in BERT fine-tuning. Our findings significantly stabilize BERT fine-tuning on small datasets. Stable training has multiple benefits. It reduces deployment costs and time, potentially making natural language processing applications more feasible and affordable for companies and individuals with limited computational resources.

Our findings focus on few-sample training scenarios, which opens, or at least eases, the way for new applications at reduced data cost. The lower cost broadens accessibility and reduces the energy footprint of BERT-based models. Applications that require frequent re-training become easier and cheaper to deploy given the reduced training costs. This work also simplifies scientific comparison between future fine-tuning methods by making training more stable, and therefore easier to reproduce.

Read The Complete Paper:

This work has been accepted and will be published at ICLR 2021. Visit our poster during the virtual conference (Poster Session 2: May 3, 2021, 9 a.m. PDT and 11 a.m. PDT) to talk with the authors.
