8 Innovative BERT Knowledge Distillation Papers That Have Changed The Landscape of NLP


Each of these papers presents a particular perspective and set of findings on how BERT can be used and compressed.

This article summarizes eight papers carefully selected from the many related to BERT knowledge distillation. NLP model compression and acceleration is an active area of research and is widely adopted in industry to deliver low-latency features and services to end users.

To put it bluntly, the BERT model converts words into numbers, which lets you train machine learning models on text data. Why? Because machine learning models take numbers as input, not words.
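To see this concretely, here is a minimal sketch (my own addition, not from the papers below) that uses the Hugging Face transformers library to turn a sentence into the numeric vectors BERT produces:

```python
# Minimal sketch (not from the article): turning text into numbers with a
# pre-trained BERT encoder via the Hugging Face transformers library.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Knowledge distillation makes BERT smaller.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)word token, e.g. shape [1, num_tokens, 768];
# these numbers are what downstream models are trained on.
print(outputs.last_hidden_state.shape)
```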

  Image from Devlin et al., 2019

First and foremost, BERT is a language model that boosts performance on several tasks. BERT (Bidirectional Encoder Representations from Transformers), published in 2018, caused a stir in the machine learning community by achieving state-of-the-art results across a broad spectrum of NLP tasks, including language understanding and question answering.

The main attraction of BERT is its use of bidirectional training of the Transformer, a prominent attention model for language modeling; this bidirectional context is a big part of what makes BERT perform so well.


DistilBERT, a distilled version of BERT: smaller, faster, cheaper, and lighter

The authors propose a technique to pre-train a smaller general-purpose language representation model, termed DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, they leverage knowledge distillation during the pre-training phase. They show that it is possible to reduce the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. The loss is threefold, combining a language modeling loss, a distillation loss, and a cosine-distance loss. Training used the same corpus as the original BERT model, and DistilBERT was trained on eight 16GB V100 GPUs for around 90 hours.

For an input x, the teacher outputs a probability vector by applying a softmax to its logits z_T(x):

T(x) = softmax(z_T(x))

Keep that softmax (and the notation that comes with it) in mind; we’ll get back to it later. If we want T and S to be close, we can apply a cross-entropy loss to S with T as the target. That is what we call the teacher-student cross-entropy loss:

L_ce = -Σ_i T(x)_i · log S(x)_i

where T(x) is the teacher's output vector and S(x) is the student's output vector (Source).
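Below is a hedged sketch of this threefold objective in PyTorch; the tensors are random stand-ins for real teacher and student outputs, and the loss weights used in the actual paper are omitted:

```python
# Hedged sketch of a DistilBERT-style objective: soft-target cross-entropy
# between teacher and student, the usual masked-LM loss, and a cosine loss
# that aligns hidden states. Tensors are random stand-ins for model outputs.
import torch
import torch.nn.functional as F

batch, vocab, hidden, T = 8, 30522, 768, 2.0  # T is the softmax temperature

teacher_logits = torch.randn(batch, vocab)      # teacher logits z_T(x)
student_logits = torch.randn(batch, vocab)      # student logits z_S(x)
mlm_labels = torch.randint(0, vocab, (batch,))  # masked-token hard targets
teacher_hidden = torch.randn(batch, hidden)
student_hidden = torch.randn(batch, hidden)

# 1) Teacher-student cross-entropy on temperature-softened distributions.
soft_teacher = F.softmax(teacher_logits / T, dim=-1)
log_soft_student = F.log_softmax(student_logits / T, dim=-1)
loss_ce = -(soft_teacher * log_soft_student).sum(dim=-1).mean()

# 2) Standard masked language modeling loss on the hard labels.
loss_mlm = F.cross_entropy(student_logits, mlm_labels)

# 3) Cosine-distance loss aligning teacher and student hidden vectors.
loss_cos = F.cosine_embedding_loss(student_hidden, teacher_hidden,
                                   torch.ones(batch))

loss = loss_ce + loss_mlm + loss_cos  # per-term weights omitted for brevity
```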

Key takeaway: This is a distillation technique applied during the pre-training phase: the student is trained against an already pre-trained BERT teacher, yielding a general-purpose distilled model that can then be fine-tuned on downstream tasks.

Distilling Knowledge Learned in BERT for Text Generation

This paper presents a generic technique for using pre-trained language models to improve text generation, without parameter sharing, feature extraction, or augmentation with auxiliary tasks. Their proposed Conditional MLM mechanism leverages unsupervised language models pre-trained on a large corpus, then adapts them to supervised sequence-to-sequence tasks. The distillation approach they offer influences the text generation model only indirectly, by providing soft-label distributions, and is hence model-agnostic. The key points are mentioned below.

  Illustration of distilling knowledge from BERT for text generation. Source  

Furthermore, the knowledge distillation technique used here is the same as in the original knowledge distillation paper (Hinton et al., 2015): the student network is trained on the soft labels generated by the teacher network.

So, what makes this research paper stand out from the rest? Here is the explanation.

The key idea here is to distill the knowledge in BERT into a student model that can generate text, whereas previous works focused on compressing a model so it could do the same task as its teacher. BERT is first fine-tuned so that the fine-tuned model can then be used to teach the text generation model.

Let’s take language translation as the use case: X is the source-language sentence and Y is the target-language sentence.

First Phase: Fine-tuning of the BERT model

Second Phase: Knowledge distillation of fine-tuned BERT model to Seq2Seq model
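As a rough illustration of the second phase, here is a hedged PyTorch sketch of training a seq2seq student on a mix of hard reference labels and the soft word-level distributions produced by the fine-tuned BERT teacher; all tensors and the interpolation weight are illustrative stand-ins, not the paper's exact formulation:

```python
# Hedged sketch of soft-label distillation into a seq2seq student: the
# fine-tuned BERT teacher supplies a distribution over the vocabulary at each
# target position, mixed with the usual hard-label loss on the reference Y.
import torch
import torch.nn.functional as F

batch, tgt_len, vocab = 4, 12, 32000
alpha = 0.5  # illustrative weight between hard and soft losses

teacher_probs = torch.randn(batch, tgt_len, vocab).softmax(dim=-1)  # soft labels from BERT
student_logits = torch.randn(batch, tgt_len, vocab, requires_grad=True)
gold_tokens = torch.randint(0, vocab, (batch, tgt_len))             # reference translation Y

log_p_student = F.log_softmax(student_logits, dim=-1)

# Soft-label (distillation) loss: cross-entropy against the teacher distribution.
loss_soft = -(teacher_probs * log_p_student).sum(dim=-1).mean()

# Hard-label loss: negative log-likelihood of the reference tokens.
loss_hard = F.nll_loss(log_p_student.view(-1, vocab), gold_tokens.view(-1))

loss = alpha * loss_hard + (1 - alpha) * loss_soft
loss.backward()
```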

TinyBERT: Distilling BERT for Natural Language Understanding

The paper proposes a novel Transformer distillation technique designed specifically for knowledge distillation (KD) of Transformer-based models. With this KD approach, the large amount of knowledge encoded in a big teacher BERT can be effectively transferred to a small student TinyBERT. The authors also introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both BERT's general-domain and task-specific knowledge.

TinyBERT with four layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark while being 7.5x smaller and 9.4x faster at inference. TinyBERT with four layers is also significantly better than 4-layer state-of-the-art baselines for BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with six layers performs on par with its teacher BERT-Base.

Moreover, the paper proposes three main components for distilling Transformer networks: embedding-layer distillation, Transformer-layer distillation (covering attention matrices and hidden states), and prediction-layer distillation.

The main steps in TinyBERT training are general distillation on a large general-domain corpus, followed by task-specific distillation on the downstream task (a sketch of the layer-wise losses follows the key takeaway below).

Key takeaway: This is an offline distillation technique where the teacher model, BERT, is already pre-trained. Two separate distillations are then performed: one for general learning and another for task-specific learning. The general distillation step involves distilling all kinds of layers: attention layers, embedding layers, and prediction layers.
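Below is a hedged sketch of the layer-wise distillation losses described above for a single mapped teacher-student layer pair; the shapes, the projection layer, and the temperature are illustrative assumptions rather than the paper's exact configuration:

```python
# Hedged sketch of TinyBERT-style layer-wise distillation for one mapped
# layer pair: MSE on attention matrices, MSE on (projected) hidden states,
# and a soft cross-entropy on the prediction layer.
import torch
import torch.nn.functional as F

batch, heads, seq, d_teacher, d_student, classes, T = 2, 12, 16, 768, 312, 3, 1.0

attn_teacher = torch.randn(batch, heads, seq, seq)
attn_student = torch.randn(batch, heads, seq, seq)
hid_teacher = torch.randn(batch, seq, d_teacher)
hid_student = torch.randn(batch, seq, d_student)
logits_teacher = torch.randn(batch, classes)
logits_student = torch.randn(batch, classes)

# Learnable projection so the smaller student hidden size can be compared
# against the teacher's larger hidden size.
proj = torch.nn.Linear(d_student, d_teacher)

loss_attn = F.mse_loss(attn_student, attn_teacher)
loss_hidden = F.mse_loss(proj(hid_student), hid_teacher)
loss_pred = -(F.softmax(logits_teacher / T, dim=-1)
              * F.log_softmax(logits_student / T, dim=-1)).sum(dim=-1).mean()

# In practice these terms are summed over all mapped layers.
loss = loss_attn + loss_hidden + loss_pred
```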

FastBERT: a Self-distilling BERT with Adaptive Inference Time

They propose a novel speed-tunable FastBERT with adaptive inference time. The speed at inference can be flexibly adjusted under varying demands, while redundant calculation on easy samples is avoided. Moreover, the model adopts a unique self-distillation mechanism for fine-tuning, further enabling greater computational efficiency with minimal loss in performance. The model achieves promising results on twelve English and Chinese datasets and can be sped up by a factor of 1 to 12 relative to BERT, depending on the speedup threshold chosen for the speed-performance tradeoff.

This work is the first to apply self-distillation (during the training phase) and an adaptive mechanism (during the inference phase) to NLP language models for efficiency improvements.

The FastBERT model consists of a backbone and branches: the backbone comprises the embedding layer, the stack of transformer encoder layers, and a final teacher classifier, while the branches are small student classifiers attached to the output of each encoder layer.

It uses separate training steps for the backbone and the student classifiers; the parameters of one module are frozen while the other is being trained. There are three steps: (1) pre-training of the backbone (standard BERT pre-training, or simply loading a publicly released checkpoint); (2) fine-tuning of the backbone, together with the teacher classifier, on the downstream task; and (3) self-distillation of the branch (student) classifiers against the teacher classifier's soft outputs, which requires no additional labeled data.

Let’s talk about inference time. With FastBERT, the inference is performed adaptively, i.e., the number of executed encoding layers within the model can be adjusted according to input sample complexity.

At each transformer layer, the uncertainty of the student classifier's output is computed, and inference terminates early if that uncertainty falls below a given threshold. Here is how the adaptive inference mechanism works:
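The sketch below is a hedged illustration of that mechanism; `encoder_layers` and `branch_classifiers` are hypothetical stand-in modules, and the normalized-entropy uncertainty measure follows the description above:

```python
# Hedged sketch of FastBERT-style adaptive inference: after each transformer
# layer a small branch classifier produces a distribution; if its normalized
# entropy (the "uncertainty") falls below a speed threshold, inference stops.
import math
import torch

def uncertainty(probs: torch.Tensor) -> float:
    """Entropy of the distribution, normalized to [0, 1]."""
    n = probs.numel()
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    return (entropy / math.log(n)).item()

def adaptive_forward(hidden, encoder_layers, branch_classifiers, speed=0.5):
    for layer, classifier in zip(encoder_layers, branch_classifiers):
        hidden = layer(hidden)
        probs = classifier(hidden).softmax(dim=-1)
        if uncertainty(probs) < speed:   # confident enough: exit early
            return probs
    return probs                         # fell through to the last layer

# Tiny demo with stand-in modules (identity "encoder layers", linear branches).
layers = [torch.nn.Identity() for _ in range(4)]
branches = [torch.nn.Linear(8, 3) for _ in range(4)]
print(adaptive_forward(torch.randn(8), layers, branches, speed=0.9))
```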

Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

In this paper, the authors show that simple, lightweight neural networks can be made competitive without architecture modifications, external training data, or additional input features. They propose distilling knowledge from BERT into a single-layer bidirectional long short-term memory network (BiLSTM), as well as its siamese counterpart for sentence-pair tasks. Across multiple datasets in paraphrasing, natural language inference, and sentiment classification, they achieve results comparable to ELMo while using roughly 100 times fewer parameters and 15 times less inference time. Their approach uses a fine-tuned BERT as the teacher and a BiLSTM as the student. The primary motivation of this work is illustrated below.

  Source | Reference video by paper authors  

A small dataset may not be sufficient for the teacher to express its knowledge fully, so the training set is augmented with a large unlabeled dataset whose pseudo-labels are generated by the teacher model. The work proposes a few heuristics for task-agnostic data augmentation, such as randomly masking words, POS-guided word replacement, and n-gram sampling (a sketch of the masking heuristic follows below).
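As an illustration, here is a hedged sketch of the masking heuristic combined with teacher pseudo-labeling; `teacher_soft_labels` is a hypothetical placeholder standing in for a fine-tuned BERT teacher:

```python
# Hedged sketch of one augmentation heuristic: randomly masking tokens to
# create synthetic sentences, then labeling them with the teacher's soft
# outputs so the small BiLSTM student has more training signal.
import random

def mask_augment(sentence: str, p_mask: float = 0.1, n_samples: int = 3):
    """Generate synthetic variants by replacing random words with [MASK]."""
    words = sentence.split()
    variants = []
    for _ in range(n_samples):
        masked = [w if random.random() > p_mask else "[MASK]" for w in words]
        variants.append(" ".join(masked))
    return variants

def teacher_soft_labels(text: str):
    # Placeholder: in practice, return the fine-tuned BERT teacher's logits.
    return [0.0, 0.0]

augmented_data = []
for sent in ["the movie was surprisingly good", "a tedious and joyless film"]:
    for variant in [sent] + mask_augment(sent):
        augmented_data.append((variant, teacher_soft_labels(variant)))

print(len(augmented_data), "pseudo-labeled training examples")
```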

Patient Knowledge Distillation for BERT Model Compression

The authors propose a Patient Knowledge Distillation (PKD) approach to compress an original large model (the teacher) into an equally effective, lightweight, shallow network (the student). Their method differs from previous knowledge distillation approaches in that earlier methods use only the output of the teacher's last layer for distillation, whereas their student patiently learns from multiple intermediate layers of the teacher for incremental knowledge extraction, following two strategies: PKD-Last, where the student learns from the teacher's last k layers, and PKD-Skip, where the student learns from every k-th layer of the teacher.

They experimented on several datasets across different NLP tasks, demonstrating that the proposed PKD approach achieves better performance and generalization than standard distillation methods (Hinton et al., 2015).

Why not learn from all the hidden states of the teacher model?

The reason is that it can be computationally very expensive and can introduce noise into the student model.
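Below is a hedged sketch of what learning from a subset of intermediate layers can look like: the student's [CLS] hidden states are matched against every other teacher layer after L2 normalization. The layer mapping and tensors are illustrative stand-ins, not the paper's exact setup:

```python
# Hedged sketch of a patient-distillation term: normalized [CLS] hidden
# states of the student are matched (via MSE) against a chosen subset of the
# teacher's intermediate layers, rather than all of them.
import torch
import torch.nn.functional as F

batch, hidden = 8, 768
teacher_cls = [torch.randn(batch, hidden) for _ in range(12)]  # 12 teacher layers
student_cls = [torch.randn(batch, hidden) for _ in range(6)]   # 6 student layers

teacher_layer_ids = [1, 3, 5, 7, 9, 11]  # "skip" mapping: every other teacher layer

loss_pt = 0.0
for s, t_id in zip(student_cls, teacher_layer_ids):
    t = teacher_cls[t_id]
    loss_pt = loss_pt + F.mse_loss(F.normalize(s, dim=-1), F.normalize(t, dim=-1))

print(loss_pt)
```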

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

They propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic; that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. MobileBERT is a thin version of BERTʟᴀʀɢᴇ, equipped with bottleneck structures and a carefully designed balance between self-attention and feed-forward networks.

Step one: first train a specially designed teacher model, an inverted-bottleneck BERTʟᴀʀɢᴇ model (IB-BERT).

Step two: conduct knowledge transfer from this teacher to MobileBERT.

  Architecture visualization of transformer blocks within (a) BERT, (b) the MobileBERT teacher, and (c) the MobileBERT student. The green trapezoids marked with “Linear” are the bottlenecks. Source  

(a) BERT; (b) Inverted-Bottleneck BERT (IB-BERT); and (c) MobileBERT. In (b) and (c), red lines denote inter-block flows while blue lines denote intra-block flows. MobileBERT is trained by imitating IB-BERT layer by layer.
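To make the layer-to-layer imitation concrete, here is a hedged sketch of matching one block's feature maps (MSE) and attention distributions (KL divergence); the shapes and tensors are illustrative stand-ins rather than MobileBERT's exact losses:

```python
# Hedged sketch of layer-to-layer imitation: for each block, the student
# matches the teacher's feature maps (MSE) and attention distributions (KL).
# The bottleneck design keeps both feature maps at the same width, so no
# projection is needed here.
import torch
import torch.nn.functional as F

batch, seq, width, heads = 2, 16, 512, 4

feat_teacher = torch.randn(batch, seq, width)
feat_student = torch.randn(batch, seq, width)
attn_teacher = torch.randn(batch, heads, seq, seq).softmax(dim=-1)
attn_student_logits = torch.randn(batch, heads, seq, seq)

# Feature map transfer: mean-squared error between block outputs.
loss_fmt = F.mse_loss(feat_student, feat_teacher)

# Attention transfer: KL divergence between attention distributions.
loss_at = F.kl_div(F.log_softmax(attn_student_logits, dim=-1),
                   attn_teacher, reduction="batchmean")

layer_loss = loss_fmt + loss_at
```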

If you have made it this far, you deserve a high-five. MobileBERT introduces bottlenecks into the transformer blocks, which lets knowledge flow from the larger teacher into the smaller student more smoothly. This approach decreases the width rather than the depth of the student, which is known to produce a more capable model, and the experiments bear this out. MobileBERT also underlines that it is possible to build a student model that can still be fine-tuned after the initial distillation.

Moreover, the outcomes also indicate that this holds true in practice, as MobileBERT can attain 99.2% of BERT-base’s performance on GLUE with 4x fewer parameters and 5.5x faster inference on a Pixel 4 phone!

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding

The key focus of the paper is as follows:

The architecture of the MT-DNN model for representation learning (Liu et al., 2019). The lower layers are shared across all tasks, while the top layers are task-specific. The input X (either a sentence or a set of sentences) is first represented as a sequence of embedding vectors, one for each word, in l1. Then the Transformer encoder captures the contextual information for each word and generates the shared contextual embedding vectors in l2. Finally, additional task-specific layers generate task-specific representations for each task, followed by operations necessary for classification, similarity scoring, or relevance ranking. Source

Process of knowledge distillation for multi-task learning. A set of tasks with task-specific labeled training data is picked. Then, for each task, an ensemble of different neural nets (the teacher) is trained. The teacher is used to generate a set of soft targets for each task-specific training sample. Given the soft targets of the training datasets across multiple tasks, a single MT-DNN (the student) is trained using multi-task learning and backpropagation as described in Algorithm 1 of the paper, except that if task t has a teacher, the task-specific loss in Line 3 is the average of two objective functions: one for the correct targets and one for the soft targets assigned by the teacher. Source

Achievements: On the GLUE datasets, the distilled MT-DNN sets a new state-of-the-art result on 7 out of 9 NLU tasks, including the tasks with no teacher, pushing the GLUE benchmark (single model) to 83.7%.

The authors show that the distilled MT-DNN retains nearly all of the improvements achieved by the ensemble models while keeping the model size the same as the vanilla MT-DNN model.
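Here is a hedged sketch of the per-task loss described above for a task that has a teacher; the tensors are random stand-ins for real model outputs:

```python
# Hedged sketch of the per-task loss during multi-task distillation: when a
# task has an ensemble teacher, the task loss is the average of the usual
# hard-label cross-entropy and a cross-entropy against the teacher's soft
# targets; tasks without a teacher keep only the hard-label loss.
import torch
import torch.nn.functional as F

batch, classes = 16, 3
student_logits = torch.randn(batch, classes)
hard_labels = torch.randint(0, classes, (batch,))
teacher_soft = torch.randn(batch, classes).softmax(dim=-1)  # ensemble soft targets

loss_hard = F.cross_entropy(student_logits, hard_labels)
loss_soft = -(teacher_soft * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()

task_loss = 0.5 * (loss_hard + loss_soft)
```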

Contemporary state-of-the-art NLP models are difficult to deploy in production. Knowledge distillation offers tools for tackling this and several other issues, but it has its quirks.

References

Hinton, Vinyals, and Dean, “Distilling the Knowledge in a Neural Network,” arXiv, 2015.

Devlin, Chang, Lee, and Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv, 2019.

Sanh, Debut, Chaumond, and Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” arXiv, 2019.

Chen, Gan, Cheng, Liu, and Liu, “Distilling Knowledge Learned in BERT for Text Generation,” arXiv, 2020.

Jiao et al., “TinyBERT: Distilling BERT for Natural Language Understanding,” arXiv, 2020.

Liu et al., “FastBERT: a Self-distilling BERT with Adaptive Inference Time,” arXiv, 2020.

Tang et al., “Distilling Task-Specific Knowledge from BERT into Simple Neural Networks,” arXiv, 2019.

Sun, Cheng, Gan, and Liu, “Patient Knowledge Distillation for BERT Model Compression,” arXiv, 2019.

Sun et al., “MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices,” arXiv, 2020.

Liu, He, Chen, and Gao, “Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding,” arXiv, 2019.

"Opinions expressed here are of Mr. Abhishek, not his employer"

Kumar Abhishek is a Machine Learning Engineer at Expedia, working in the domain of fraud detection and prevention. He uses machine learning and natural language processing models for risk analysis and fraud detection. He has more than a decade of machine learning and software engineering experience.
