Papers Explained 05: TinyBERT

Knowledge Distillation (KD) aims to transfer the knowledge of a large teacher network T to a small student network S. Let f^T and f^S represent the behavior functions of the teacher and student networks, respectively.

In the context of Transformer distillation, the output of the MHA (multi-head attention) layer or the FFN (feed-forward network) layer, or some intermediate representation (such as the attention matrix A), can be used as the behavior function. Formally, KD can be modeled as minimizing the following objective function:

L_{KD} = \sum_{x \in X} L(f^S(x), f^T(x))

where L(·) is a loss function that evaluates the difference between the teacher and student networks, x is the text input, and X denotes the training dataset.
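To make the objective concrete, here is a minimal PyTorch sketch that treats the behavior functions as arbitrary callables and, purely for illustration, uses mean-squared error as the loss L(·); the function and variable names are mine, not the paper's.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of the generic KD objective, assuming f_S and f_T are
# callables returning comparable tensors (e.g. a chosen layer's output) and
# using MSE as an illustrative choice of L(.).
def kd_objective(f_S, f_T, dataset):
    total = 0.0
    for x in dataset:
        total = total + F.mse_loss(f_S(x), f_T(x))
    return total

if __name__ == "__main__":
    # Toy usage with random linear maps standing in for real networks.
    torch.manual_seed(0)
    W_s, W_t = torch.randn(16, 8), torch.randn(16, 8)
    f_S = lambda x: x @ W_s
    f_T = lambda x: x @ W_t
    data = [torch.randn(4, 16) for _ in range(3)]
    print(kd_objective(f_S, f_T, data))
```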
Assuming that the student model has M Transformer layers and the teacher model has N Transformer layers, we start by choosing M out of the N layers of the teacher model for the Transformer-layer distillation. A function n = g(m) is then defined as the mapping from student layer indices to teacher layer indices, i.e. the m-th student layer learns from the g(m)-th teacher layer.
Thus, the student can acquire knowledge from the teacher by minimizing the following objective:

L_{model} = \sum_{x \in X} \sum_{m=0}^{M+1} \lambda_m L_{layer}(f_m^S(x), f_{g(m)}^T(x))

where L_layer refers to the loss function of a given model layer, and λ_m is a hyperparameter that represents the importance of the m-th layer's distillation.
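A hedged sketch of this layer-wise objective is shown below, assuming the per-layer behaviors are already available as lists of tensors and using MSE with uniform weights λ_m = 1 as an illustrative choice; the embedding (m = 0) and prediction (m = M + 1) terms are covered in later sections and omitted here.

```python
import torch
import torch.nn.functional as F

# A sketch of the layer-wise objective: student_layers[m] is compared to
# teacher_layers[g(m)] with MSE and uniform weights. Names are illustrative.
def layer_distillation_loss(student_layers, teacher_layers, g, lambdas=None):
    if lambdas is None:
        lambdas = [1.0] * len(student_layers)
    loss = 0.0
    for m, s_out in enumerate(student_layers):
        loss = loss + lambdas[m] * F.mse_loss(s_out, teacher_layers[g(m)])
    return loss

if __name__ == "__main__":
    torch.manual_seed(0)
    seq, d = 8, 16
    student = [torch.randn(seq, d) for _ in range(4)]   # 4-layer student
    teacher = [torch.randn(seq, d) for _ in range(12)]  # 12-layer teacher
    # 0-based version of g(m) = 3m: student layer m+1 distills from teacher layer 3(m+1).
    g = lambda m: 3 * (m + 1) - 1
    print(layer_distillation_loss(student, teacher, g))
```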
Transformer Layer Distillation
The Transformer-layer distillation consists of attention-based distillation and hidden-states-based distillation. The attention-based distillation objective is:

L_{attn} = \frac{1}{h} \sum_{i=1}^{h} MSE(A_i^S, A_i^T)

where h is the number of attention heads and A_i refers to the attention matrix corresponding to the i-th head.

The hidden-states-based distillation objective is:

L_{hidn} = MSE(H^S W_h, H^T)

where the matrices H^S and H^T refer to the hidden states of the student and teacher networks, respectively. The matrix W_h is a learnable linear transformation which transforms the hidden states of the student network into the same space as the teacher network's states.
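The following sketch implements both Transformer-layer losses as described above, modelling the learnable W_h with a bias-free nn.Linear; the tensor shapes and the dimensions (312 for the student, 768 for the teacher) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Attention-based distillation: mean of per-head MSE between student and
# teacher attention matrices of shape (heads, seq, seq).
def attention_loss(attn_S, attn_T):
    h = attn_S.shape[0]
    return sum(F.mse_loss(attn_S[i], attn_T[i]) for i in range(h)) / h

# Hidden-states-based distillation: project student hidden states into the
# teacher's space with a learnable W_h, then take the MSE.
class HiddenLoss(nn.Module):
    def __init__(self, d_student, d_teacher):
        super().__init__()
        self.W_h = nn.Linear(d_student, d_teacher, bias=False)

    def forward(self, H_S, H_T):
        return F.mse_loss(self.W_h(H_S), H_T)

if __name__ == "__main__":
    torch.manual_seed(0)
    heads, seq = 12, 8
    attn_S, attn_T = torch.rand(heads, seq, seq), torch.rand(heads, seq, seq)
    H_S, H_T = torch.randn(seq, 312), torch.randn(seq, 768)
    print(attention_loss(attn_S, attn_T), HiddenLoss(312, 768)(H_S, H_T))
```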
Embedding Layer Distillation

L_{embd} = MSE(E^S W_e, E^T)

where the matrices E^S and E^T refer to the embeddings of the student and teacher networks, respectively. The matrix W_e is a linear transformation playing a similar role as W_h.
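A minimal sketch of the embedding-layer loss, again modelling W_e with a bias-free nn.Linear; the embedding dimensions below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Embedding-layer distillation: project student embeddings into the teacher's
# embedding space with a learnable W_e, then take the MSE.
class EmbeddingLoss(nn.Module):
    def __init__(self, d_student, d_teacher):
        super().__init__()
        self.W_e = nn.Linear(d_student, d_teacher, bias=False)

    def forward(self, E_S, E_T):
        return F.mse_loss(self.W_e(E_S), E_T)

if __name__ == "__main__":
    torch.manual_seed(0)
    E_S, E_T = torch.randn(8, 312), torch.randn(8, 768)
    print(EmbeddingLoss(312, 768)(E_S, E_T))
```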
Prediction Layer Distillation

L_{pred} = CE(z^T / t, z^S / t)

where z^T and z^S are the logits vectors predicted by the teacher and student respectively, CE is the cross-entropy loss, and t is the temperature value. In the experiments, t = 1 was found to perform well.
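A sketch of the prediction-layer loss as a soft cross-entropy between temperature-scaled logits; the function name and batch shapes are assumptions.

```python
import torch
import torch.nn.functional as F

# Prediction-layer distillation: cross-entropy between the teacher's softened
# distribution and the student's log-probabilities, both scaled by temperature t.
def prediction_loss(z_T, z_S, t=1.0):
    teacher_probs = F.softmax(z_T / t, dim=-1)
    student_log_probs = F.log_softmax(z_S / t, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    z_T, z_S = torch.randn(4, 3), torch.randn(4, 3)
    print(prediction_loss(z_T, z_S))
```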
Unified Distillation Loss
Using the above distillation objectives, we can unify the distillation loss of the corresponding layers between the teacher and the student network:

L_{layer} =
  L_{embd},             m = 0
  L_{hidn} + L_{attn},  M ≥ m > 0
  L_{pred},             m = M + 1

where m = 0 corresponds to the embedding layer and m = M + 1 to the prediction layer.
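The unified loss amounts to a simple dispatch over the layer index; the sketch below assumes the per-layer loss values have already been computed by the functions above.

```python
# A sketch of the unified per-layer loss for a student with M Transformer
# layers, where index 0 is the embedding layer and M + 1 the prediction layer.
def unified_layer_loss(m, M, embd_loss, hidn_loss, attn_loss, pred_loss):
    if m == 0:
        return embd_loss
    if 0 < m <= M:
        return hidn_loss + attn_loss
    if m == M + 1:
        return pred_loss
    raise ValueError("layer index out of range")

if __name__ == "__main__":
    # Toy scalar losses for a 4-layer student (M = 4).
    print([unified_layer_loss(m, 4, 0.1, 0.2, 0.3, 0.4) for m in range(6)])
```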
TinyBERT Learning
TinyBERT proposes a novel two-stage learning framework consisting of general distillation and task-specific distillation.
In general distillation, the original pre-trained BERT (without fine-tuning) acts as the teacher, which helps TinyBERT learn the rich knowledge embedded in pre-trained BERT and plays an important role in improving its generalization capability. In task-specific distillation, the fine-tuned BERT serves as the teacher and further teaches TinyBERT task-specific knowledge.
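The two-stage procedure can be summarised as the sketch below; distill(...) is a hypothetical placeholder for a training loop that minimizes the unified distillation loss, not an API from the paper or any library.

```python
# A high-level sketch of the two-stage learning framework; all arguments are
# hypothetical placeholders used only to show the data flow between stages.
def train_tinybert(student, pretrained_bert, finetuned_bert,
                   general_corpus, task_dataset, distill):
    # Stage 1: general distillation on a large-scale corpus, with the
    # pre-trained (not fine-tuned) BERT as the teacher.
    general_tinybert = distill(student, pretrained_bert, general_corpus)
    # Stage 2: task-specific distillation, with the fine-tuned BERT as the
    # teacher, starting from the generally distilled student.
    return distill(general_tinybert, finetuned_bert, task_dataset)
```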
TinyBERT Settings
TinyBERT4
- Student: TinyBERT4 (M=4, d=312, d’=1200, h=12) has a total of 14.5M parameters
- Teacher: BERT-Base (M=12, d=768, d’=3072, h=12) has a total of 109M parameters
- g(m) = 3m, λ = 1
TinyBERT6
- Student: TinyBERT6 (M=6, d=768, d’=3072, h=12) has a total of 67M parameters
- Teacher: BERT-Base (M=12, d=768, d’=3072, h=12) has a total of 109M parameters
- g(m) = 2m, λ = 1
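As a quick sanity check of the two mapping functions, the snippet below lists which teacher layers each student layer distills from (using 1-based layer indices, as in the paper).

```python
# Uniform layer mapping: student layer m distills from teacher layer stride * m.
def mapped_teacher_layers(M, stride):
    return [stride * m for m in range(1, M + 1)]

if __name__ == "__main__":
    print("TinyBERT4 ->", mapped_teacher_layers(4, 3))  # [3, 6, 9, 12]
    print("TinyBERT6 ->", mapped_teacher_layers(6, 2))  # [2, 4, 6, 8, 10, 12]
```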
Results
On the GLUE benchmark, TinyBERT4 retains more than 96.8% of the performance of its teacher BERT-Base while being 7.5x smaller and 9.4x faster at inference.
Paper
TinyBERT: Distilling BERT for Natural Language Understanding (arXiv:1909.10351)