Ritvik Rastogi

Aug 9, 2024

Encoder-Only Language Transformers

DeBERTaV3: Enhances the DeBERTa architecture by introducing replaced token detection (RTD, sketched after this list) in place of masked language modeling (MLM), along with a novel gradient-disentangled embedding sharing method, exhibiting superior performance across various natural language understanding tasks.
DeBERTa: Enhances BERT and RoBERTa through disentangled attention (see the attention sketch below), an enhanced mask decoder, and virtual adversarial training.
MobileBERT: A compressed and faster version of BERT, featuring bottleneck structures, optimized attention mechanisms, and knowledge transfer.
FastBERT: A speed-tunable encoder with adaptive inference time, using classifier branches at each transformer layer to enable early exits (an early-exit loop is sketched below).
DistilBERT: Distills BERT on very large batches by leveraging gradient accumulation, using dynamic masking and dropping the next sentence prediction objective (a distillation-loss sketch follows).
ALBERT: Presents parameter-reduction techniques (factorized embedding parameterization and cross-layer parameter sharing, both sketched below) to lower memory consumption and increase the training speed of BERT.
Sentence-BERT: A modification of BERT that uses siamese and triplet network structures to derive sentence embeddings that can be compared using cosine similarity (see the final sketch below).
RoBERTa: Builds upon BERT by carefully optimizing hyperparameters and training data size to improve performance on various language tasks.
BERT: Introduced pre-training for encoder-only Transformers, using a unified architecture across different downstream tasks.
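
The sketches that follow are simplified illustrations of the mechanisms named above, not the papers' reference implementations. First, replaced token detection as described for DeBERTaV3: a small generator fills in masked positions and a discriminator learns to flag which tokens were replaced. The `generator` and `discriminator` callables, the mask token id, and the 50x loss weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rtd_step(generator, discriminator, input_ids, mask_prob=0.15, mask_token_id=103):
    """One illustrative replaced-token-detection training step.

    generator(ids)     -> vocabulary logits, shape (B, T, V)
    discriminator(ids) -> per-token "replaced?" logits, shape (B, T)
    """
    # 1. Mask a random subset of positions.
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    masked_ids = input_ids.masked_fill(mask, mask_token_id)

    # 2. The generator is trained with ordinary MLM on the masked positions.
    gen_logits = generator(masked_ids)                           # (B, T, V)
    gen_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])

    # 3. Build a corrupted sequence from the generator's predictions.
    with torch.no_grad():
        corrupted = torch.where(mask, gen_logits.argmax(-1), input_ids)

    # 4. The discriminator predicts, for every token, whether it was replaced.
    labels = (corrupted != input_ids).float()
    disc_loss = F.binary_cross_entropy_with_logits(discriminator(corrupted), labels)

    return gen_loss + 50.0 * disc_loss    # 50x weighting follows ELECTRA; an assumption here
```

DeBERTaV3 additionally shares the generator's and discriminator's token embeddings through gradient-disentangled embedding sharing, which stops the RTD loss from back-propagating into the generator's embeddings; that detail is omitted in this sketch.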
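
DeBERTa's disentangled attention scores each query-key pair as the sum of content-to-content, content-to-position, and position-to-content terms computed from separate content and relative-position projections. The single-head toy function below follows that decomposition; the tensor names, the relative-distance bucketing, and the absence of batching and multiple heads are my simplifications.

```python
import torch

def disentangled_scores(Hq, Hk, Qr, Kr, rel):
    """Toy single-head DeBERTa-style attention scores.

    Hq, Hk : content query / key projections, shape (T, d)
    Qr, Kr : relative-position query / key projections, shape (R, d),
             one row per relative-distance bucket
    rel    : (T, T) long tensor, rel[i, j] = bucket index of delta(i, j)
    """
    c2c = Hq @ Hk.T                               # content-to-content
    c2p = torch.gather(Hq @ Kr.T, 1, rel)         # content-to-position, uses delta(i, j)
    p2c = torch.gather(Hk @ Qr.T, 1, rel).T       # position-to-content, uses delta(j, i)
    return (c2c + c2p + p2c) / (3 * Hq.shape[-1]) ** 0.5   # scaled by sqrt(3d): three terms are summed

T, d, R = 4, 8, 7   # toy sizes
scores = disentangled_scores(torch.randn(T, d), torch.randn(T, d),
                             torch.randn(R, d), torch.randn(R, d),
                             torch.randint(0, R, (T, T)))
attention = torch.softmax(scores, dim=-1)
```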
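
FastBERT's adaptive inference can be pictured as the loop below: a lightweight classifier branch sits on every encoder layer, and the forward pass stops at the first branch whose prediction entropy falls below a speed threshold. The `layers` and `branches` callables and the threshold value are placeholders, and the self-distillation used to train the branches is not shown.

```python
import torch

def adaptive_inference(layers, branches, x, speed=0.3):
    """Early-exit forward pass: stop at the first sufficiently confident branch.

    layers   : list of encoder layers, each mapping (T, d) -> (T, d)
    branches : list of small classifiers, each mapping (T, d) -> (num_classes,) logits
    speed    : entropy threshold; raising it makes the model exit earlier (faster, less accurate)
    """
    probs = None
    for layer, branch in zip(layers, branches):
        x = layer(x)
        probs = torch.softmax(branch(x), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
        if entropy < speed:                # confident enough: return early
            return probs
    return probs                           # otherwise use the deepest prediction
```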
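
DistilBERT is trained with knowledge distillation: the student matches the teacher's temperature-softened output distribution while still predicting the true masked tokens. The loss below shows only that combination; the temperature, the mixing weight, and the omission of DistilBERT's additional cosine-embedding loss are simplifications.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target plus hard-target loss in the spirit of DistilBERT (values are illustrative).

    student_logits, teacher_logits : (N, V) logits at the masked positions
    labels                         : (N,) ground-truth token ids at those positions
    """
    # KL divergence between temperature-softened teacher and student distributions
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Ordinary masked-language-modelling loss against the true tokens
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```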
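
ALBERT's two parameter-reduction techniques are factorized embedding parameterization (token embeddings live in a small space and are projected up to the hidden size) and cross-layer parameter sharing (one set of encoder weights is reused at every layer). The sketch below expresses both ideas with generic PyTorch modules; the sizes and the use of nn.TransformerEncoderLayer are stand-ins, not ALBERT's implementation.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Factorized embeddings: a V x E table followed by an E x H projection, with E << H.

    Parameters drop from V*H (30k * 768 ≈ 23M) to V*E + E*H (30k * 128 + 128 * 768 ≈ 3.9M).
    """
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)   # V x E
        self.project = nn.Linear(embed_dim, hidden_dim)       # E x H

    def forward(self, input_ids):
        return self.project(self.word_emb(input_ids))

# Cross-layer parameter sharing: apply the same layer object at every depth.
shared_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)

def shared_encoder(x, num_layers=12):
    for _ in range(num_layers):
        x = shared_layer(x)          # identical weights reused in every iteration
    return x
```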
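
Sentence-BERT's siamese setup runs both sentences through the same encoder, pools token embeddings into fixed-size sentence vectors, and compares the vectors with cosine similarity. In the sketch below `encoder` is a stand-in for a pretrained BERT that returns token embeddings of shape (B, T, d); the mean pooling and the direct cosine scoring are the simplest of the configurations described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseSentenceEncoder(nn.Module):
    """Minimal Sentence-BERT-style scorer: shared encoder, mean pooling, cosine similarity."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder                     # the same ("siamese") weights encode both sentences

    def pool(self, token_embeddings, attention_mask):
        # Mean-pool over non-padding tokens to get one vector per sentence.
        mask = attention_mask.unsqueeze(-1).float()
        return (token_embeddings * mask).sum(1) / mask.sum(1).clamp_min(1e-9)

    def forward(self, tokens_a, mask_a, tokens_b, mask_b):
        u = self.pool(self.encoder(tokens_a), mask_a)   # (B, d)
        v = self.pool(self.encoder(tokens_b), mask_b)   # (B, d)
        return F.cosine_similarity(u, v, dim=-1)        # similarity score per sentence pair
```

For classification training the paper instead feeds the concatenation (u, v, |u - v|) into a softmax layer; the cosine score shown here is what is used at inference and for regression objectives.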