Reorganizes tokens using layout information, combines text and visual embeddings, and uses a multi-modal transformer with spatial-aware disentangled attention.
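As a rough sketch of what "spatial-aware disentangled attention" can look like, the snippet below adds content-to-relative-position terms (for the 1-D reading order and the 2-D x/y coordinates) to the usual content-to-content attention logits. The function name, the bucketing of distances, and all sizes are illustrative assumptions, not the exact formulation from the paper.

```python
import torch
import torch.nn as nn

def disentangled_scores(q, k, pos_1d, xs, ys, rel_1d, rel_x, rel_y, max_rel=32):
    d = q.size(-1)
    content = q @ k.transpose(-1, -2)  # content-to-content term

    def rel_term(coords, table):
        # bucket signed pairwise distances into [0, 2 * max_rel]
        delta = (coords[:, None] - coords[None, :]).clamp(-max_rel, max_rel) + max_rel
        # content-to-position term: query content attends to relative-position embeddings
        return torch.einsum("id,ijd->ij", q, table(delta))

    scores = content + rel_term(pos_1d, rel_1d) + rel_term(xs, rel_x) + rel_term(ys, rel_y)
    return scores / d ** 0.5

seq, dim = 6, 64
rel_1d, rel_x, rel_y = (nn.Embedding(2 * 32 + 1, dim) for _ in range(3))
q, k = torch.randn(seq, dim), torch.randn(seq, dim)
pos_1d = torch.arange(seq)
xs = torch.randint(0, 100, (seq,))
ys = torch.randint(0, 100, (seq,))
print(disentangled_scores(q, k, pos_1d, xs, ys, rel_1d, rel_x, rel_y).shape)  # (6, 6)
```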
A unified text-image multimodal Transformer that learns cross-modal representations by taking the concatenation of text embeddings and image embeddings as its input.
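A minimal sketch of that input scheme, assuming linearly projected image patches and arbitrary sizes (vocabulary, hidden width, patch count, and layer count are all made up for illustration): text tokens and image patches are mapped to a shared width, concatenated along the sequence axis, and encoded jointly by one Transformer.

```python
import torch
import torch.nn as nn

hidden, text_len, num_patches, patch_dim = 256, 16, 49, 768  # e.g. flattened 16x16x3 patches

text_emb = nn.Embedding(30522, hidden)      # word-piece vocabulary (assumed size)
patch_proj = nn.Linear(patch_dim, hidden)   # linear projection of image patches
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
    num_layers=2,
)

token_ids = torch.randint(0, 30522, (1, text_len))
patch_feats = torch.randn(1, num_patches, patch_dim)

# concatenate text and image embeddings into one sequence and encode them together
sequence = torch.cat([text_emb(token_ids), patch_proj(patch_feats)], dim=1)
out = encoder(sequence)
print(out.shape)  # torch.Size([1, 65, 256])
```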
Introduces a bi-directional attention complementation mechanism (BiACM) to enable cross-modal interaction between text and layout.
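The sketch below illustrates the general idea under simplifying assumptions: two parallel attention flows (text and layout) compute their own pre-softmax scores and each flow is complemented with the other's scores before the softmax. Detaching the text-flow scores passed to the layout flow is an assumption about keeping the textual stream stable; shapes and names are illustrative.

```python
import torch

def biacm_attention(q_t, k_t, v_t, q_l, k_l, v_l):
    # pre-softmax scores of each flow
    scores_t = q_t @ k_t.transpose(-1, -2) / q_t.size(-1) ** 0.5   # text flow
    scores_l = q_l @ k_l.transpose(-1, -2) / q_l.size(-1) ** 0.5   # layout flow
    # each flow is complemented with the other's scores (the cross-modal interaction)
    attn_t = torch.softmax(scores_t + scores_l, dim=-1)
    attn_l = torch.softmax(scores_l + scores_t.detach(), dim=-1)
    return attn_t @ v_t, attn_l @ v_l

seq, d_t, d_l = 8, 64, 32
q_t, k_t, v_t = (torch.randn(seq, d_t) for _ in range(3))
q_l, k_l, v_l = (torch.randn(seq, d_l) for _ in range(3))
text_out, layout_out = biacm_attention(q_t, k_t, v_t, q_l, k_l, v_l)
print(text_out.shape, layout_out.shape)  # torch.Size([8, 64]) torch.Size([8, 32])
```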
Built upon BERT, encodes the relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy.
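One simplified, assumed reading of "area masking" is sketched below: sample a random rectangle covering a fixed fraction of the page and mask every token whose box centre falls inside it. The page size, area fraction, and centre-based test are illustrative choices, not the paper's exact procedure.

```python
import torch

def area_mask(boxes, page_w=1000, page_h=1000, area_frac=0.15):
    """Mask tokens whose box centre falls inside a randomly placed rectangle
    covering roughly `area_frac` of the page (a simplified, assumed scheme)."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w, h = page_w * area_frac ** 0.5, page_h * area_frac ** 0.5
    x0 = torch.rand(()) * (page_w - w)
    y0 = torch.rand(()) * (page_h - h)
    return (cx >= x0) & (cx <= x0 + w) & (cy >= y0) & (cy <= y0 + h)

boxes = torch.randint(0, 1000, (20, 4)).float()   # (x0, y0, x1, y1) per token, toy values
print(area_mask(boxes))                           # boolean mask over the 20 tokens
```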
An encoder-only transformer with a CNN backbone for visual feature extraction; combines text, vision, and spatial features through a multi-modal self-attention layer.
Utilises BERT as the backbone and feeds text, 1D position, and cell-level 2D position embeddings to the transformer model.
Uses a multi-modal Transformer model to integrate text, layout, and image in the pre-training stage and learn end-to-end cross-modal interaction.
Utilises RoBERTa as the backbone and adds layout embeddings along with a relative attention bias.
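A relative bias of this kind can be pictured as a per-head scalar looked up from the horizontal and vertical distances between token boxes and added to the query-key logits. The sketch below is an assumed bucketing scheme, not the paper's exact one; names and sizes are illustrative.

```python
import torch
import torch.nn as nn

def relative_layout_bias(xs, ys, bias_x, bias_y, num_buckets=32, max_dist=1000):
    """Per-head scalar bias from bucketed horizontal/vertical distances between tokens,
    meant to be added to query-key attention logits (bucketing scheme is an assumption)."""
    def bucket(delta):
        # map signed distances in [-max_dist, max_dist] linearly onto [0, num_buckets)
        return ((delta.clamp(-max_dist, max_dist) + max_dist) * (num_buckets - 1)) // (2 * max_dist)

    dx = xs[:, None] - xs[None, :]
    dy = ys[:, None] - ys[None, :]
    # look up a bias per head and move heads to the front: (heads, seq, seq)
    return (bias_x(bucket(dx)) + bias_y(bucket(dy))).permute(2, 0, 1)

heads, seq = 8, 10
bias_x, bias_y = nn.Embedding(32, heads), nn.Embedding(32, heads)
xs = torch.randint(0, 1000, (seq,))
ys = torch.randint(0, 1000, (seq,))
print(relative_layout_bias(xs, ys, bias_x, bias_y).shape)  # torch.Size([8, 10, 10])
```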
Utilises BERT as the backbone and adds two new input embeddings: a 2-D position embedding and an image embedding (the latter only for downstream tasks).
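Several of the items above share this input recipe: token embeddings are summed with 1-D position embeddings and 2-D embeddings of the token's bounding-box coordinates before entering the encoder. A minimal sketch of that pattern, with assumed vocabulary size, coordinate range, and class name (the image embedding is omitted for brevity):

```python
import torch
import torch.nn as nn

class LayoutTextEmbedding(nn.Module):
    """Sketch of the shared input pattern: token, 1-D position, and 2-D box-corner
    embeddings are summed into one vector per token (sizes are illustrative)."""

    def __init__(self, vocab_size=30522, hidden=768, max_pos=512, max_coord=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos_1d = nn.Embedding(max_pos, hidden)
        self.x_emb = nn.Embedding(max_coord, hidden)   # shared for x0 and x1
        self.y_emb = nn.Embedding(max_coord, hidden)   # shared for y0 and y1
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, boxes):
        # token_ids: (batch, seq); boxes: (batch, seq, 4) with integer coords in [0, max_coord)
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        emb = (
            self.tok(token_ids)
            + self.pos_1d(positions)
            + self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])
            + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3])
        )
        return self.norm(emb)

tokens = torch.randint(0, 30522, (1, 8))
boxes = torch.randint(0, 1024, (1, 8, 4))
print(LayoutTextEmbedding()(tokens, boxes).shape)  # torch.Size([1, 8, 768])
```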