Ritvik Rastogi

Nov 6, 2024

Reorganizes tokens using layout information, combines text and visual embeddings, and utilizes a multi-modal Transformer with spatial-aware disentangled attention.
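
A minimal sketch of what spatial-aware disentangled attention can look like, assuming a DeBERTa-style score decomposition: content-to-content attention scores are complemented by content-to-position terms for both the 1-D (reading-order) and 2-D (layout) positions. All tensor shapes and the scaling factor are illustrative, not the model's exact parameterization.

```python
import torch

d = 64
q_content = torch.randn(1, 16, d)    # content queries: (batch, seq, d)
k_content = torch.randn(1, 16, d)    # content keys
k_pos_1d = torch.randn(1, 16, d)     # 1-D (reading-order) position keys
k_pos_2d = torch.randn(1, 16, d)     # 2-D (layout/box) position keys

c2c = q_content @ k_content.transpose(-1, -2)  # content-to-content
c2p = q_content @ k_pos_1d.transpose(-1, -2)   # content-to-1D-position
c2s = q_content @ k_pos_2d.transpose(-1, -2)   # content-to-spatial-position
weights = ((c2c + c2p + c2s) / (3 * d) ** 0.5).softmax(-1)
```
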
A unified text-image multimodal Transformer that learns cross-modal representations by taking the concatenation of text embeddings and image embeddings as input.
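
A minimal sketch of the unified input, with illustrative sizes: image patches and text tokens are projected to the same width, concatenated into one sequence, and fed to a single Transformer encoder.

```python
import torch
import torch.nn as nn

hidden = 256
text_emb = nn.Embedding(30522, hidden)       # vocab size is illustrative
patch_proj = nn.Linear(768, hidden)          # e.g. flattened 16x16x3 patches
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
    num_layers=2,
)

tokens = torch.randint(0, 30522, (1, 128))   # (batch, text_len)
patches = torch.randn(1, 49, 768)            # (batch, num_patches, patch_dim)
x = torch.cat([text_emb(tokens), patch_proj(patches)], dim=1)
out = encoder(x)                             # (1, 128 + 49, hidden)
```
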
Introduces a bi-directional attention complementation mechanism (BiACM) to enable cross-modal interaction between text and layout.
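
A minimal sketch of the bi-directional complementation idea: text and layout streams each compute their own attention scores, and each stream's pre-softmax scores are augmented with the other's. The detach on the text scores fed to the layout stream is an assumed pre-training detail; shapes and scaling are illustrative.

```python
import torch

def biacm(scores_text, scores_layout, detach_text=True):
    # Each stream's pre-softmax scores are complemented by the other's;
    # optionally detach the text scores fed to the layout stream so layout
    # gradients do not flow back through the text stream (assumed setup).
    t = scores_text.detach() if detach_text else scores_text
    return scores_text + scores_layout, scores_layout + t

d = 64
q_t, k_t = torch.randn(1, 16, d), torch.randn(1, 16, d)  # text stream
q_l, k_l = torch.randn(1, 16, d), torch.randn(1, 16, d)  # layout stream
s_t = q_t @ k_t.transpose(-1, -2) / d ** 0.5
s_l = q_l @ k_l.transpose(-1, -2) / d ** 0.5
attn_text, attn_layout = (s.softmax(-1) for s in biacm(s_t, s_l))
```
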
Built upon BERT, encodes the relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy.
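
A minimal sketch of one way to encode relative 2-D positions, assuming a small MLP that maps the pairwise (dx, dy) between token box centers to an additive attention bias; everything here is illustrative rather than the model's exact parameterization.

```python
import torch
import torch.nn as nn

d = 64
bias_mlp = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

q = torch.randn(1, 16, d)
k = torch.randn(1, 16, d)
centers = torch.rand(1, 16, 2)                     # normalized box centers
rel = centers.unsqueeze(2) - centers.unsqueeze(1)  # (1, 16, 16, 2) pairwise offsets
bias = bias_mlp(rel).squeeze(-1)                   # (1, 16, 16) relative 2-D bias
weights = (q @ k.transpose(-1, -2) / d ** 0.5 + bias).softmax(-1)
```
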
An encoder-only Transformer with a CNN backbone for visual feature extraction; combines text, vision, and spatial features through a multi-modal self-attention layer.
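
A minimal sketch of the fusion pattern, with an illustrative toy backbone: a small CNN turns the page image into a grid of visual features, text tokens carry their box coordinates as spatial features, and one self-attention layer runs over the joint sequence.

```python
import torch
import torch.nn as nn

hidden = 128
cnn = nn.Sequential(                           # toy visual backbone
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, hidden, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((7, 7)),
)
text_emb = nn.Embedding(30522, hidden)
box_emb = nn.Linear(4, hidden)                 # spatial features from boxes
attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

page = torch.randn(1, 3, 224, 224)             # page image
vis = cnn(page).flatten(2).transpose(1, 2)     # (1, 49, hidden) visual tokens
tok = torch.randint(0, 30522, (1, 64))
box = torch.rand(1, 64, 4)                     # normalized (x0, y0, x1, y1)
txt = text_emb(tok) + box_emb(box)             # text + spatial features
seq = torch.cat([txt, vis], dim=1)
fused, _ = attn(seq, seq, seq)                 # multi-modal self-attention
```
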
Uses a multi-modal Transformer model to integrate text, layout, and image in the pre-training stage, learning end-to-end cross-modal interaction.
Utilizes BERT as the backbone and adds two new input embeddings: a 2-D position embedding and an image embedding (the latter used only for downstream tasks).
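
A minimal sketch of the 2-D position embedding, assuming box coordinates quantized to a fixed grid, with each corner's x and y looked up in learned tables and added to the token embedding; vocabulary and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TwoDPositionEmbedding(nn.Module):
    """Adds learned x/y embeddings for each box corner to token embeddings."""
    def __init__(self, vocab_size=30522, hidden=768, max_coord=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.x = nn.Embedding(max_coord, hidden)   # shared by x0 and x1
        self.y = nn.Embedding(max_coord, hidden)   # shared by y0 and y1

    def forward(self, token_ids, boxes):
        # boxes: (batch, seq, 4) integer (x0, y0, x1, y1) in [0, max_coord)
        e = self.tok(token_ids)
        e = e + self.x(boxes[..., 0]) + self.y(boxes[..., 1])
        e = e + self.x(boxes[..., 2]) + self.y(boxes[..., 3])
        return e

emb = TwoDPositionEmbedding()
ids = torch.randint(0, 30522, (1, 64))
boxes = torch.randint(0, 1024, (1, 64, 4))
out = emb(ids, boxes)                              # (1, 64, 768)
```
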