A lightweight extension to traditional LLMs that focuses on reasoning over visual documents by incorporating textual semantics and spatial layout, without expensive image encoders.
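A minimal sketch of that idea, attention scores computed from both token embeddings and normalized bounding boxes, with the module name, dimensions, and score mixing chosen here purely for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class TextSpatialAttention(nn.Module):
    """Illustrative single-head attention that mixes text-to-text and
    box-to-box scores; bounding boxes stand in for any image encoder."""

    def __init__(self, dim: int, bbox_dim: int = 4):
        super().__init__()
        self.q_text = nn.Linear(dim, dim)
        self.k_text = nn.Linear(dim, dim)
        self.v_text = nn.Linear(dim, dim)
        # Project normalized (x0, y0, x1, y1) boxes into the same score space.
        self.q_box = nn.Linear(bbox_dim, dim)
        self.k_box = nn.Linear(bbox_dim, dim)
        self.scale = dim ** -0.5

    def forward(self, tokens: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim); boxes: (batch, seq, 4) in [0, 1]
        text_scores = self.q_text(tokens) @ self.k_text(tokens).transpose(-2, -1)
        box_scores = self.q_box(boxes) @ self.k_box(boxes).transpose(-2, -1)
        attn = torch.softmax((text_scores + box_scores) * self.scale, dim=-1)
        return attn @ self.v_text(tokens)

layer = TextSpatialAttention(dim=64)
out = layer(torch.randn(2, 10, 64), torch.rand(2, 10, 4))
print(out.shape)  # torch.Size([2, 10, 64])
```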
A methodology to adapt arbitrary LLMs for document information extraction.
A Vision Transformer model that performs Optical Character Recognition (OCR) to process scientific documents into a markup language.
Explicitly models geometric relations in pre-training and enhances feature representation.
Integrates text, image, and layout information through a Vision-Text-Layout Transformer, enabling unified representation.
Reorganizes tokens using layout information, combines text and visual embeddings, and uses multi-modal transformers with spatial-aware disentangled attention.
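A toy illustration of the token-reorganization step only, serializing OCR words into reading order from their boxes; the line-grouping heuristic and field names are assumptions, not the model's actual procedure:

```python
def reading_order(tokens, line_tolerance=0.01):
    """Sort OCR tokens into rough reading order: group words whose vertical
    centers fall within `line_tolerance` into the same line, then order lines
    top-to-bottom and words left-to-right. Boxes are (x0, y0, x1, y1) in [0, 1]."""
    items = sorted(tokens, key=lambda t: (t["box"][1] + t["box"][3]) / 2)
    lines, current = [], []
    for tok in items:
        cy = (tok["box"][1] + tok["box"][3]) / 2
        if current and abs(cy - current[-1]["_cy"]) > line_tolerance:
            lines.append(current)
            current = []
        current.append({**tok, "_cy": cy})
    if current:
        lines.append(current)
    ordered = [t for line in lines for t in sorted(line, key=lambda t: t["box"][0])]
    return [{"text": t["text"], "box": t["box"]} for t in ordered]

tokens = [
    {"text": "Total", "box": (0.1, 0.50, 0.2, 0.52)},
    {"text": "Invoice", "box": (0.1, 0.10, 0.3, 0.12)},
    {"text": "$42.00", "box": (0.6, 0.50, 0.7, 0.52)},
]
print([t["text"] for t in reading_order(tokens)])  # ['Invoice', 'Total', '$42.00']
```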
A unified text-image multimodal Transformer that learns cross-modal representations by taking the concatenation of text and image embeddings as input.
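A hedged sketch of that input construction, image patches linearly embedded and concatenated with text token embeddings before a shared Transformer encoder; the class name, dimensions, and patch size are illustrative:

```python
import torch
import torch.nn as nn

class UnifiedTextImageEncoder(nn.Module):
    """Toy unified encoder: linearly embeds image patches, concatenates them
    with text token embeddings, and runs a single Transformer over both."""

    def __init__(self, vocab=30522, dim=128, patch=16, channels=3):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.patch_embed = nn.Linear(patch * patch * channels, dim)
        self.patch = patch
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, image):
        # token_ids: (batch, seq); image: (batch, 3, H, W), H and W divisible by patch
        b, c, h, w = image.shape
        patches = image.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        patches = patches.reshape(b, c, -1, self.patch * self.patch)
        patches = patches.permute(0, 2, 1, 3).reshape(b, -1, c * self.patch * self.patch)
        seq = torch.cat([self.text_embed(token_ids), self.patch_embed(patches)], dim=1)
        return self.encoder(seq)

model = UnifiedTextImageEncoder()
out = model(torch.randint(0, 30522, (2, 8)), torch.randn(2, 3, 64, 64))
print(out.shape)  # torch.Size([2, 24, 128]) -> 8 text tokens + 16 image patches
```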
Built upon MatCha, it standardises the plot-to-table task, translating plots into linearized (markdown) tables for processing by LLMs.
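Since the target representation is a linearized markdown table, here is a small helper showing what that flat text looks like for a downstream LLM; the function name and sample values are hypothetical:

```python
def to_markdown_table(header, rows):
    """Linearize tabular data into a markdown table, the kind of flat text
    representation an LLM can reason over directly."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join(lines)

# Hypothetical output of a plot-to-table step for a small bar chart.
print(to_markdown_table(["Year", "Revenue"], [[2022, 1.4], [2023, 2.1]]))
```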
Leverages Pix2Struct and introduces pretraining tasks focused on math reasoning and chart derendering, improving chart and plot comprehension across diverse visual language tasks.
A pretrained image-to-text model designed for visual language understanding, particularly in tasks involving visually-situated language.
An Image Transformer pre-trained in a self-supervised manner on document images.
A seq2seq model that accurately predicts reading order, text, and layout information from document images.
Built upon BERT, it encodes relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy.
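A toy version of the area-masking idea, assuming normalized boxes: sample one rectangular region of the page and mask every token whose center falls inside it (the actual strategy differs in its details):

```python
import random

def area_mask(tokens, mask_ratio=0.3, seed=0):
    """Toy area masking: sample a square region covering roughly `mask_ratio`
    of the page and mark tokens whose box centers fall inside it as masked.
    Boxes are (x0, y0, x1, y1) normalized to [0, 1]."""
    rng = random.Random(seed)
    side = mask_ratio ** 0.5
    x0, y0 = rng.uniform(0, 1 - side), rng.uniform(0, 1 - side)
    region = (x0, y0, x0 + side, y0 + side)
    masked = []
    for tok in tokens:
        cx = (tok["box"][0] + tok["box"][2]) / 2
        cy = (tok["box"][1] + tok["box"][3]) / 2
        inside = region[0] <= cx <= region[2] and region[1] <= cy <= region[3]
        masked.append({**tok, "masked": inside})
    return region, masked

region, masked = area_mask([{"text": "Invoice", "box": (0.1, 0.1, 0.3, 0.12)}])
print(region, masked)
```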
Introduces a bi-directional attention complementation mechanism (BiACM) to enable cross-modal interaction between text and layout.
An OCR-free encoder-decoder Transformer model: the encoder takes in images, and the decoder takes in prompts and encoded images to generate the required text.
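A minimal, generic sketch of that encoder-decoder pattern in plain PyTorch, not the model's actual architecture: the encoder embeds image patches, and the decoder greedily extends a prompt while cross-attending to them:

```python
import torch
import torch.nn as nn

class TinyImageToText(nn.Module):
    """Toy OCR-free encoder-decoder: the encoder embeds image patches, the
    decoder attends to them while extending a textual prompt token by token."""

    def __init__(self, vocab=100, dim=64, patch=16):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Linear(patch * patch * 3, dim)
        enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        dec = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.tok_embed = nn.Embedding(vocab, dim)
        self.lm_head = nn.Linear(dim, vocab)

    def encode(self, image):
        b, c, h, w = image.shape
        p = image.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        p = p.reshape(b, c, -1, self.patch * self.patch).permute(0, 2, 1, 3)
        return self.encoder(self.patch_embed(p.reshape(b, -1, c * self.patch ** 2)))

    @torch.no_grad()
    def generate(self, image, prompt_ids, steps=5):
        memory = self.encode(image)
        ids = prompt_ids
        for _ in range(steps):  # greedy decoding
            hidden = self.decoder(self.tok_embed(ids), memory)
            next_id = self.lm_head(hidden[:, -1]).argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
        return ids

model = TinyImageToText()
print(model.generate(torch.randn(1, 3, 64, 64), torch.tensor([[1, 2]])).shape)
```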
An encoder-only Transformer with a CNN backbone for visual feature extraction that combines text, vision, and spatial features through a multi-modal self-attention layer.
Utilises BERT as the backbone and feeds text, 1D position, and 2D (cell-level) embeddings to the transformer model.
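A small sketch of that input step, summing token, 1D sequence-position, and 2D cell-level (row and column) embeddings before a BERT-style encoder; the vocabulary size and dimensions are placeholders:

```python
import torch
import torch.nn as nn

class CellAwareEmbedding(nn.Module):
    """Toy embedding layer: each token embedding is summed with a 1D sequence
    position embedding and 2D cell-level (row, column) embeddings before being
    passed to a BERT-style encoder."""

    def __init__(self, vocab=30522, dim=128, max_pos=512, max_rows=64, max_cols=64):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_pos, dim)
        self.row = nn.Embedding(max_rows, dim)
        self.col = nn.Embedding(max_cols, dim)

    def forward(self, token_ids, row_ids, col_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok(token_ids) + self.pos(positions)
                + self.row(row_ids) + self.col(col_ids))

embed = CellAwareEmbedding()
token_ids = torch.randint(0, 30522, (2, 6))
row_ids = torch.tensor([[0, 0, 0, 1, 1, 1]] * 2)  # table row of each token
col_ids = torch.tensor([[0, 1, 2, 0, 1, 2]] * 2)  # table column of each token
print(embed(token_ids, row_ids, col_ids).shape)   # torch.Size([2, 6, 128])
```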
A library integrating Detectron2, CNN-RNN OCR models, layout data structures, TensorFlow/PyTorch backends, and a Model Zoo. The toolkit features Tesseract and Google Cloud Vision for OCR, plus active-learning tools and a community platform that ensure efficiency and adaptability.
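A short usage sketch based on the library's documented workflow (detect layout regions, then OCR the text blocks); the PubLayNet model path, score threshold, and file name are examples, not requirements:

```python
# Requires: pip install layoutparser "layoutparser[ocr]" plus a Detectron2 install.
import layoutparser as lp
import cv2

image = cv2.imread("page.png")   # any scanned page image
image = image[..., ::-1]         # BGR -> RGB

# Layout detection with a pre-trained PubLayNet model from the Model Zoo.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)
layout = model.detect(image)

# OCR each detected text block with the Tesseract agent.
ocr_agent = lp.TesseractAgent(languages="eng")
for block in layout:
    if block.type == "Text":
        segment = block.crop_image(image)
        print(ocr_agent.detect(segment))
```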
Uses a multi-modal Transformer model to integrate text, layout, and image in the pre-training stage and learn end-to-end cross-modal interaction.
Formulates Information Extraction (IE) as a spatial dependency parsing problem.
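One way to read that formulation, sketched with a toy bilinear scorer that rates every ordered pair of text boxes and keeps high-scoring links as dependency edges; the serialization and grouping steps of the actual method are omitted, and all names here are illustrative:

```python
import torch
import torch.nn as nn

class RelationScorer(nn.Module):
    """Toy spatial-dependency scorer: given one embedding per text box, a
    bilinear head scores every ordered pair (head -> dependent); edges whose
    score clears a threshold form the extraction graph."""

    def __init__(self, dim=64):
        super().__init__()
        self.head_proj = nn.Linear(dim, dim)
        self.dep_proj = nn.Linear(dim, dim)

    def forward(self, box_embeddings):
        # box_embeddings: (num_boxes, dim) -> (num_boxes, num_boxes) pair scores
        return self.head_proj(box_embeddings) @ self.dep_proj(box_embeddings).T

scorer = RelationScorer()
embeddings = torch.randn(4, 64)  # e.g. "Total", "$42.00", "Date", "2024-11-22"
scores = scorer(embeddings)
edges = (torch.sigmoid(scores) > 0.5).nonzero().tolist()
print(edges)  # list of [head_index, dependent_index] pairs above the threshold
```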