A lightweight extension to traditional LLMs that focuses on reasoning over visual documents by incorporating textual semantics and spatial layout, without expensive image encoders.
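A minimal sketch of that idea, attention scores computed from both token embeddings and normalized bounding boxes, with the module name, dimensions, and score mixing chosen here purely for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class TextSpatialAttention(nn.Module):
    """Illustrative single-head attention that mixes text-to-text and
    box-to-box scores; bounding boxes stand in for any image encoder."""

    def __init__(self, dim: int, bbox_dim: int = 4):
        super().__init__()
        self.q_text = nn.Linear(dim, dim)
        self.k_text = nn.Linear(dim, dim)
        self.v_text = nn.Linear(dim, dim)
        # Project normalized (x0, y0, x1, y1) boxes into the same score space.
        self.q_box = nn.Linear(bbox_dim, dim)
        self.k_box = nn.Linear(bbox_dim, dim)
        self.scale = dim ** -0.5

    def forward(self, tokens: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim); boxes: (batch, seq, 4) in [0, 1]
        text_scores = self.q_text(tokens) @ self.k_text(tokens).transpose(-2, -1)
        box_scores = self.q_box(boxes) @ self.k_box(boxes).transpose(-2, -1)
        attn = torch.softmax((text_scores + box_scores) * self.scale, dim=-1)
        return attn @ self.v_text(tokens)

layer = TextSpatialAttention(dim=64)
out = layer(torch.randn(2, 10, 64), torch.rand(2, 10, 4))
print(out.shape)  # torch.Size([2, 10, 64])
```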
A methodology to adapt arbitrary LLMs for document information extraction.
A Vision Transformer model that performs Optical Character Recognition (OCR) to process scientific documents into a markup language.
Explicitly models geometric relations in pre-training and enhances feature representation.
Integrates text, image, and layout information through a Vision-Text-Layout Transformer, enabling unified representation.
Reorganizes tokens using layout information, combines text and visual embeddings, and uses multi-modal transformers with spatial-aware disentangled attention.
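A toy illustration of the token-reorganization step only, serializing OCR words into reading order from their boxes; the line-grouping heuristic and field names are assumptions, not the model's actual procedure:

```python
def reading_order(tokens, line_tolerance=0.01):
    """Sort OCR tokens into rough reading order: group words whose vertical
    centers fall within `line_tolerance` into the same line, then order lines
    top-to-bottom and words left-to-right. Boxes are (x0, y0, x1, y1) in [0, 1]."""
    items = sorted(tokens, key=lambda t: (t["box"][1] + t["box"][3]) / 2)
    lines, current = [], []
    for tok in items:
        cy = (tok["box"][1] + tok["box"][3]) / 2
        if current and abs(cy - current[-1]["_cy"]) > line_tolerance:
            lines.append(current)
            current = []
        current.append({**tok, "_cy": cy})
    if current:
        lines.append(current)
    ordered = [t for line in lines for t in sorted(line, key=lambda t: t["box"][0])]
    return [{"text": t["text"], "box": t["box"]} for t in ordered]

tokens = [
    {"text": "Total", "box": (0.1, 0.50, 0.2, 0.52)},
    {"text": "Invoice", "box": (0.1, 0.10, 0.3, 0.12)},
    {"text": "$42.00", "box": (0.6, 0.50, 0.7, 0.52)},
]
print([t["text"] for t in reading_order(tokens)])  # ['Invoice', 'Total', '$42.00']
```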
A unified text-image multimodal Transformer that learns cross-modal representations by taking the concatenation of text and image embeddings as input.
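A hedged sketch of that input construction, image patches linearly embedded and concatenated with text token embeddings before a shared Transformer encoder; the class name, dimensions, and patch size are illustrative:

```python
import torch
import torch.nn as nn

class UnifiedTextImageEncoder(nn.Module):
    """Toy unified encoder: linearly embeds image patches, concatenates them
    with text token embeddings, and runs a single Transformer over both."""

    def __init__(self, vocab=30522, dim=128, patch=16, channels=3):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.patch_embed = nn.Linear(patch * patch * channels, dim)
        self.patch = patch
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, image):
        # token_ids: (batch, seq); image: (batch, 3, H, W), H and W divisible by patch
        b, c, h, w = image.shape
        patches = image.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        patches = patches.reshape(b, c, -1, self.patch * self.patch)
        patches = patches.permute(0, 2, 1, 3).reshape(b, -1, c * self.patch * self.patch)
        seq = torch.cat([self.text_embed(token_ids), self.patch_embed(patches)], dim=1)
        return self.encoder(seq)

model = UnifiedTextImageEncoder()
out = model(torch.randint(0, 30522, (2, 8)), torch.randn(2, 3, 64, 64))
print(out.shape)  # torch.Size([2, 24, 128]) -> 8 text tokens + 16 image patches
```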
Built upon MatCha, it standardises the plot-to-table task, translating plots into linearized (markdown) tables for processing by LLMs.
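Since the target representation is a linearized markdown table, here is a small helper showing what that flat text looks like for a downstream LLM; the function name and sample values are hypothetical:

```python
def to_markdown_table(header, rows):
    """Linearize tabular data into a markdown table, the kind of flat text
    representation an LLM can reason over directly."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join(lines)

# Hypothetical output of a plot-to-table step for a small bar chart.
print(to_markdown_table(["Year", "Revenue"], [[2022, 1.4], [2023, 2.1]]))
```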
Leverages Pix2Struct and introduces pretraining tasks focused on math reasoning and chart derendering, improving chart and plot comprehension across diverse visual language tasks.
A pretrained image-to-text model designed for visual language understanding, particularly in tasks involving visually-situated language.
An Image Transformer pre-trained in a self-supervised manner on document images.
A seq2seq model that accurately predicts reading order, text, and layout information from document images.
Built upon BERT, it encodes relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy.
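A toy version of the area-masking idea, assuming normalized boxes: sample one rectangular region of the page and mask every token whose center falls inside it (the actual strategy differs in its details):

```python
import random

def area_mask(tokens, mask_ratio=0.3, seed=0):
    """Toy area masking: sample a square region covering roughly `mask_ratio`
    of the page and mark tokens whose box centers fall inside it as masked.
    Boxes are (x0, y0, x1, y1) normalized to [0, 1]."""
    rng = random.Random(seed)
    side = mask_ratio ** 0.5
    x0, y0 = rng.uniform(0, 1 - side), rng.uniform(0, 1 - side)
    region = (x0, y0, x0 + side, y0 + side)
    masked = []
    for tok in tokens:
        cx = (tok["box"][0] + tok["box"][2]) / 2
        cy = (tok["box"][1] + tok["box"][3]) / 2
        inside = region[0] <= cx <= region[2] and region[1] <= cy <= region[3]
        masked.append({**tok, "masked": inside})
    return region, masked

region, masked = area_mask([{"text": "Invoice", "box": (0.1, 0.1, 0.3, 0.12)}])
print(region, masked)
```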
Introduces a bi-directional attention complementation mechanism (BiACM) to enable cross-modal interaction between text and layout.
An OCR-free encoder-decoder Transformer model: the encoder takes in images, and the decoder takes in prompts and encoded images to generate the required text.
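A minimal, generic sketch of that encoder-decoder pattern in plain PyTorch, not the model's actual architecture: the encoder embeds image patches, and the decoder greedily extends a prompt while cross-attending to them:

```python
import torch
import torch.nn as nn

class TinyImageToText(nn.Module):
    """Toy OCR-free encoder-decoder: the encoder embeds image patches, the
    decoder attends to them while extending a textual prompt token by token."""

    def __init__(self, vocab=100, dim=64, patch=16):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Linear(patch * patch * 3, dim)
        enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        dec = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.tok_embed = nn.Embedding(vocab, dim)
        self.lm_head = nn.Linear(dim, vocab)

    def encode(self, image):
        b, c, h, w = image.shape
        p = image.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        p = p.reshape(b, c, -1, self.patch * self.patch).permute(0, 2, 1, 3)
        return self.encoder(self.patch_embed(p.reshape(b, -1, c * self.patch ** 2)))

    @torch.no_grad()
    def generate(self, image, prompt_ids, steps=5):
        memory = self.encode(image)
        ids = prompt_ids
        for _ in range(steps):  # greedy decoding
            hidden = self.decoder(self.tok_embed(ids), memory)
            next_id = self.lm_head(hidden[:, -1]).argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
        return ids

model = TinyImageToText()
print(model.generate(torch.randn(1, 3, 64, 64), torch.tensor([[1, 2]])).shape)
```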
An encoder-only Transformer with a CNN backbone for visual feature extraction that combines text, vision, and spatial features through a multi-modal self-attention layer.
Utilises BERT as the backbone and feeds text, 1D position, and 2D (cell-level) embeddings to the transformer model.
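A small sketch of that input step, summing token, 1D sequence-position, and 2D cell-level (row and column) embeddings before a BERT-style encoder; the vocabulary size and dimensions are placeholders:

```python
import torch
import torch.nn as nn

class CellAwareEmbedding(nn.Module):
    """Toy embedding layer: each token embedding is summed with a 1D sequence
    position embedding and 2D cell-level (row, column) embeddings before being
    passed to a BERT-style encoder."""

    def __init__(self, vocab=30522, dim=128, max_pos=512, max_rows=64, max_cols=64):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_pos, dim)
        self.row = nn.Embedding(max_rows, dim)
        self.col = nn.Embedding(max_cols, dim)

    def forward(self, token_ids, row_ids, col_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok(token_ids) + self.pos(positions)
                + self.row(row_ids) + self.col(col_ids))

embed = CellAwareEmbedding()
token_ids = torch.randint(0, 30522, (2, 6))
row_ids = torch.tensor([[0, 0, 0, 1, 1, 1]] * 2)  # table row of each token
col_ids = torch.tensor([[0, 1, 2, 0, 1, 2]] * 2)  # table column of each token
print(embed(token_ids, row_ids, col_ids).shape)   # torch.Size([2, 6, 128])
```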
A library integrating Detectron2, CNN-RNN OCR models, layout data structures, TensorFlow/PyTorch backends, and a Model Zoo. The toolkit features Tesseract and Google Cloud Vision for OCR, plus active-learning tools and a community platform that ensure efficiency and adaptability.
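A short usage sketch based on the library's documented workflow (detect layout regions, then OCR the text blocks); the PubLayNet model path, score threshold, and file name are examples, not requirements:

```python
# Requires: pip install layoutparser "layoutparser[ocr]" plus a Detectron2 install.
import layoutparser as lp
import cv2

image = cv2.imread("page.png")   # any scanned page image
image = image[..., ::-1]         # BGR -> RGB

# Layout detection with a pre-trained PubLayNet model from the Model Zoo.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)
layout = model.detect(image)

# OCR each detected text block with the Tesseract agent.
ocr_agent = lp.TesseractAgent(languages="eng")
for block in layout:
    if block.type == "Text":
        segment = block.crop_image(image)
        print(ocr_agent.detect(segment))
```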
Uses a multi-modal Transformer model to integrate text, layout, and image in the pre-training stage and learn end-to-end cross-modal interaction.
Formulates Information Extraction (IE) as a spatial dependency parsing problem.
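One way to read that formulation, sketched with a toy bilinear scorer that rates every ordered pair of text boxes and keeps high-scoring links as dependency edges; the serialization and grouping steps of the actual method are omitted, and all names here are illustrative:

```python
import torch
import torch.nn as nn

class RelationScorer(nn.Module):
    """Toy spatial-dependency scorer: given one embedding per text box, a
    bilinear head scores every ordered pair (head -> dependent); edges whose
    score clears a threshold form the extraction graph."""

    def __init__(self, dim=64):
        super().__init__()
        self.head_proj = nn.Linear(dim, dim)
        self.dep_proj = nn.Linear(dim, dim)

    def forward(self, box_embeddings):
        # box_embeddings: (num_boxes, dim) -> (num_boxes, num_boxes) pair scores
        return self.head_proj(box_embeddings) @ self.dep_proj(box_embeddings).T

scorer = RelationScorer()
embeddings = torch.randn(4, 64)  # e.g. "Total", "$42.00", "Date", "2024-11-22"
scores = scorer(embeddings)
edges = (torch.sigmoid(scores) > 0.5).nonzero().tolist()
print(edges)  # list of [head_index, dependent_index] pairs above the threshold
```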