Ritvik Rastogi

Nov 22, 2024

Document Information Processing

A lightweight extension to traditional LLMs for reasoning over visual documents, incorporating textual semantics and spatial layout without expensive image encoders.
A Vision Transformer model that performs Optical Character Recognition (OCR) to process scientific documents into a markup language.
Integrates text, image, and layout information through a Vision-Text-Layout Transformer, enabling unified representation.
Reorganizes tokens using layout information, combines text and visual embeddings, and utilizes multi-modal transformers with spatially aware disentangled attention.
A unified text-image multimodal Transformer that learns cross-modal representations by taking the concatenation of text and image embeddings as input (a minimal sketch of this setup appears after this list).
Built upon MatCha, standardises the plot-to-table task, translating plots into linearised tables (markdown) for processing by LLMs.
Leverages Pix2Struct and introduces pretraining tasks focused on math reasoning and chart derendering, improving chart and plot comprehension across diverse visual language tasks.
A pretrained image-to-text model designed for visual language understanding, particularly in tasks involving visually-situated language.
Built upon BERT, encodes the relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy.
Introduces a bi-directional attention complementation mechanism (BiACM) to accomplish cross-modal interaction between text and layout.
An OCR-free encoder-decoder Transformer model: the encoder takes in images, and the decoder takes in prompts along with the encoded images to generate the required text (see the encoder-decoder sketch after this list).
An encoder-only Transformer with a CNN backbone for visual feature extraction, combining text, vision, and spatial features through a multi-modal self-attention layer.
Utilises BERT as the backbone and feeds text, 1D position, and cell-level 2D position embeddings to the Transformer model.
A library integrating Detectron2, CNN-RNN OCR, layout data structures, TensorFlow/PyTorch, and a Model Zoo. The toolkit features Tesseract and Google Cloud Vision for OCR, along with active learning tools and a community platform that ensure efficiency and adaptability.
Uses a multi-modal Transformer model to integrate text, layout, and image in the pre-training stage, learning end-to-end cross-modal interaction.
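
For the story above describing a unified text-image multimodal Transformer, here is a minimal sketch of the core idea: text token embeddings and image patch embeddings are concatenated into one sequence and passed through a single Transformer encoder to produce cross-modal representations. All class names, dimensions, and hyperparameters below are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn


class UnifiedTextImageEncoder(nn.Module):
    """Illustrative: one encoder over concatenated text + image patch embeddings."""

    def __init__(self, vocab_size=30522, d_model=256, n_heads=4, n_layers=4,
                 patch_size=16, image_size=224, max_text_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Each non-overlapping image patch is linearly projected to the shared width.
        self.patch_emb = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        num_patches = (image_size // patch_size) ** 2
        self.pos_emb = nn.Parameter(torch.zeros(1, max_text_len + num_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, input_ids, pixel_values):
        text = self.token_emb(input_ids)                # (B, T, D)
        patches = self.patch_emb(pixel_values)          # (B, D, H', W')
        patches = patches.flatten(2).transpose(1, 2)    # (B, P, D)
        x = torch.cat([text, patches], dim=1)           # unified multimodal sequence
        x = x + self.pos_emb[:, : x.size(1)]
        return self.encoder(x)                          # cross-modal representations


# Dummy usage: 2 documents, 32 text tokens each, 224x224 page crops.
model = UnifiedTextImageEncoder()
out = model(torch.randint(0, 30522, (2, 32)), torch.randn(2, 3, 224, 224))
print(out.shape)  # (2, 32 + 196, 256)
```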
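
For the OCR-free encoder-decoder story above, the sketch below shows the general flow the blurb describes: a vision encoder turns the page image into a sequence of features, and an autoregressive text decoder, conditioned on a prompt, cross-attends to those features to generate the required text. It is a minimal illustration under assumed sizes and names, not any specific model's architecture or API.

```python
import torch
import torch.nn as nn


class OcrFreeDocModel(nn.Module):
    """Illustrative: image encoder + prompt-conditioned text decoder with cross-attention."""

    def __init__(self, vocab_size=32000, d_model=256, n_heads=4, n_layers=4,
                 patch_size=16):
        super().__init__()
        # Vision encoder: patchify the image, then contextualize the patches.
        self.patch_emb = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Text decoder: embeds prompt/previous tokens and attends to encoded image features.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, pixel_values, prompt_ids):
        feats = self.patch_emb(pixel_values).flatten(2).transpose(1, 2)  # (B, P, D)
        memory = self.encoder(feats)                                     # encoded image
        tgt = self.token_emb(prompt_ids)                                 # (B, T, D)
        seq_len = tgt.size(1)
        # Causal mask so each position only attends to earlier prompt/output tokens.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)                                      # next-token logits


# Dummy usage: one page image plus an 8-token task prompt.
model = OcrFreeDocModel()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # (1, 8, 32000); decoding would extend the prompt token by token
```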