Ritvik Rastogi

Nov 13, 2024

33 stories


Multi Modal Transformers

A collection of small, efficient, open-source vision-language models built on top of Danube, trained on 37 million image-text pairs, specifically designed to perform well on document analysis and OCR tasks while maintaining strong performance on general vision-language benchmarks.
A family of open-weight vision-language models that achieve state-of-the-art performance by leveraging a novel, human-annotated image caption dataset called PixMo.
A family of multimodal large language models that compares decoder-only multimodal LLMs with cross-attention-based models, proposes a hybrid architecture, and introduces a 1-D tile-tagging design for tile-based dynamic high-resolution images.
A 12B-parameter natively multimodal vision-language model trained on interleaved image and text data that demonstrates strong performance on multimodal tasks and excels at instruction following.
A family of visual language models that enables image and video understanding with improved training recipes, exploring enhanced vision-language fusion, higher input resolution, and broader modalities and applications.
A VLM based on Llama 3.1 and SigLIP-SO400M, trained efficiently using only open datasets and a straightforward pipeline, with significantly stronger performance on document understanding tasks.
A comprehensive system for developing Large Multimodal Models, comprising curated datasets, training recipes, model architectures, and pre-trained models that demonstrate strong in-context learning capabilities and competitive performance on various tasks.
Combines the SigLIP vision model with the Gemma language model and follows the PaLI-3 training recipe to achieve strong performance on various vision-language tasks.
A family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence (a minimal early-fusion sketch follows this list).
A more lightweight variant of Gemini 1.5 Pro, designed for efficiency with minimal regression in quality, making it suitable for applications where compute resources are limited.
An omni model accepting and generating various types of inputs and outputs, including text, audio, images, and video.
Provides a comprehensive introduction to VLMs, covering their definition, functionality, training methods, and evaluation approaches, aiming to help researchers and practitioners enter the field and advance the development of VLMs for various applications.
An improvement upon Idefics1 with enhanced OCR capabilities, a simplified architecture, and better pre-trained backbones, trained on a mixture of openly available datasets and fine-tuned on task-oriented data.
A multimodal LLM that uses a ViT-H image encoder at 378x378px resolution, is pretrained on a data mix of image-text documents and text-only documents, and is scaled up to 3B, 7B, and 30B parameters for enhanced performance across various tasks.
A family of VLMs consisting of the Haiku, Sonnet, and Opus models that sets new industry standards for cognitive tasks, offering varying levels of intelligence, speed, and cost-efficiency.
A highly compute-efficient multimodal mixture-of-experts model that excels in long-context retrieval tasks and understanding across text, video, and audio modalities.
An improved version of LLaVA 1.5 with enhanced reasoning, OCR, and world knowledge capabilities, featuring increased image resolution.
A MoE-based sparse LVLM framework that activates only the top-k experts through routers during deployment, maintaining computational efficiency while achieving performance comparable to larger models (see the routing sketch after this list).
A family of highly capable multi-modal models, trained jointly across image, audio, video, and text data for the purpose of building a model with strong generalist capabilities across modalities.
Bridges the gap between a frozen pretrained language model and image encoder with a trainable visual expert module in the attention and FFN layers (see the visual-expert sketch after this list).
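
For the early-fusion mixed-modal family above, the core idea is that images are quantised into discrete codes and placed in the same token sequence as text, so one decoder models both modalities. Below is a minimal sketch of that idea; the vocabulary sizes, tokenizer, and interleaving pattern are assumptions for illustration, not the model's actual code.

```python
# Minimal early-fusion sketch: image codes are offset past the text vocabulary and
# interleaved with text tokens into one sequence a single decoder can model.
# Vocabulary sizes and the interleaving pattern are illustrative assumptions.
import torch

TEXT_VOCAB = 32000     # assumed text vocabulary size
IMAGE_CODES = 8192     # assumed image-tokenizer codebook size

def fuse(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Build one mixed-modal sequence: some text, then the image codes, then more text."""
    image_ids = image_codes + TEXT_VOCAB          # shift image codes into their own id range
    return torch.cat([text_ids[:3], image_ids, text_ids[3:]])

text_ids = torch.randint(0, TEXT_VOCAB, (6,))
image_codes = torch.randint(0, IMAGE_CODES, (4,))
print(fuse(text_ids, image_codes))                # one token stream, mixed modalities
```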
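For the sparse MoE entry, top-k routing can be illustrated with a small PyTorch module: a linear router scores every expert per token, only the top-k experts run, and their outputs are combined with softmax-normalised routing weights. This is a generic sketch with assumed dimensions and expert definitions, not the framework's implementation.

```python
# Generic top-k mixture-of-experts routing sketch (assumed shapes and expert design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, dim=512, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)            # routing logits per token
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):                                     # x: (tokens, dim)
        logits = self.router(x)                               # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)    # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)                  # normalise over the kept experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(8, 512)
print(TopKMoELayer()(tokens).shape)   # torch.Size([8, 512])
```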
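For the last entry, the visual-expert idea can be sketched as a layer in which image tokens pass through their own trainable weights while text tokens keep using the frozen language-model weights. The sketch below covers only the FFN half, with assumed shapes and module names.

```python
# Visual-expert FFN sketch: frozen text path, trainable image path (names/shapes assumed).
import torch
import torch.nn as nn

class VisualExpertFFN(nn.Module):
    def __init__(self, dim=512, hidden=2048):
        super().__init__()
        self.text_ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        for p in self.text_ffn.parameters():      # keep the pretrained language-model path frozen
            p.requires_grad = False
        self.image_ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, hidden_states, image_mask):
        # image_mask: (tokens,) bool, True where the token comes from the image encoder
        out = torch.empty_like(hidden_states)
        out[~image_mask] = self.text_ffn(hidden_states[~image_mask])   # text tokens: frozen weights
        out[image_mask] = self.image_ffn(hidden_states[image_mask])    # image tokens: visual expert
        return out

x = torch.randn(10, 512)
mask = torch.zeros(10, dtype=torch.bool)
mask[:4] = True                                   # pretend the first 4 tokens are image tokens
print(VisualExpertFFN()(x, mask).shape)           # torch.Size([10, 512])
```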