Papers Explained 332: Aya Vision

Ritvik Rastogi
5 min read · Mar 18, 2025


Aya Vision builds on the success of Aya Expanse, a family of state-of-the-art multilingual language models, and extends it with a combination of advanced techniques, including synthetic annotations, scaling up multilingual data through translation and rephrasing, and multimodal model merging.

The models are available on HuggingFace.

Architecture

To process images of arbitrary resolution, and especially high-resolution images, Aya Vision dynamically resizes and splits higher-resolution images into multiple tiles so that the image encoder can produce rich image features. The Aya Vision models use the recently released SigLIP2-patch14-384 model to initialize the vision encoder.
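The post does not spell out the exact resizing and tiling policy, but the general idea is to resize each high-resolution image so its sides are multiples of the encoder's input size and then cut it into fixed-size crops. Here is a minimal sketch, where the 384-pixel tile size matches the SigLIP2 input resolution and the grid heuristic and tile cap are illustrative assumptions, not values from the paper:

```python
from PIL import Image

TILE = 384  # SigLIP2-patch14-384 input resolution

def tile_image(img: Image.Image, tile: int = TILE, max_tiles: int = 12) -> list[Image.Image]:
    """Illustrative sketch: resize so both sides are multiples of `tile`,
    then split the image into tile x tile crops for the vision encoder.
    The exact policy used by Aya Vision may differ."""
    w, h = img.size
    # Pick a grid that roughly preserves aspect ratio without exceeding max_tiles.
    cols, rows = max(1, round(w / tile)), max(1, round(h / tile))
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    resized = img.resize((cols * tile, rows * tile))
    return [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows)
        for c in range(cols)
    ]
```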

As dynamic resizing leads to a larger number of image tokens passing through the vision-language connector and LLM decoder, a downsampling method called Pixel Shuffle is used to compress the number of image tokens by 4x. After downsampling, image tokens are aligned to the language model input embeddings through a vision-language connector and passed to an LLM decoder.
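Pixel shuffle folds each 2x2 neighbourhood of image tokens into the channel dimension, cutting the token count by 4x before the connector projects the result to the LLM's embedding size. A minimal PyTorch sketch; the grid size and channel width in the example are illustrative, not the model's actual values:

```python
import torch

def pixel_shuffle_downsample(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Fold each factor x factor patch of tokens into the channel dimension,
    reducing the token count by factor**2 (4x for factor=2).
    x: (batch, height, width, channels) grid of vision-encoder tokens."""
    b, h, w, c = x.shape
    assert h % factor == 0 and w % factor == 0, "token grid must be divisible by factor"
    x = x.reshape(b, h // factor, factor, w // factor, factor, c)
    x = x.permute(0, 1, 3, 2, 4, 5)  # (b, h/f, w/f, f, f, c)
    return x.reshape(b, h // factor, w // factor, c * factor * factor)

# Illustrative shapes only: a 26x26 grid of 1024-dim tokens (676 tokens).
tokens = torch.randn(1, 26, 26, 1024)
compressed = pixel_shuffle_downsample(tokens)  # (1, 13, 13, 4096) -> 169 tokens
# The vision-language connector then projects the enlarged channel dimension
# down to the language model's input embedding size.
```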

For Aya Vision 8B, the LLM is initialized from Cohere Command R7B that has been further post-trained using the Aya Expanse recipe. For Aya Vision 32B, the language model is initialized from Aya Expanse 32B.

Training process

Aya Vision models are trained in two stages: vision-language alignment and supervised fine-tuning (SFT).

In the vision-language alignment stage, only the vision-language connector is trained, while the vision encoder and the language model weights are kept frozen.

In the SFT stage, both the connector and the language model are trained on a diverse set of multimodal tasks in 23 languages.
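A minimal sketch of how the two stages differ in which parameters receive gradients, assuming a PyTorch-style model with `vision_encoder`, `connector`, and `language_model` submodules (the attribute names and learning rate are hypothetical):

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: torch.nn.Module, stage: str) -> list[torch.nn.Parameter]:
    """Stage "alignment": train only the vision-language connector.
    Stage "sft": train the connector and the language model."""
    set_trainable(model.vision_encoder, False)          # vision encoder stays frozen
    set_trainable(model.connector, True)                # connector trained in both stages
    set_trainable(model.language_model, stage == "sft")
    return [p for p in model.parameters() if p.requires_grad]

# optimizer = torch.optim.AdamW(configure_stage(model, "alignment"), lr=1e-4)  # lr is illustrative
```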

Multimodal Data Enhancement and Expanding Language Coverage

To ensure strong performance across underrepresented languages, synthetic annotations are first gathered using a diverse pool of high-quality English datasets. A large volume of this synthetically annotated data is then translated into 23 languages. To avoid translation artefacts and keep answers fluent and precise, the translated prompt/generation pairs are rephrased by matching them against the original high-quality synthetic samples, expanding language coverage where real-world datasets are scarce.
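Concretely, the data-expansion loop amounts to translating each English prompt/response pair into every target language and then rephrasing the translation while conditioning on the original sample. A rough sketch in which `translate` and `rephrase_with_llm` are hypothetical helpers standing in for machinery the post does not detail:

```python
def expand_language_coverage(english_pairs, target_languages, translate, rephrase_with_llm):
    """english_pairs: list of {"prompt": str, "response": str} synthetic samples.
    Returns multilingual pairs in which each translation has been rephrased
    against its original English sample to remove translation artefacts."""
    multilingual = []
    for pair in english_pairs:
        for lang in target_languages:
            translated = {
                "prompt": translate(pair["prompt"], lang),
                "response": translate(pair["response"], lang),
            }
            # Rephrase the translated pair, conditioning on the original
            # high-quality English sample to keep the answer fluent and precise.
            rephrased = rephrase_with_llm(translated, reference=pair, language=lang)
            multilingual.append({"language": lang, **rephrased})
    return multilingual
```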

When supervised fine-tuned only on the original academic datasets, the 8B model reaches a 40.9% win rate across 23 languages on AyaVisionBench against Pangea 7B, a multilingual VLM, whereas adding synthetic annotations and scaling up the multilingual data leads to a 58.1% win rate, a gain of 17.2 percentage points.

Multimodal Model Merging

To make the model generate high-quality responses to both image and text inputs, the base language model is merged with the fine-tuned vision-language model. Model merging enhances the generative capabilities of the final model, which leads to a 70% win rate across 23 languages on AyaVisionBench against Pangea 7B.
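The post does not name the merging method, so the sketch below uses simple linear interpolation between the base language model's weights and the matching language-model weights inside the fine-tuned vision-language model; the interpolation weight and the assumption that parameter names line up are both illustrative:

```python
import torch

def merge_language_weights(base_lm_state: dict, vlm_state: dict, alpha: float = 0.5) -> dict:
    """Illustrative linear merge: merged = alpha * vlm + (1 - alpha) * base
    for every language-model tensor present in both state dicts.
    Vision-encoder and connector weights are left untouched."""
    merged = dict(vlm_state)
    for name, base_param in base_lm_state.items():
        if name in vlm_state and vlm_state[name].shape == base_param.shape:
            merged[name] = alpha * vlm_state[name] + (1.0 - alpha) * base_param
    return merged

# merged_state = merge_language_weights(base_lm.state_dict(), vlm.state_dict())
# vlm.load_state_dict(merged_state)
```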

Scaling up to 32B

Finally, the recipe is scaled from 8B to 32B, resulting in Aya Vision 32B. Aya Vision 32B shows significant improvements in win rates thanks to its stronger text-backbone initialization, and outperforms models more than twice its size.

Aya Vision Benchmark

AyaVisionBench, constructed based on real-world applications, covers 23 languages and 9 distinct task categories, with 135 image-question pairs per language. This dataset is designed to assess a model’s ability to perform a diverse range of vision-language tasks, including captioning, chart and figure understanding, identifying differences between two images, general visual question answering, OCR, document understanding, text transcription, reasoning involving logic and math, and converting screenshots to code.

To create this dataset, images are selected from the Cauldron held-out test set, a large collection derived from 50 high-quality datasets, ensuring they had not been seen during training. For each image, a corresponding question that explicitly requires visual context to answer is generated. These questions are synthetically generated and subsequently refined through human annotator review and validation to ensure each question is clear, relevant, and truly dependent on the image.

Evaluation

  • In pairwise comparisons, Aya Vision 32B outperforms models more than twice its size, such as Llama-3.2 90B Vision, Molmo 72B, and Qwen2.5-VL 72B, with win rates ranging from 50% to 64% on AyaVisionBench and from 52% to 72% on mWildVision, averaged across 23 languages.
  • Aya Vision 8B achieves the best multilingual multimodal performance in its parameter class, outperforming leading models such as Qwen2.5-VL 7B, Pixtral 12B, Gemini Flash 1.5 8B, Llama-3.2 11B Vision, Molmo-D 7B, and Pangea 7B with win rates of up to 79% on AyaVisionBench and up to 81% on mWildVision; a sketch of how such a pairwise win rate is computed follows below.
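These win rates come from pairwise comparisons: for every benchmark example, both models answer and a judge picks the better response. A minimal sketch in which `judge` is a hypothetical callable (the post does not specify the judging setup) and ties are split evenly:

```python
def pairwise_win_rate(examples, model_a, model_b, judge) -> float:
    """Fraction of examples on which model_a beats model_b.
    `judge(question, image, answer_a, answer_b)` is assumed to return
    "a", "b", or "tie"; ties count as half a win."""
    wins = ties = 0
    for ex in examples:
        answer_a = model_a(ex["image"], ex["question"])
        answer_b = model_b(ex["image"], ex["question"])
        verdict = judge(ex["question"], ex["image"], answer_a, answer_b)
        if verdict == "a":
            wins += 1
        elif verdict == "tie":
            ties += 1
    return (wins + 0.5 * ties) / len(examples)
```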

Paper

A Deepdive into Aya Vision: Advancing the Frontier of Multilingual Multimodality

