List: Datasets | Curated by Ritvik Rastogi

Oct 29, 2024

8 stories

Datasets

A high-quality dataset of detailed image descriptions collected through speech-based annotations, enabling the creation of more robust and accurate VLMs.

Ritvik Rastogi

Papers Explained 241: Pixmo and Molmo

Molmo (Multimodal Open Language Model) utilizes PixMo (Pixels for Molmo), a high-quality dataset of detailed image captions collected from…

Oct 29, 2024

An open web-scale filtered dataset of interleaved image-text documents comprising 141M web pages, 353M associated images, and 115B text tokens, extracted from CommonCrawl.

Ritvik Rastogi

Papers Explained 179: Obelics, Idefics

This work curates Obelics, an openly accessible web-scale dataset consisting of 141M multimodal English web documents containing 353M…

Aug 7, 2024

A massive dataset for DocVQA containing 2.4M images, 9.5M question-answer pairs, and 1.3M PDF documents, generated by taking transcriptions from the PDFA OCR dataset and using a Phi-3-small model to generate Q/A pairs.

Ritvik Rastogi

Papers Explained 178: Docmatix

Docmatix is a large-scale dataset for Document Visual Question Answering (DocVQA) that is hundreds of times larger than previously…

Aug 7, 2024

A large-scale dataset for pretraining LLMs, consisting of 15T tokens, shown to produce better-performing models than other open pretraining datasets.

Ritvik Rastogi

Papers Explained 174: FineWeb

FineWeb is a large-scale dataset for pretraining LLMs, comprising 15T tokens and occupying 44TB of disk space. It was created by combining 96…

Aug 5, 2024

A synthetic dataset containing over 30M files and 25B tokens, generated by Mixtral-8x7B-Instruct-v0., that aims to reproduce the training data for Phi-1.5.

Ritvik Rastogi

Papers Explained 175: Cosmopedia

Cosmopedia aims to reproduce the training data used for Phi-1.5. It is a dataset of synthetic textbooks, blog posts, stories, posts, and…

Aug 5, 2024

A synthetic dataset of 2M pairs of HTML code and corresponding screenshots, generated with LLMs, intended to accelerate research on converting screenshots into HTML.

Ritvik Rastogi

Papers Explained 177: WebSight

Although VLMs have made significant progress in various tasks, converting website screenshots into functional HTML code has been minimally…

Aug 6, 2024

A human-curated instruction-following dataset that spans 65 languages, created to bridge the language gap in datasets for natural language processing.

Ritvik Rastogi

Papers Explained 108: Aya Dataset

This work contributes four key resources: the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite.

Mar 4, 2024

An open corpus of three trillion tokens designed to support language model pretraining research.

Ritvik Rastogi

Papers Explained 97: Dolma

Dolma (Data for Open Language Models’ Appetite) is an open corpus of three trillion tokens designed to support language model pretraining…

Feb 5, 2024
