A large language model trained with reinforcement learning to think before it answers, producing a long internal chain of thought before giving its final response.
Applies the Minitron pruning-and-distillation approach to Llama 3.1 8B and Mistral-NeMo 12B, and additionally applies teacher correction to align the teacher with the new data distribution.
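A minimal sketch of the teacher-correction step, assuming a toy PyTorch teacher model and a hypothetical data loader (names are illustrative, not the paper's code): the teacher is briefly fine-tuned on the distillation corpus so its output distribution matches the data the pruned student will be trained on.

```python
# Hypothetical sketch of teacher correction: briefly fine-tune the teacher
# on the distillation corpus before using it to supervise the pruned student.
import torch
import torch.nn.functional as F

def teacher_correction(teacher, corr_loader, lr=1e-5, max_steps=100):
    """Lightly fine-tune the teacher so its predictions match the new
    data distribution used for distillation."""
    opt = torch.optim.AdamW(teacher.parameters(), lr=lr)
    teacher.train()
    for step, (tokens, targets) in enumerate(corr_loader):
        logits = teacher(tokens)                          # (batch, seq, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step + 1 >= max_steps:
            break
    teacher.eval()
    return teacher
```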
Generalist instruct and tool-use models created by fine-tuning Llama 3.1, with strong reasoning and creative abilities, designed to follow prompts neutrally without moral judgment or personal opinions.
A suite of pre-trained models for code optimization built on Code Llama, available in two sizes (7B and 13B) and trained on LLVM-IR and assembly code; the models optimize compiler intermediate representations, assemble and disassemble code, and achieve high accuracy at optimizing code size and at disassembling x86_64 and ARM assembly back into LLVM-IR.
Two foundation language models, AFM-on-device (a ~3B-parameter model) and AFM-server (a larger server-based model), designed to power Apple Intelligence features efficiently, accurately, and responsibly, with a focus on Responsible AI principles that prioritize user empowerment, representation, design with care, and privacy protection.
Prunes an existing Nemotron model and re-trains it with a fraction of the original training data, achieving compression factors of 2-4×, compute cost savings of up to 40×, and improved performance on various language modeling tasks.
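A minimal sketch of the activation-based width pruning idea, assuming a hypothetical MLP block with `up_proj`/`down_proj` linear layers and a small calibration loader of hidden states: hidden neurons are scored by mean activation magnitude and only the top fraction is kept before re-training.

```python
# Hypothetical sketch of importance-based width pruning (not Nvidia's code):
# score MLP neurons on a calibration set, then slice away the weakest ones.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mlp_neuron_importance(mlp, calib_loader, num_batches=8):
    """Mean activation magnitude of each hidden neuron over calibration data."""
    scores, seen = None, 0
    for hidden_states in calib_loader:                    # (batch, seq, d_model)
        acts = F.gelu(mlp.up_proj(hidden_states))         # (batch, seq, d_ff)
        batch_score = acts.abs().mean(dim=(0, 1))         # (d_ff,)
        scores = batch_score if scores is None else scores + batch_score
        seen += 1
        if seen >= num_batches:
            break
    return scores / seen

@torch.no_grad()
def prune_mlp(mlp, scores, keep_ratio=0.5):
    """Keep the top-scoring neurons and slice the projection weights to match."""
    d_keep = int(scores.numel() * keep_ratio)
    keep = scores.topk(d_keep).indices.sort().values
    mlp.up_proj.weight.data = mlp.up_proj.weight.data[keep, :]
    if mlp.up_proj.bias is not None:
        mlp.up_proj.bias.data = mlp.up_proj.bias.data[keep]
    mlp.down_proj.weight.data = mlp.down_proj.weight.data[:, keep]
    mlp.up_proj.out_features = d_keep
    mlp.down_proj.in_features = d_keep
    return mlp
```

After pruning, the shrunken model is re-trained with distillation from the original model rather than from scratch, which is where the large compute savings come from.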
Additional experiments on adding multimodal capabilities to Llama 3.
A family of multilingual language models ranging from 8B to 405B parameters, trained on a massive dataset of 15T tokens and achieving comparable performance to leading models like GPT-4 on various tasks.
A family of small models with 135M, 360M, and 1.7B parameters that uses Grouped-Query Attention (GQA), embedding tying, and a 2048-token context length, trained on a new open-source, high-quality dataset.
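A minimal sketch of two of those design choices, grouped-query attention and embedding tying, in plain PyTorch (the dimensions below are illustrative, not the exact configuration of these models):

```python
# Sketch of grouped-query attention and embedding tying with illustrative sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAAttention(nn.Module):
    def __init__(self, d_model=576, n_q_heads=9, n_kv_heads=3):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.h_q, self.h_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)

    def forward(self, x):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.h_q, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.h_kv, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.h_kv, self.d_head).transpose(1, 2)
        # Each group of query heads shares one key/value head.
        k = k.repeat_interleave(self.h_q // self.h_kv, dim=1)
        v = v.repeat_interleave(self.h_q // self.h_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1))

# Embedding tying: the output head reuses the input embedding matrix,
# which saves a full vocab-by-hidden weight matrix in a small model.
vocab, d_model = 49152, 576
embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)
lm_head.weight = embed.weight

y = GQAAttention()(torch.randn(2, 16, d_model))   # (2, 16, 576)
```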
A Mistral-7B fine-tuned through Generative Teaching on synthetic data produced by the proposed AgentInstruct framework, which generates both the prompts and the responses using only raw data sources such as text documents and code files as seeds.
Uses interleaved local-global attention and grouped-query attention, and is trained with knowledge distillation instead of next-token prediction to achieve performance competitive with much larger models.
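A minimal sketch of that distillation objective, assuming hypothetical `student` and `teacher` callables that return per-token logits: the student is trained to match the teacher's full next-token distribution rather than only the one-hot next token.

```python
# Sketch of token-level knowledge distillation (illustrative, not Google's code).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) averaged over the batch."""
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")

def train_step(student, teacher, tokens, optimizer):
    with torch.no_grad():
        teacher_logits = teacher(tokens)          # (batch, seq, vocab)
    student_logits = student(tokens)
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```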
A 15B-parameter multilingual language model from Nvidia, trained on 8T text tokens.
A family of multilingual language models supporting 23 languages, designed to balance breadth and depth by allocating more capacity to fewer languages during pre-training.
A family of code models ranging from 3B to 34B parameters, trained on 3.5-4.5T tokens of code spanning 116 programming languages.
A more lightweight variant of Gemini 1.5 Pro, designed for efficiency with minimal regression in quality, making it suitable for applications where compute resources are limited.
A fully open language model designed to enhance accuracy while using fewer parameters and pre-training tokens. Uses a layer-wise scaling strategy that allocates smaller dimensions to early layers and expands them in later layers.
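A minimal sketch of layer-wise scaling with illustrative hyperparameters (not the exact OpenELM configuration): the attention and feed-forward scales grow linearly with depth, so early layers are narrower than later ones.

```python
# Sketch of layer-wise scaling: per-layer head counts and FFN widths grow
# linearly with depth instead of being uniform (illustrative values only).
def layer_wise_dims(num_layers=16, d_model=1280, head_dim=64,
                    alpha=(0.5, 1.0), beta=(0.5, 4.0)):
    """Return (num_heads, ffn_dim) for each transformer layer."""
    dims = []
    for i in range(num_layers):
        t = i / max(num_layers - 1, 1)                  # 0 -> 1 over depth
        a = alpha[0] + t * (alpha[1] - alpha[0])        # attention scale
        b = beta[0] + t * (beta[1] - beta[0])           # FFN scale
        num_heads = max(1, int(a * d_model / head_dim))
        ffn_dim = int(b * d_model)
        dims.append((num_heads, ffn_dim))
    return dims

for layer, (heads, ffn) in enumerate(layer_wise_dims()):
    print(f"layer {layer:2d}: heads={heads:2d}  ffn_dim={ffn}")
```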
A series of language models trained on heavily filtered web data and synthetic data, achieving performance comparable to much larger models like Mixtral 8x7B and GPT-3.5.
Introduces Selective Language Modeling, which optimizes the loss only on tokens that align with a desired distribution, using a reference model to score and select the tokens.
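A minimal sketch of that selective loss, assuming hypothetical `model` and `ref_model` callables that return logits: both models score every token, and the gradient flows only through the tokens with the largest excess loss over the reference.

```python
# Sketch of Selective Language Modeling (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def selective_lm_loss(model, ref_model, tokens, keep_ratio=0.6):
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    # Per-token cross-entropy under the model being trained.
    logits = model(inputs)
    token_loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                                 reduction="none").view_as(targets)
    # Per-token cross-entropy under the frozen reference model.
    with torch.no_grad():
        ref_logits = ref_model(inputs)
        ref_loss = F.cross_entropy(ref_logits.flatten(0, 1), targets.flatten(),
                                   reduction="none").view_as(targets)
    # "Excess loss": keep tokens the model still gets wrong relative to the
    # reference; drop noisy or already-learned tokens.
    excess = (token_loss - ref_loss).detach()
    k = max(1, int(excess.numel() * keep_ratio))
    threshold = excess.flatten().topk(k).values.min()
    mask = (excess >= threshold).float()
    return (token_loss * mask).sum() / mask.sum()
```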
Based on Griffin, it uses a combination of linear recurrences and local attention instead of global attention to model long sequences efficiently.
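A minimal sketch of a gated diagonal linear recurrence, the kind of block this architecture interleaves with local attention (simplified for clarity; not the actual RG-LRU used in the model):

```python
# Sketch of a gated linear recurrence: a fixed-size state replaces the
# ever-growing KV cache of global attention (simplified, illustrative).
import torch
import torch.nn as nn

class LinearRecurrence(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)
        self.inp = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        b, s, d = x.shape
        h = x.new_zeros(b, d)
        a = torch.sigmoid(self.gate(x))        # per-step decay in (0, 1)
        u = self.inp(x)
        outputs = []
        for t in range(s):                     # O(seq) scan, constant-size state
            h = a[:, t] * h + (1 - a[:, t]) * u[:, t]
            outputs.append(h)
        return torch.stack(outputs, dim=1)

y = LinearRecurrence()(torch.randn(2, 128, 256))
print(y.shape)   # torch.Size([2, 128, 256])
```

Because the recurrent state has a fixed size, memory does not grow with sequence length the way a full attention KV cache does.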
Open code models built on the Gemma models by further training on over 500 billion tokens of primarily code.