Papers Explained 283: Tulu V3
TÜLU 3 is a family of fully open, state-of-the-art post-trained models. It includes data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3 builds on Llama 3.1 base models and achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5 Haiku. The training algorithms for the models include supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and a novel method called Reinforcement Learning with Verifiable Rewards (RLVR).
The models and datasets are available at HuggingFace.
Overview
The TÜLU 3 effort began by identifying key capabilities that are desirable for generalist language models but where open post-training recipes often fall behind.
- Data Curation: A variety of prompts are curated and allocated across multiple stages of optimization. New synthetic prompts are created, or prompts are sourced from existing datasets when available, to target specific capabilities.
- Supervised Fine Tuning: Supervised finetuning (SFT) is performed on carefully selected prompts and completions. Through thorough experimentation, the final SFT data and training hyperparameters are determined to enhance target core skills without significantly impacting the performance of others.
- Preference Tuning: Preference tuning, specifically DPO, is applied to newly curated on-policy synthetic preference data built from selected prompts, along with off-policy data. As in the SFT stage, the best preference data mix is identified through thorough experimentation, uncovering which data formats, methods, and hyperparameters lead to improvements.
- Reinforcement Learning with Verifiable Rewards: A new RL-based post-training stage trains the model on verifiable rewards instead of a reward model, as is common for traditional RLHF PPO training. Tasks with verifiable outcomes, such as mathematical problem-solving, are selected and rewards are only provided when the model’s generations are verified to be correct. RL is then used to train on these rewards.
TÜLU 3 Data
Sourcing from Public Datasets
The process begins with a broad survey of public datasets, including those annotated by dedicated workers, sourced from real users, and synthesized with models. Each individual dataset is then manually reviewed, and those meeting the following considerations are selected:
Diversity: Datasets that promote diversity are chosen, including WildChat, a large source of real-user interactions with models; Open Assistant, created by volunteer workers for general chatting; No Robots, annotated by expert workers for a broad range of open-ended categories; and FLAN v2, a large compilation of classical NLP tasks. A decontaminated subset of UltraFeedback, a composition of several datasets (FalseQA, UltraChat, Evol-Instruct, FLAN v2) that shows strong performance for general preference tuning, is also included.
Target Skills: OpenMathInstruct and NuminaMath are included for math reasoning, Evol-CodeAlpaca for coding, a subset of Daring-Anteater for precise instruction following, Aya for multilinguality, SciRIFF for scientific literature understanding, and TableGPT for processing table-related tasks.
Synthesizing for Target Skills
To ensure diversity in generation, the persona-driven methodology is followed to generate synthetic data. The key idea is to use different personas (e.g., “A machine learning researcher focused on neural networks”) with a data synthesis prompt (e.g., “create a coding problem”) to steer an LLM to synthesize data with corresponding perspectives. Specifically, the generator is conditioned on ∼250K personas from Persona Hub to produce prompts targeting specific skills such as precise instruction following, math, and coding.
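As an illustration, here is a minimal sketch of persona-driven prompt synthesis, assuming an OpenAI-style client; the `PERSONAS` list and `SYNTHESIS_PROMPT` below are placeholders, not the actual prompts or settings used for TÜLU 3.

```python
# Minimal sketch of persona-driven prompt synthesis (illustrative, not the TÜLU 3 code).
# Assumes the `openai` Python client and an OPENAI_API_KEY in the environment.
import random
from openai import OpenAI

client = OpenAI()

# Hypothetical persona list standing in for the ~250K Persona Hub personas.
PERSONAS = [
    "A machine learning researcher focused on neural networks",
    "A high-school math teacher preparing word problems",
]

SYNTHESIS_PROMPT = (
    "Create a coding problem that this persona would plausibly ask. "
    "Return only the problem statement.\n\nPersona: {persona}"
)

def synthesize_prompt(persona: str, model: str = "gpt-4o") -> str:
    """Ask the generator model for one skill-targeted prompt from a persona's perspective."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SYNTHESIS_PROMPT.format(persona=persona)}],
        temperature=1.0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(synthesize_prompt(random.choice(PERSONAS)))
```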
Precise instruction following is the ability to follow verifiable instructions in natural language, such as “write 3 paragraphs,” that can be automatically verified with heuristics. Verifiable instructions covering 25 different constraint types defined in IFEval benchmark are generated. Manually writing 1–2 example instructions per constraint results in a total of 33 verifiable instructions which are used as seed prompts. New instructions are then generated using GPT-4o given a data synthesis prompt, persona, and a single verifiable instruction as an example. In total, 29,980 verifiable instruction-response pairs are collected and called IF-PERSONA-SFT.
For math and coding, GPT-4o is zero-shot prompted to generate problems that are unique and specific to a given persona. Multi-step math solutions are then generated using GPT-4o, and Python programs using Claude 3.5 Sonnet. In total, ∼220K math reasoning instances and 35K coding instances are collected.
It is crucial to ensure models can reliably reject unsafe queries and appropriately handle nuanced or out-of-scope ones. A set of noncompliance queries that the model ought not to comply with is curated, alongside safety-related direct and adversarial prompts covering both benign and harmful scenarios. Noncompliance prompts are obtained from a contextual noncompliance taxonomy spanning multiple categories, including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). Safety-related prompts are carefully selected among synthetic adversarial prompts, synthetic vanilla (direct) requests, real-world user-LLM interactions (In-The-Wild), and curated annotator-written examples to maximize coverage, diversity, and balance.
Prompt Decontamination
Experiments were conducted with full-string, n-gram, and embedding-based matching. N-gram matching yielded the most useful results. While embedding-based methods can in principle identify non-trivial contamination like that due to paraphrasing, it was difficult to distinguish mere distributional similarity from actual paraphrasing.
For each token in a test instance, it is considered to match a token in a train instance if the two instances share an 8-gram containing that token. A test instance itself is considered to have significant overlap with a train instance if more than 50% of the test tokens have 8-gram matches with the same training instance.
A training set is considered contaminated if its instances overlap with more than 2% of the instances in any evaluation in the development and unseen suites. All training sets contaminated with the unseen evaluations were removed.
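A minimal sketch of this matching rule, assuming whitespace tokenization for simplicity (the actual tokenizer and pipeline follow the setup described above):

```python
# Sketch of 8-gram overlap detection between a test instance and a train instance.
# Whitespace tokenization is an assumption made for brevity.

def ngrams(tokens: list[str], n: int = 8) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_text: str, train_text: str, n: int = 8) -> float:
    """Fraction of test tokens covered by an n-gram shared with the train instance."""
    test_tokens = test_text.split()
    shared = ngrams(test_tokens, n) & ngrams(train_text.split(), n)
    matched = [False] * len(test_tokens)
    for i in range(len(test_tokens) - n + 1):
        if tuple(test_tokens[i:i + n]) in shared:
            for j in range(i, i + n):
                matched[j] = True
    return sum(matched) / max(len(test_tokens), 1)

def is_significant_overlap(test_text: str, train_text: str) -> bool:
    # A test instance significantly overlaps a train instance if >50% of its tokens match.
    return overlap_fraction(test_text, train_text) > 0.5
```

A training set would then be flagged if more than 2% of any evaluation's instances are marked by `is_significant_overlap` against some training instance.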
Supervised Fine Tuning
For prompts with existing responses, the original response is kept if it was written by a human or a frontier model, like GPT-4o. Empty responses and responses that contain information about models or their developers are additionally filtered out. If a set of prompts does not have responses (e.g., the Persona prompts), or if the original responses come from a weaker model (e.g., WildGuardMix), new responses are generated using GPT-4o. Hardcoded prompts have hand-written responses.
To develop the SFT mix, skills for which Llama 3.1 trained on TÜLU 2 lagged behind state-of-the-art models were first identified. Targeting each of these skills in isolation, high-quality publicly available and synthetic datasets were collected.
These mixtures were then combined to create the initial TÜLU 3 preview mix. The mixture was then iterated on by adding or removing datasets to improve lagging skills, decontaminating against evaluations and downsampling particularly large datasets.
Key Data Experiments
TÜLU 3 8B SFT and TÜLU 3 70B SFT models are compared against other SFT-only models trained on Llama 3 8B or 70B. A series of controlled experiments were run after developing the final SFT mix to explore the importance of different decisions made during data mixing and training.
- The new TÜLU 3 SFT mix significantly outperforms the previous TÜLU 2 mix and other competitive 8B SFT models.
- Removing WildChat, a source of diverse chat data, leads to a small but noticeable performance degradation across multiple skills, particularly on AlpacaEval. This highlights the importance of diverse real-world data.
- Safety-specific SFT data is largely orthogonal to other datasets in the mix. Removing it primarily affects safety performance, while other skills remain relatively unchanged. Contrastive prompts help prevent over-refusal of safe prompts.
- New Persona datasets targeting mathematics, coding, and instruction following improve performance on HumanEval(+), GSM8K, MATH, and IFEval. Removing these datasets leads to a drop in performance on these metrics.
- Data targeting specific skills, such as mathematics, significantly improves performance on relevant metrics (GSM8K and MATH). Removing mathematics-specific data leads to a substantial drop in performance on these metrics.
- Model performance generally improves with increasing amounts of SFT data, with notable gains on GSM8K. However, TruthfulQA performance decreases as the data size increases. Further increases in SFT data size were not explored because the remaining prompts were allocated to preference optimization.
Key Training Experiments
Base Model Choice: Both larger model size (70B vs. 8B) and domain-specific pretraining data (Qwen 2.5 Math vs. Qwen 2.5) significantly improved performance on GSM8K and MATH datasets.
Chat Template Variation: Replacing newlines with eos tokens in the chat template yielded the best performance, but this approach was not adopted due to potential inconsistencies in later post-training stages. Removing the newline at the end of the template also had some impact; the best-performing template is included in the codebase.
Random Seeds and Model Soups: Performance varied noticeably across different random seeds. The best performing model soup did not consistently outperform the best single model trained with a specific seed. Therefore, the best single SFT run for each model size was selected for the final models.
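For reference, a minimal sketch of the uniform weight averaging ("model souping") referred to above, assuming PyTorch checkpoints that are plain state dicts with identical architectures; the souping procedure used in the paper may differ in its details.

```python
# Minimal uniform "model soup": average the parameters of models trained with different seeds.
# Assumes all checkpoints share the same architecture and state_dict keys.
import torch

def uniform_soup(state_dicts: list[dict]) -> dict:
    souped = {}
    for key in state_dicts[0]:
        # .float() so integer buffers can be averaged; dtype of such buffers will change.
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        souped[key] = stacked.mean(dim=0)
    return souped

# Usage (assuming each file is a raw state dict):
# model.load_state_dict(uniform_soup([torch.load(p) for p in checkpoint_paths]))
```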
Preference Fine Tuning
The data creation pipeline consists of three stages: prompt selection, response generation from a pool of models, and preference annotation with LLM-as-a-judge to create (preferred, rejected) pairs.
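A minimal sketch of the third stage, preference annotation with an LLM judge; the exact judge prompt and rating scheme used for TÜLU 3 are not reproduced here, and `judge` is any callable that returns the judge model's text output.

```python
# Sketch of LLM-as-a-judge preference annotation: given a prompt and two candidate
# responses, ask a judge model which is better and emit a (chosen, rejected) pair.
from typing import Callable, Optional

JUDGE_TEMPLATE = """Which response answers the user prompt better? Reply with "A" or "B".

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}"""

def annotate_preference(
    prompt: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str], str],
) -> Optional[dict]:
    verdict = judge(JUDGE_TEMPLATE.format(
        prompt=prompt, response_a=response_a, response_b=response_b
    )).strip().upper()
    if verdict.startswith("A"):
        return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
    if verdict.startswith("B"):
        return {"prompt": prompt, "chosen": response_b, "rejected": response_a}
    return None  # unparseable verdict: drop the pair rather than guess
```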
Key Findings of Data Ablations
- Scaling with Unique Prompts: Increasing the number of unique prompts improves downstream DPO performance across several metrics.
- Scaling with Duplicated Prompts: Duplicating prompts with different responses does *not* significantly improve DPO performance, and may even degrade it. Investing in unique prompts is more effective.
- Unused vs. Reused Prompts: Using new, unused prompts from the same datasets as the SFT (Supervised Fine-Tuning) stage leads to slightly better DPO performance than reusing SFT prompts. Combining unused and reused prompts yields the best results.
- On-Policy Data: Including on-policy data (text generations from the SFT model) improves downstream DPO performance compared to using only off-policy data.
- LLM Judge Comparison: GPT-4o, Llama 3.1 (405B), and GPT-4 turbo perform similarly as preference judges, with GPT-4o slightly ahead. GPT-4o was chosen for the synthetic pipeline due to ease of use, cost, and batch inference speed.
- Beyond Ultrafeedback: The TÜLU 3 preference mix significantly outperforms training on Ultrafeedback alone, especially for the 70B model. This suggests the limitations of using existing preference datasets with less capable models’ completions. Helpsteer2 also performs worse than the best mix.
- Persona Preference Data: Only the Persona IF (Instruction Following) preference data improves average and targeted IFEval scores. Persona Math and Code data do not improve their respective evaluations and slightly harm the average score. Only Persona IF was included in the final mix.
- Targeting Instruction Following: Several methods were explored: Persona IF (rewriting prompts to relax constraints), IF-augmented (combining instructions with constraints), and WildChat IF (using prompts with constraints from WildChat). Persona IF significantly improves IFEval scores. IF-augmented-verified (a filtered version of IF-augmented) offers a smaller improvement. Combining both yields the best IFEval performance.
- WildChat Data: Adding WildChat preference data generally improves DPO performance. Reusing WildChat prompts from SFT training performs better than combining reused and unused prompts.
- Regenerated Preference Datasets: Regenerating completions and preferences for existing datasets (Helpsteer2, Ultrafeedback, MultiPref) using the synthetic pipeline improves downstream DPO performance compared to the original datasets.
Reinforcement Learning with Verifiable Rewards
RLVR, a novel method for training language models, leverages the existing RLHF objective but replaces the reward model with a verification function.
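Concretely, the KL-regularized objective stays the same as in standard RLHF, with the learned reward model swapped for a verification function that pays out a fixed reward only when the answer verifies. The notation below is a sketch, not copied from the paper:

```latex
% KL-regularized RLVR objective (sketch): the reward model is replaced by a verifier v.
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\Big[ v(x, y) - \beta \, \mathrm{KL}\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big] \Big],
\qquad
v(x, y) =
\begin{cases}
\alpha & \text{if } y \text{ is verified correct for } x,\\
0 & \text{otherwise.}
\end{cases}
```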
Data for RLVR is created by obtaining prompts with an accompanying binary verifier. The focus is on two domains (mathematics and exact instruction following) and three evaluations (GSM8K, MATH, IFEval), for which relatively straightforward verification methods exist.
For GSM8K, each sample is augmented with the standard 8-shot prompt used during evaluation to encourage the model to use chain-of-thought. The final number produced is then extracted and compared to the ground-truth label to determine correctness.
In MATH, similar to GSM8K, each sample is augmented with the standard 3-shot CoT prompt to encourage chain-of-thought generation during evaluation. The answer is then extracted and correctness is determined following the ‘flex’ MATH evaluation logic.
For IFEval, instructions randomly sampled from the TÜLU 2 SFT mix are combined with constraints. Each constraint template has a verification function that checks whether a completion satisfies the constraint.
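A minimal sketch of what such binary verifiers look like, with a GSM8K-style final-answer check and one constraint check in the spirit of IFEval ("write at least N paragraphs"). Real answer extraction and the actual constraint templates are more involved than shown here.

```python
# Sketches of binary verifiers used as RLVR rewards (illustrative, simplified).
import re

def final_number(text: str) -> str | None:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def verify_gsm8k(completion: str, ground_truth: str) -> bool:
    """Binary reward: the last number in the completion matches the last number in the label."""
    predicted, target = final_number(completion), final_number(ground_truth)
    return predicted is not None and predicted == target

def verify_min_paragraphs(completion: str, num_paragraphs: int) -> bool:
    """IFEval-style constraint check: the response must contain at least N paragraphs."""
    paragraphs = [p for p in completion.split("\n\n") if p.strip()]
    return len(paragraphs) >= num_paragraphs
```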
With these prompts and verification functions, models are trained via Proximal Policy Optimization (PPO).
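Within the PPO loop, the only change relative to standard RLHF is how the scalar reward for a rollout is computed; a hedged sketch is below (the reward magnitude and the KL handling inside the paper's PPO implementation are not reproduced).

```python
# Sketch of the per-rollout reward in RLVR: a fixed bonus when the verifier passes, zero otherwise.
# `verifier` is a binary function like those above; VERIFIABLE_REWARD is an assumed placeholder value.
from typing import Callable

VERIFIABLE_REWARD = 10.0  # placeholder, not necessarily the paper's setting

def rlvr_reward(
    completion: str,
    ground_truth: str,
    verifier: Callable[[str, str], bool],
) -> float:
    return VERIFIABLE_REWARD if verifier(completion, ground_truth) else 0.0
```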
Key Findings
- RLVR improves performance: Training with RLVR leads to improved test performance across all three settings, outperforming the initial model in each evaluation. Verifiable rewards (correctness on the training set) also consistently improve. Increased KL budget doesn’t always correlate with improved verifiable rewards in GSM8K and MATH.
- Starting point influences KL divergence but not final verifiable rewards: Both weaker (SFT) and stronger (DPO) initial models can achieve the same level of verifiable rewards, but the weaker model requires a larger KL divergence. However, starting from a stronger model generally results in better test set performance.
- Initializing from a general RM is beneficial: Initializing RLVR’s value function from a general reward model achieves the highest GSM8K test score and higher average scores, highlighting the importance of the value function.
- Verifiable rewards alone are superior: Using only verifiable rewards outperforms using scores from the reward model, as the latter introduces noise.
Paper
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training (arXiv: 2411.15124)