Papers Explained 282: Tulu V2

Ritvik Rastogi
6 min read · Jan 7, 2025


Since the release of TÜLU, open resources for instruction tuning have advanced quickly, from better base models to new finetuning techniques. TÜLU 2 incorporates these advances.

TÜLU 2 releases:

  • TÜLU-V2-mix, an improved collection of high-quality instruction datasets
  • TÜLU 2, LLAMA-2 models finetuned on the V2 mixture
  • TÜLU 2+DPO, TÜLU 2 models trained with direct preference optimization (DPO), including the largest DPO-trained model to date (TÜLU 2+DPO 70B)
  • Code TÜLU 2, Code LLAMA models finetuned on the V2 mix.

TÜLU V2

For TÜLU V2, all available LLAMA-2 sizes (7B, 13B, and 70B) and Code LLAMA sizes (7B, 13B, and 34B) are finetuned. The V2 mixture, TÜLU-V2-mix, comprises data from the following sources:

  • FLAN: 50,000 examples sampled from FLAN v2.
  • CoT: Another 50,000 examples sampled from the CoT subset of the FLAN v2 mixture to emphasize chain-of-thought (CoT) reasoning.
  • Open Assistant 1: The highest-scoring paths in each conversation tree are used, resulting in 7,708 examples. Scores are taken from the quality labels provided by the original annotators of Open Assistant 1.
  • ShareGPT: All 114,046 examples from the processed ShareGPT dataset are used, as including the ShareGPT dataset resulted in strong performance in prior work.
  • GPT4-Alpaca: 20,000 examples are sampled from GPT-4 Alpaca to further include distilled GPT-4 data.
  • Code-Alpaca: All 20,022 examples from Code Alpaca are used, following the prior V1 mixture, in order to improve model coding abilities.
  • LIMA: 1,030 examples from LIMA are used as a source of carefully curated data.
  • WizardLM Evol-Instruct V2: 30,000 examples are sampled from WizardLM, which contains distilled data of increasing diversity and complexity.
  • Open-Orca: 30,000 examples generated by GPT-4 from OpenOrca are used. OpenOrca is a reproduction of Orca, which augments FLAN data with additional model-generated explanations.
  • Science literature: 7,544 examples from a mixture of scientific document understanding tasks, including question answering, fact-checking, summarization, and information extraction.
  • Hardcoded: A collection of 140 manually written samples, using prompts such as ‘Tell me about yourself’, so that the model generates correct outputs when asked about its name or developers.

After filtering, the V2 mixture consists of 326,154 samples, compared to 490,445 in the V1 mixture. The context length during training is expanded from a maximum of 2,048 tokens to 8,192 tokens.
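As a rough illustration of the mixture's composition, the sketch below downsamples each source to the counts listed above and concatenates the results. The `load_source` helper and source keys are hypothetical placeholders, not the authors' actual data pipeline.

```python
import random

def load_source(name):
    # Hypothetical placeholder: in practice this would return the formatted
    # instruction examples for the named dataset (e.g. loaded via Hugging Face datasets).
    return []

# Target sizes from the mixture description above; None means "use all examples".
SAMPLE_SIZES = {
    "flan_v2": 50_000,
    "flan_cot": 50_000,
    "open_assistant_1": None,    # 7,708 top-scoring conversation paths
    "sharegpt": None,            # all 114,046 examples
    "gpt4_alpaca": 20_000,
    "code_alpaca": None,         # all 20,022 examples
    "lima": None,                # all 1,030 examples
    "wizardlm_evol_instruct": 30_000,
    "open_orca": 30_000,
    "science_literature": None,  # all 7,544 examples
    "hardcoded": None,           # 140 hand-written identity prompts
}

def build_v2_style_mixture(seed=42):
    random.seed(seed)
    mixture = []
    for name, k in SAMPLE_SIZES.items():
        examples = load_source(name)
        if k is not None and len(examples) > k:
            examples = random.sample(examples, k)  # downsample large sources
        mixture.extend(examples)
    random.shuffle(mixture)
    return mixture
```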

The direct preference optimization (DPO) algorithm is used due to the simplicity of its implementation. For DPO training, the Zephyr-Beta approach is followed: training on a filtered and binarized form of UltraFeedback for three epochs. A low learning rate, 5 × 10⁻⁷, is required for stable and effective DPO training. This significantly improves performance on open-ended generation evaluations.
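For reference, below is a minimal sketch of the DPO objective that such training optimizes, not the authors' implementation. Each input is a per-sequence log-probability (token log-probs summed over the response) under the policy or the frozen reference model, and `beta` controls the strength of the implicit KL constraint.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct preference optimization loss for a batch of preference pairs."""
    # Implicit rewards: scaled log-ratios of policy vs. reference probabilities.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice this loss would be minimized over the binarized UltraFeedback preference pairs with a small learning rate (the post notes 5 × 10⁻⁷).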

QLoRA training is experimented with at the instruction tuning stage to determine if compute demands can be reduced without reducing performance. Due to sub-par performance at the instruction tuning stage, QLoRA is not explored during RLHF training.
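As a rough sketch of what QLoRA instruction tuning involves (the base model's weights are frozen in 4-bit precision and only low-rank adapters are trained), here is a minimal setup using Hugging Face transformers, peft, and bitsandbytes. The model name, rank, and target modules are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base weights to 4-bit NormalFloat, as in the QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Illustrative base model; the paper finetunes LLAMA-2 / Code LLAMA checkpoints.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach trainable low-rank adapters; rank and target modules are assumptions.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable
```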

Evaluation

Overall Results

The evaluation metrics of the core TÜLU-2 suite and its peers.
  • TÜLU-2 70B outperforms all open-source models on average across the seven benchmark tasks.
  • TÜLU-2 70B is the top-performing open-source model in three out of seven tasks. In the remaining four tasks, its performance is very close to the best-performing model (within 1% on average).
  • TÜLU-2 is competitive with GPT-3.5-0301, showing similar or better performance in several tasks.
  • A significant performance gap exists between TÜLU-2 and GPT-4, and a moderate gap with GPT-3.5-turbo-0613.
  • Larger TÜLU-2 models generally perform better, demonstrating a positive scaling trend.

TÜLU V1 vs V2 Data Mixtures

Results of LLAMA-2 models finetuned on the V1 and V2 data mixtures, and ShareGPT.
  • Models trained on the V2 mix generally outperform models trained on the V1 mix across multiple evaluation metrics (BBH, Codex-Eval, AlpacaEval, TruthfulQA), except for GSM8k and TydiQA, suggesting potential trade-offs in specific capabilities (e.g., multilingual capabilities).
  • The V2 mix consistently outperforms training solely on ShareGPT, likely due to the V2 mix’s inclusion of distilled datasets with similar origins to ShareGPT.
  • The performance improvement gained by using the V2 mix decreases with increasing model size, suggesting diminishing returns of data quality improvements as model size grows.

Scaling DPO Training

Evaluation results for TÜLU V2 models with and without DPO finetuning.
MT-Bench and AlpacaEval results, along with average output length of AlpacaEval responses.
  • DPO training significantly improves performance on AlpacaEval and MT-Bench, particularly for larger models (13B and 70B). TÜLU 2+DPO 70B achieves state-of-the-art performance among open models on MT-Bench and is second-best on AlpacaEval.
  • DPO training scales effectively to 70B parameter models, demonstrating stable training and performance improvements even at large scales. TÜLU 2+DPO 70B is the largest publicly released DPO-trained model.
  • DPO training does not negatively impact most other metrics (MMLU, BBH, GSM8k), suggesting it doesn’t broadly alter model capabilities.
  • DPO training significantly reduces multilingual capabilities (TydiQA), likely due to the lack of multilingual data in the training datasets. This suggests incorporating multilingual data in future training could mitigate this issue.
  • DPO training increases model verbosity, a common observation in RLHF training, but the increase is less pronounced than in other open-weight models.

Parameter-efficient Fine Tuning

Results from LLAMA-2 models finetuned with and without QLoRA on the V2 mix.
  • QLoRA underperforms full finetuning on open-ended generation tasks, as measured by AlpacaEval.
  • However, the performance gap between QLoRA and full finetuning decreases as model size increases, suggesting potential parity at larger scales.

Improving Code Performance with Code LLAMA

Evaluation results comparing models based on Code LLAMA with the TÜLU models.
  • CODE TÜLU 2 models significantly outperformed TÜLU 2 models on coding tasks, with the smallest CODE TÜLU 2 model (7B) matching the performance of the largest TÜLU 2 model (70B) on Codex-Eval. This demonstrates the benefit of using smaller, domain-specific models for coding tasks.
  • CODE TÜLU 2 and TÜLU 2 showed drastically different performance across non-coding tasks. TÜLU 2 outperformed CODE TÜLU 2 on 4 out of 8 non-coding benchmarks (MMLU, GSM8k, AlpacaEval, TruthfulQA), while CODE TÜLU 2 performed better on others (BBH, TydiQA, ToxiGen, Codex-Eval). There is a significant drop in AlpacaEval performance for CODE TÜLU 2 (around 20%).
  • CODE TÜLU 2 models outperformed both base CODE LLAMA and CODE LLAMA-Instruct models in 5 out of 8 evaluation settings, highlighting the effectiveness of the V2 data mixture. The superior performance of CODE LLAMA-Instruct on AlpacaEval suggests that the V2 mixture might prioritize general open-ended queries over specific model capabilities.

Paper

Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2 (arXiv:2311.10702)

Hungry for more insights?

Don’t miss out on exploring other fascinating threads in this series. Simply click here and uncover the state-of-the-art research!

Do Subscribe for weekly updates!!
