Papers Explained 184: Instruction Pretraining

Instead of directly pre-training on raw corpora, Instruction Pre-Training augments each text from the raw corpora with a set of instruction-response pairs generated by an instruction synthesizer, then pre-trains LMs on the augmented corpora.

200M instruction-response pairs covering 40+ task categories are synthesized in this work to verify the effectiveness of Instruction Pre-Training.
The models are available at HuggingFace.
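
To make the idea concrete, here is a minimal sketch of how a raw text could be combined with synthesized instruction-response pairs into a single instruction-augmented training example. The Q/A-style template and separators are illustrative assumptions, not the paper's exact format.

```python
def augment_text(raw_text: str, qa_pairs: list[tuple[str, str]]) -> str:
    """Concatenate a raw text with its synthesized instruction-response pairs.

    The template and separators here are illustrative assumptions, not the
    paper's exact format.
    """
    parts = [raw_text]
    for instruction, response in qa_pairs:
        parts.append(f"Q: {instruction}\nA: {response}")
    return "\n\n".join(parts)


# Example: one raw text augmented with two synthesized pairs
raw = "The mitochondrion is an organelle that produces most of the cell's ATP."
pairs = [
    ("Which organelle produces most of the cell's ATP?", "The mitochondrion."),
    ("Summarize the passage in one sentence.",
     "It explains that mitochondria generate most cellular energy."),
]
print(augment_text(raw, pairs))
```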
Instruction Synthesizer

The instruction synthesizer is developed by multitask fine-tuning a language model. The tuning data are curated to be highly diverse, enabling the instruction synthesizer to generalize to unseen data. Furthermore, specific designs are incorporated to synthesize both one-shot and few-shot examples for subsequent pre-training.

For each context in the curated datasets, all of its corresponding downstream tasks are gathered; the context is regarded as the raw text and the downstream tasks as the instruction-response pairs. For each dataset, at most 10K examples with the highest number of instruction-response pairs are sampled, to enhance task diversity while preventing any single dataset from predominating.
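
A rough sketch of this curation step, assuming each example is a dict carrying its source dataset, its context, and its downstream instruction-response pairs (the field names are illustrative):

```python
from collections import defaultdict

def curate(examples, max_per_dataset=10_000):
    """Keep, for each source dataset, at most `max_per_dataset` examples,
    preferring those with the most instruction-response pairs."""
    by_dataset = defaultdict(list)
    for ex in examples:   # ex = {"dataset": ..., "context": ..., "pairs": [(instr, resp), ...]}
        by_dataset[ex["dataset"]].append(ex)

    curated = []
    for exs in by_dataset.values():
        exs.sort(key=lambda e: len(e["pairs"]), reverse=True)
        curated.extend(exs[:max_per_dataset])
    return curated
```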

A one-shot example consists of a raw text T_n and a set of instruction-response pairs (I_n, R_n); data denoted without a prime (T_n, I_n, R_n) are used for tuning the instruction synthesizer, and primed data (T′_n, I′_n, R′_n) are used for synthesizer inference and LM pre-training.

During instruction synthesizer tuning, each sequence fed into the synthesizer concatenates multiple one-shot examples sampled from the same dataset. During inference, multi-round inference is conducted to synthesize instruction-response pairs with patterns similar to those of previous rounds.
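
A sketch of both steps is below; `synthesize` is a hypothetical stand-in for a call to the fine-tuned synthesizer, and the Q/A formatting is an assumption rather than the paper's exact prompt.

```python
def build_tuning_sequence(one_shot_examples, sep="\n\n"):
    """Concatenate several one-shot examples (raw text + its pairs), all drawn
    from the same dataset, into one training sequence for the synthesizer."""
    chunks = []
    for ex in one_shot_examples:
        qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in ex["pairs"])
        chunks.append(f"{ex['context']}\n{qa}")
    return sep.join(chunks)


def multi_round_synthesis(texts, synthesize):
    """Multi-round inference: pairs for each text are generated while
    conditioning on earlier texts and their synthesized pairs, so later
    rounds follow patterns similar to earlier ones."""
    history, outputs = [], []
    for text in texts:
        pairs = synthesize(history, text)   # `synthesize` stands in for the model call
        history.append((text, pairs))
        outputs.append((text, pairs))
    return outputs
```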
LM Pre-Training

Except for the pre-training data, Instruction Pre-Training keeps all other pre-training settings the same as Vanilla Pre-Training: training with the next-token prediction objective and computing loss on all tokens.
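
Concretely, "computing loss on all tokens" is the usual causal LM objective with no masking of instruction or response tokens (unlike typical instruction tuning). A minimal PyTorch-style sketch:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Causal LM loss: predict token t+1 from tokens up to t, averaged over
    ALL positions (raw text and synthesized pairs alike; nothing is masked)."""
    shift_logits = logits[:, :-1, :]   # (batch, seq_len - 1, vocab)
    shift_labels = input_ids[:, 1:]    # (batch, seq_len - 1)
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```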
General Pre-Training From Scratch
Considering the large amount of data required for general pre-training from scratch, only a part of the raw corpora is converted into instruction-augmented corpora, leaving the rest unchanged.
Domain-Adaptive Continual Pre-Training
For domain-adaptive continual pre-training, the data requirement is much smaller. Therefore, all raw corpora are converted into instruction-augmented corpora. The instruction-augmented corpora are then mixed with general instructions to benefit from improved prompting ability.
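
A minimal sketch of this mixing step, with the mixing ratio left as a placeholder parameter since the exact value is not stated here:

```python
import random

def mix_with_general_instructions(domain_augmented, general_instructions,
                                  general_ratio=0.5, seed=0):
    """Mix instruction-augmented domain texts with general instruction data.
    `general_ratio` is a placeholder; the actual mixing ratio may differ."""
    rng = random.Random(seed)
    n_general = min(int(len(domain_augmented) * general_ratio), len(general_instructions))
    mixed = list(domain_augmented) + rng.sample(general_instructions, n_general)
    rng.shuffle(mixed)
    return mixed
```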
Experiment Settings
Instruction Synthesizer
The synthesizer is fine-tuned from Mistral-7B-v0.1. During inference, about 5 instruction-response pairs are created per raw text, with each pair containing about 52 tokens.
General Pre-Training From Scratch
The RefinedWeb dataset is randomly sampled to form the raw pre-training corpora, consisting of 200M pieces of text with about 100B tokens.
To create the instruction-augmented corpora, two rounds of instruction synthesis are performed, converting 1/5 of the raw corpora (40M raw texts) into instruction-augmented texts.
Since the amount of fine-tuning data (0.2B tokens) is small compared to that of the raw corpora, its sampling ratio is increased so that it is repeated 4 times throughout pre-training.
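
Putting these numbers together, a back-of-the-envelope sketch of the effective token budget (the synthesized pairs themselves add further tokens on top of this):

```python
# Back-of-the-envelope token accounting for the from-scratch setting above.
raw_tokens          = 100e9     # ~100B tokens from 200M raw texts
converted_fraction  = 1 / 5     # 40M of the 200M texts get synthesized pairs
fine_tuning_tokens  = 0.2e9     # synthesizer fine-tuning data
fine_tuning_repeats = 4         # up-sampled to repeat 4 times

augmented_raw = raw_tokens * converted_fraction
plain_raw     = raw_tokens * (1 - converted_fraction)
effective_ft  = fine_tuning_tokens * fine_tuning_repeats

total = augmented_raw + plain_raw + effective_ft
print(f"~{total / 1e9:.1f}B tokens (before counting the synthesized pairs themselves)")
```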
The Mistral architecture and tokenizer are adopted to implement models at two sizes: 500M and 1.3B parameters.
Domain-Adaptive Continual Pre-Training
Raw corpora from two domains are used: PubMed Abstracts for biomedicine and financial news for finance. The instruction-augmented corpora are then mixed with general instructions.
Llama3-8B is continually pre-trained on each domain.

Evaluation
General Pre-Training From Scratch

- Mix PT showed improved performance on several benchmarks compared to Vanilla PT, indicating that incorporating fine-tuning data into the pre-training process is beneficial.
- Instruct PT achieved even better performance than Mix PT, suggesting that instruction-augmented corpora further enhance model performance.

- Instruction pre-trained models matched or exceeded the performance of other open-source models (Pythia-1B and BLOOM-3B) using fewer tokens during training, demonstrating data efficiency.

- During further instruction tuning, the model pre-trained via Instruction Pre-Training quickly outperforms the model pre-trained via Vanilla Pre-Training and shows a stable upward trend throughout the tuning process.
Domain-Adaptive Continual Pre-Training

- Instruction Pre-Training consistently outperforms Vanilla Pre-Training on almost all domain-specific tasks.
- Continual pre-training with Instruction Pre-Training significantly enhances the domain-specific performance of Llama3-8B, achieving parity with or even surpassing Llama3-70B.
- On the finance NER benchmark, even Llama3-70B underperforms Llama3-8B, suggesting that this benchmark may not be reliable.
Paper
Instruction Pre-Training: Language Models are Supervised Multitask Learners (arXiv:2406.14491)