Papers Explained 130: Phi-3

Ritvik Rastogi
3 min read · Apr 29, 2024

phi-3-mini is a 3.8B language model trained on 3.3T tokens. Its training data is a scaled-up version of the dataset used for phi-2, composed of heavily filtered web data and synthetic data. The model rivals the performance of models such as Mixtral 8x7B and GPT-3.5. Furthermore, 7B and 14B models, called phi-3-small and phi-3-medium, are trained on 4.8T tokens and perform significantly better than phi-3-mini.

The models are available on HuggingFace.
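A quick usage sketch (the model id below is the HuggingFace release of phi-3-mini; the prompt and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # HuggingFace release of phi-3-mini
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

messages = [{"role": "user", "content": "Explain KV caching in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=64)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```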

Recommended Reading [Papers Explained 114: Phi-1] [Papers Explained 115: Phi-1.5] [Papers Explained 116: Phi-2]

Architecture

The phi-3-mini model is built upon a similar block structure as Llama-2 and uses the same tokenizer, with a vocabulary size of 32064. The model uses a hidden dimension of 3072, 32 heads, and 32 layers. It is trained using bfloat16 with a default context length of 4K.
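As a rough illustration, the stated dimensions map onto a Llama-style configuration as follows (a sketch using transformers' LlamaConfig as a stand-in; this is not the official config, and unspecified fields keep their defaults):

```python
from transformers import LlamaConfig

# Llama-2-style block structure with the phi-3-mini dimensions stated above.
config = LlamaConfig(
    vocab_size=32064,              # same tokenizer as Llama-2
    hidden_size=3072,
    num_attention_heads=32,
    num_hidden_layers=32,
    max_position_embeddings=4096,  # default 4K context
    torch_dtype="bfloat16",
)
```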

phi-3-mini-128K is also introduced, extending the context length to 128K via LongRope.

The phi-3-small model (7B) leverages the tiktoken tokenizer for better multilingual tokenization, with a vocabulary size of 100352, and has a default context length of 8K. It follows the standard decoder architecture of the 7B model class, with 32 layers and a hidden size of 4096. To minimize the KV cache footprint, the model leverages grouped-query attention, with 4 queries sharing 1 key. Moreover, it alternates layers of dense attention with a novel blocksparse attention to further optimize KV cache savings while maintaining long-context retrieval performance. An additional 10% multilingual data was also used for this model.
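A minimal sketch of this grouped-query attention layout: with 32 query heads and 4 queries per key/value head, only 8 KV heads need to be cached, shrinking the KV cache 4x (shapes are illustrative and the causal mask is omitted; this is not the actual phi-3-small code):

```python
import torch

batch, seq = 1, 16
n_q_heads, n_kv_heads, head_dim = 32, 8, 128  # hidden 4096 / 32 heads = 128

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)  # cached: 4x fewer heads
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Repeat each KV head 4 times so every query head has a matching KV head.
k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
out = attn @ v  # (1, 32, 16, 128)
```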

The phi-3-medium model (14B) uses the same tokenizer and architecture as phi-3-mini and is trained on the same data for slightly more epochs (4.8T tokens in total, the same as phi-3-small). The model has 40 heads and 40 layers, with an embedding dimension of 5120.

Data

The sequence of works initiated in “Textbooks Are All You Need” is followed, utilizing high-quality training data to improve the performance of small language models and deviate from the standard scaling laws.

In particular, the web data is filtered to contain the correct level of “knowledge”, keeping more web pages that could potentially improve the model’s “reasoning ability”.

Training Methodology

Pre-training is performed in two disjoint and sequential phases:

Phase-1 comprises mostly web sources aimed at teaching the model general knowledge and language understanding.

Phase-2 merges even more heavily filtered web data (a subset of the data used in Phase-1) with some synthetic data that teaches the model logical reasoning and various niche skills.

Post-training of phi-3-mini went through two stages: supervised finetuning (SFT) and direct preference optimization (DPO).

SFT leverages highly curated, high-quality data across diverse domains, e.g., math, coding, reasoning, conversation, model identity, and safety. The SFT data mix starts with English-only examples.

DPO data covers chat format data, reasoning, and responsible AI efforts.
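For reference, a minimal sketch of the standard DPO objective (Rafailov et al., 2023); the function and the beta value are illustrative, not phi-3’s actual training code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the policy vs. the frozen reference model.
    chosen = policy_chosen_logps - ref_chosen_logps
    rejected = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between preferred and rejected responses.
    return -F.logsigmoid(beta * (chosen - rejected)).mean()
```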

The models are chat fine-tuned using the following chat template:
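```
<|user|>
Question <|end|>
<|assistant|>
```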

Evaluation

Benchmark Performance Comparison

Standardized evaluation using few-shot prompts to compare the reasoning performance of phi-3-mini against other models across multiple benchmarks.
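For illustration, a few-shot prompt of this kind can be assembled as follows (a hypothetical sketch; the actual prompts and shot counts follow each benchmark’s standard setup):

```python
def build_few_shot_prompt(shots, question):
    # Prepend solved examples, then pose the new question in the same format.
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in shots)
    return f"{demos}\n\nQ: {question}\nA:"

prompt = build_few_shot_prompt(
    [("What is 2 + 2?", "4"), ("What is 3 * 3?", "9")],
    "What is 5 + 7?",
)
```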

Safety and Harmfulness Evaluation

Evaluation of the safety alignment and harmful content management in post-training enhancements of phi-3-mini.

Paper

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (arXiv: 2404.14219)

Hungry for more insights?

Don’t miss out on exploring other fascinating threads in this series. Simply click here and uncover the state-of-the-art research!

Do Subscribe for weekly updates!!
