Papers Explained 93: TinyLlama

Ritvik Rastogi
3 min read · Jan 22, 2024

TinyLlama is a compact 1.1B-parameter language model built upon the architecture and tokenizer of Llama 2. It is pre-trained on around 1 trillion tokens for approximately 3 epochs and leverages various advances (e.g., FlashAttention) to achieve better computational efficiency.

The model checkpoints and code are publicly available on GitHub.
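For readers who want to try the released checkpoints quickly, below is a minimal sketch using the Hugging Face transformers library; the repository name used here (TinyLlama/TinyLlama-1.1B-Chat-v1.0) is an assumption, so substitute the checkpoint you actually want.

```python
# Minimal sketch: load a TinyLlama checkpoint with Hugging Face transformers.
# The model id below is an assumption; substitute the checkpoint you need.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("TinyLlama is a compact 1.1B language model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```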

Recommended Reading [Papers Explained 55: LLaMA] [Papers Explained 60: Llama 2]

Approach

Pre-training data

A mixture of natural language data and code data is used to pre-train TinyLlama, sourcing natural language data from SlimPajama and code data from Starcoderdata.

SlimPajama is a large open-source corpus derived by cleaning and deduplicating the original RedPajama, which is an open-source research effort aimed at reproducing Llama’s pre-training data.

Starcoderdata was collected to train StarCoder, comprising approximately 250 billion tokens across 86 programming languages.

Combining these two corpora yields approximately 950 billion tokens for pre-training in total. TinyLlama is trained on these tokens for approximately three epochs. During training, the natural language data is sampled to achieve a ratio of around 7:3 between natural language data and code data.
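As a rough illustration of such a mixture, the sketch below interleaves two document streams so that natural language is drawn about 70% of the time and code about 30%; the stand-in corpora and the per-document sampling scheme are assumptions for illustration, not the authors’ actual data pipeline.

```python
import random

# Illustrative sketch: sample documents from two corpora at a ~7:3 ratio.
# The corpus readers below are placeholders, not the authors' actual pipeline.
def mixed_stream(natural_language_docs, code_docs, nl_weight=0.7, seed=0):
    """Yield documents, picking natural language ~70% of the time and code ~30%."""
    rng = random.Random(seed)
    nl_iter, code_iter = iter(natural_language_docs), iter(code_docs)
    while True:
        source = nl_iter if rng.random() < nl_weight else code_iter
        try:
            yield next(source)
        except StopIteration:
            return  # stop when either corpus is exhausted (simplification)

# Toy usage with stand-in corpora.
slimpajama = [f"nl_doc_{i}" for i in range(100)]       # stands in for SlimPajama
starcoderdata = [f"code_doc_{i}" for i in range(100)]  # stands in for Starcoderdata
sample = [doc for _, doc in zip(range(10), mixed_stream(slimpajama, starcoderdata))]
print(sample)
```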

Architecture

Details of model architecture.

A model architecture similar to that of Llama 2 is adopted, with the following details (a minimal sketch of these components follows the list):

  • RoPE (Rotary Positional Embedding) to inject positional information into the model.
  • Pre-normalization: to attain more stable training, the input to each transformer sub-layer is normalized using RMSNorm, which can improve training efficiency.
  • Following Llama 2, SwiGLU is used as the activation function.
  • To reduce memory bandwidth overhead and speed up inference, grouped-query attention is used. There are 32 heads for query attention and 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance.
  • Another critical improvement is the integration of Flash Attention 2, an optimized attention mechanism. The repository also provides fused layernorm, fused cross-entropy loss, and fused rotary positional embedding, which together play a pivotal role in boosting computational throughput.
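The block below is a minimal PyTorch sketch of how these pieces fit together: RMSNorm pre-normalization, rotary embeddings on the queries and keys, grouped-query attention with 32 query heads sharing 4 key-value heads, and a SwiGLU feed-forward. The hidden and feed-forward dimensions (2048 and 5632) are assumptions for the 1.1B configuration, and PyTorch’s scaled_dot_product_attention stands in for the fused FlashAttention-2 kernels; treat it as an illustration, not the authors’ training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a Llama-2-style decoder block. Hidden size 2048, 32 query heads,
# 4 key-value heads, and FFN size 5632 are assumed illustrative dimensions.

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms


def apply_rope(x, base=10000.0):
    # Rotary positional embedding applied to (batch, heads, seq, head_dim).
    b, h, t, d = x.shape
    freqs = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = torch.outer(torch.arange(t, device=x.device).float(), freqs)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)


class DecoderBlock(nn.Module):
    def __init__(self, dim=2048, n_heads=32, n_kv_heads=4, ffn_dim=5632):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        # SwiGLU feed-forward: gate and up projections, then down projection.
        self.w_gate = nn.Linear(dim, ffn_dim, bias=False)
        self.w_up = nn.Linear(dim, ffn_dim, bias=False)
        self.w_down = nn.Linear(ffn_dim, dim, bias=False)
        self.attn_norm = RMSNorm(dim)
        self.ffn_norm = RMSNorm(dim)

    def forward(self, x):
        b, t, _ = x.shape
        # Pre-normalization before the attention sub-layer.
        h = self.attn_norm(x)
        q = self.wq(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = apply_rope(q), apply_rope(k)
        # Grouped-query attention: each group of query heads shares one KV head.
        repeat = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        # scaled_dot_product_attention dispatches to FlashAttention when available.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, -1)
        x = x + self.wo(attn)
        # Pre-normalization before the SwiGLU feed-forward sub-layer.
        h = self.ffn_norm(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))


block = DecoderBlock()
print(block(torch.randn(1, 16, 2048)).shape)  # torch.Size([1, 16, 2048])
```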

Evaluation

Commonsense reasoning tasks

Zero-shot performance on commonsense reasoning tasks.
  • TinyLlama achieved the highest average scores among the evaluated models.

Evolution of performance during training

Evolution of performance in commonsense reasoning benchmarks during pre-training.
  • TinyLlama’s performance improves as the amount of training compute increases.
  • It surpasses Pythia-1.4B in accuracy on most benchmarks.

Problem-solving evaluation

Performance of problem-solving tasks on the InstructEval Benchmark.
  • TinyLlama demonstrates better problem-solving skills compared to existing models.

Paper

TinyLlama: An Open-Source Small Language Model (arXiv:2401.02385)

Recommended Reading [Decoder-Only Language Transformers]

Hungry for more insights?

Don’t miss out on exploring other fascinating threads in this series. Simply click here and uncover the state-of-the-art research!

Do Subscribe for weekly updates!!
