Papers Explained 134: OpenELM

OpenELM is an open language model family from Apple that releases not only the model weights and inference code but also the complete framework for training and evaluating the language model.
It uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy.
The source code, along with pre-trained model weights and training recipes, is available on GitHub.
OpenELM Architecture
OpenELM adopts a decoder-only transformer-based architecture with the following design choices (a minimal code sketch follows the list):
- Learnable bias parameters are not used in any fully-connected (linear) layers
- Pre-normalization is applied using RMSNorm
- Rotary positional embedding (RoPE) is used for encoding positional information
- Grouped query attention (GQA) is used instead of multi-head attention (MHA)
- The feed-forward network (FFN) is replaced with a SwiGLU FFN
- Flash attention is used for computing the scaled dot-product attention
- The tokenizer from LLaMA is used
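To make these choices concrete, here is a minimal PyTorch sketch of one such decoder block. This is not the released CoreNet implementation: RoPE and the KV cache are omitted for brevity, and all module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Pre-normalization via RMSNorm (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms


class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward network; all linear layers are bias-free."""
    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ffn, bias=False)
        self.up_proj = nn.Linear(d_model, d_ffn, bias=False)
        self.down_proj = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class GQADecoderBlock(nn.Module):
    """Decoder block with pre-norm, grouped-query attention, and a SwiGLU FFN.

    RoPE is omitted here for brevity; the real model applies rotary position
    embeddings to the query/key tensors before attention.
    """
    def __init__(self, d_model: int, n_q_heads: int, n_kv_heads: int, ffn_mult: float):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.d_head = d_model // n_q_heads
        self.n_q_heads, self.n_kv_heads = n_q_heads, n_kv_heads
        self.attn_norm = RMSNorm(d_model)
        self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)
        self.ffn_norm = RMSNorm(d_model)
        self.ffn = SwiGLUFFN(d_model, int(ffn_mult * d_model))

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)  # pre-normalization
        q = self.q_proj(h).view(b, t, self.n_q_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(h).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(h).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        # GQA: each key/value head is shared by a group of query heads.
        rep = self.n_q_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        # scaled_dot_product_attention dispatches to flash attention when available.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))
        return x + self.ffn(self.ffn_norm(x))  # pre-norm residual FFN


block = GQADecoderBlock(d_model=768, n_q_heads=12, n_kv_heads=3, ffn_mult=4.0)
print(block(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```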
Layer-wise scaling
A standard transformer model has N transformer layers, and the dimensionality of the input to each layer is d_model. The MHA has n_h heads, and the dimension of each head is d_h = d_model / n_h. The hidden dimension of the FFN is d_FFN = m · d_model, where m is a scalar FFN multiplier.

Parameters α and β are introduced to scale, per layer, the number of attention heads n_h and the FFN multiplier m (and thus the width of the FFN layers), respectively. For the i-th layer, n_h and m are computed as:
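Per the paper, α and β are varied linearly across the depth of the network. For the i-th layer with 0 ≤ i < N:

α^i = α_min + (α_max − α_min) · i / (N − 1), β^i = β_min + (β_max − β_min) · i / (N − 1)

n_h^i = α^i · d_model / d_h, m^i = β^i

Here α_min and α_max control the scaling of the attention heads, and β_min and β_max control the width of the FFN layers; setting α_min = α_max = 1.0 and m^i = m recovers the standard, uniformly-scaled transformer.

A minimal Python sketch of this per-layer scaling is shown below. The α/β ranges are illustrative defaults, and the released implementation may additionally round dimensions to hardware-friendly multiples.

```python
def layerwise_scaling(n_layers, d_model, d_head,
                      alpha=(0.5, 1.0), beta=(0.5, 4.0)):
    """Compute per-layer head counts and FFN widths (illustrative α/β ranges)."""
    configs = []
    for i in range(n_layers):
        t = i / max(1, n_layers - 1)                 # 0 at the first layer, 1 at the last
        alpha_i = alpha[0] + (alpha[1] - alpha[0]) * t
        beta_i = beta[0] + (beta[1] - beta[0]) * t
        n_heads = max(1, round(alpha_i * d_model / d_head))
        d_ffn = int(beta_i * d_model)
        configs.append({"layer": i, "n_heads": n_heads, "d_ffn": d_ffn})
    return configs


for cfg in layerwise_scaling(n_layers=4, d_model=768, d_head=64):
    print(cfg)
```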

Pre-training data
Publicly available datasets totalling about 1.8 T tokens are used for pre-training.
Text filtering and tokenization are performed on the fly, facilitating seamless experimentation with different tokenizers. Sequences with fewer than 200 characters or 256 tokens are removed.
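A minimal sketch of such an on-the-fly filter is given below. It assumes a generic tokenizer object with an encode method; the names are illustrative and this is not the released data pipeline.

```python
def keep_sequence(text: str, tokenizer, min_chars: int = 200, min_tokens: int = 256) -> bool:
    """Return True if the sequence passes the length filters used during pre-training."""
    if len(text) < min_chars:                 # drop sequences shorter than 200 characters
        return False
    return len(tokenizer.encode(text)) >= min_tokens  # drop sequences shorter than 256 tokens


def filtered_stream(corpus, tokenizer):
    """Yield only sequences that survive filtering, applied lazily (on the fly)."""
    for text in corpus:
        if keep_sequence(text, tokenizer):
            yield text
```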

Training details
OpenELM variants are pre-trained for 350k iterations, with the following hyperparameters.

The cleaned variant of the UltraFeedback dataset, consisting of about 60k prompts, is used for instruction tuning with the Alignment Handbook library. For optimization, either statistical rejection sampling or direct preference optimization (DPO) is used.
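The paper does not prescribe a particular implementation here; as a refresher, below is a minimal sketch of the DPO objective (one of the two optimization choices), assuming per-sequence log-probabilities of the chosen and rejected responses have already been computed under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Direct Preference Optimization loss for a batch of preference pairs."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log π/π_ref for preferred
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log π/π_ref for dispreferred
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()


# Example with dummy log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss)
```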

Evaluation

Pre-training results

- OpenELM shows an overall increase in accuracy with longer training durations.



- OpenELM with 1.1 B parameters achieves 1.28% (zero-shot tasks), 2.36% (OpenLLM Leaderboard), and 1.72% (LLM360) higher accuracy than OLMo with 1.2 B parameters, while using 2× less pre-training data.
Instruction tuning results



- Instruction tuning consistently improves OpenELM’s average accuracy by 1–2% across different evaluation frameworks.
Parameter-efficient fine-tuning (PEFT) results

- LoRA and DoRA deliver similar accuracy on average across the given commonsense reasoning datasets (a minimal LoRA sketch is shown below).
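As a refresher, here is a minimal sketch of a LoRA-adapted linear layer, not the exact PEFT-library implementation used in the paper; DoRA additionally decomposes the weight into magnitude and direction components.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # base weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T) @ self.lora_b.T


layer = LoRALinear(nn.Linear(768, 768, bias=False), r=8)
print(layer(torch.randn(2, 768)).shape)          # torch.Size([2, 768])
```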
Paper
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework (arXiv:2404.14619)
Recommended Reading: [Small LLMs]