Papers Explained 106: Gemma

Ritvik Rastogi
4 min read · Feb 28, 2024

Gemma is a family of lightweight (2B and 7B), state-of-the-art open language models built from the research and technology used to create the Gemini models. Unlike Gemini, these models are not multimodal, nor are they trained for state-of-the-art performance on multilingual tasks.

Parameter counts for both sizes of Gemma models.

The project is available at https://github.com/google-deepmind/gemma.

Recommended Reading [Papers Explained 105: Gemini 1.5 Pro]

Model Architecture

The Gemma model architecture is based on the transformer decoder with the following improvements:

  • The 7B model uses multi-head attention, while the 2B model uses multi-query attention (with `num_kv_heads = 1`).
  • RoPE Embeddings in place of absolute positional embeddings.
  • Embeddings are shared across the inputs and outputs to reduce model size.
  • ReLU nonlinearity is replaced by the GeGLU activation function.
  • Both the input and the output of each transformer sub-layer are normalized using RMSNorm.
  • The models are trained on a context length of 8192 tokens.
Key model parameters.
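To make the first bullet concrete, here is a minimal NumPy sketch of multi-query attention, where several query heads share a single key/value head (num_kv_heads = 1). The projection shapes and the single-layer setup are illustrative assumptions, not Gemma's actual implementation:

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, num_heads):
    # Multi-query attention: num_heads query heads share ONE key/value
    # head (num_kv_heads = 1), which shrinks the KV cache at inference.
    seq, d_model = x.shape
    head_dim = d_model // num_heads
    q = (x @ Wq).reshape(seq, num_heads, head_dim)   # per-head queries
    k = x @ Wk                                       # single shared key head
    v = x @ Wv                                       # single shared value head
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)        # softmax over keys
    out = np.einsum("hqk,kd->qhd", weights, v)       # (seq, heads, head_dim)
    return out.reshape(seq, num_heads * head_dim)
```

The practical payoff is memory: with one shared K/V head, the inference-time KV cache is `num_heads` times smaller than with full multi-head attention.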

Training

Gemma uses a subset of the SentencePiece tokenizer from Gemini. It splits digits, does not remove extra whitespace, and relies on byte-level encodings for unknown tokens. The vocabulary size is 256k tokens.
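The digit-splitting and whitespace-preserving behavior can be illustrated with a toy pre-tokenizer. This regex sketch is purely hypothetical and is not how SentencePiece works internally; it only mimics the two properties described above:

```python
import re

def pretokenize(text):
    # Toy illustration of two tokenizer properties described above:
    #   - each digit becomes its own piece (digit splitting),
    #   - whitespace runs are kept as pieces rather than stripped.
    # Real SentencePiece operates on learned subword units instead.
    return re.findall(r"\d|[^\d\s]+|\s+", text)
```

Splitting digits individually tends to help models with arithmetic, since "1234" is seen as four compositional pieces rather than one opaque token.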

Pretraining

Gemma 2B and 7B are trained on 2T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code. The data is filtered using both heuristics and model-based classifiers to remove harmful or low-quality content. All evaluation sets from the pre-training data mixture are also filtered out.

Instruction Tuning

Gemma models are fine-tuned with supervised fine-tuning (SFT) on a mix of English-only synthetic and human-generated prompt-response pairs, followed by reinforcement learning from human feedback (RLHF), with the reward model trained on English-only labeled preference data and the policy trained on a set of high-quality prompts.

Instruction tuned models are trained with a specific format indicating roles in a conversation, such as the User role, and delineating turns in a conversation.

Relevant formatting control tokens used for both Instruction Tuning of Gemma models.
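A small helper shows how the control tokens and roles are assembled into a prompt. The `<start_of_turn>`/`<end_of_turn>` tokens and the `user`/`model` roles follow the Gemma formatting described in the report; the helper functions themselves are just an illustrative sketch:

```python
def format_turn(role, message):
    # Wrap one conversation turn in Gemma's control tokens,
    # tagging it with its role ("user" or "model").
    return f"<start_of_turn>{role}\n{message}<end_of_turn>\n"

def build_prompt(user_message):
    # Single-turn prompt: the user's turn, then an opened model
    # turn that the model is expected to complete.
    return format_turn("user", user_message) + "<start_of_turn>model\n"
```

Because these are dedicated control tokens in the vocabulary, the model can unambiguously tell turn boundaries apart from ordinary text that happens to mention roles.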

Supervised Fine-Tuning

Given a set of held-out prompts, responses are generated from a test model and a baseline model, and a larger, high-capability model is asked to express a preference between the two responses. Examples containing certain personal information, unsafe or toxic model outputs, mistaken self-identification, or duplicates are removed.
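The data-selection loop described above can be sketched as follows. All function names and the filter interface here are hypothetical stand-ins for the paper's actual pipeline:

```python
def select_sft_example(prompt, judge, test_model, baseline_model, filters):
    # Hypothetical sketch of the selection loop: sample a response from
    # the test model and the baseline, drop filtered examples (PII,
    # unsafe/toxic output, mistaken self-identification, duplicates),
    # then keep whichever response a larger judge model prefers.
    test_resp = test_model(prompt)
    base_resp = baseline_model(prompt)
    if any(f(test_resp) for f in filters):
        return None                          # example removed entirely
    preferred = judge(prompt, test_resp, base_resp)
    return (prompt, preferred)
```

Using a stronger model as the judge lets the pipeline scale preference labeling far beyond what human raters alone could cover.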

Reinforcement Learning from Human Feedback

Pairs of preferences are collected from human raters and a reward function is trained under the Bradley-Terry model. The policy was trained to optimize this reward function using a variant of REINFORCE with a Kullback–Leibler regularization term towards the initially tuned model.
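The two training objectives in this paragraph can be written out as per-example losses. This is a simplified scalar sketch (the β weight and the per-sample KL estimate are illustrative assumptions; in practice the shaped reward is treated as a constant when differentiating the REINFORCE term):

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    # Bradley-Terry reward modeling: the probability that the chosen
    # response beats the rejected one is sigmoid of the reward gap;
    # training maximizes its log-likelihood.
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

def reinforce_kl_loss(logp_policy, logp_ref, reward, beta=0.1):
    # REINFORCE with a KL penalty toward the initially tuned model:
    # the per-sample KL estimate shapes the reward so the policy is
    # discouraged from drifting far from its supervised starting point.
    kl = logp_policy - logp_ref            # per-sample KL estimate
    shaped_reward = reward - beta * kl     # KL-regularized reward
    return -(shaped_reward * logp_policy)  # negative expected shaped reward
```

The KL term is what keeps RLHF from "reward hacking" its way into degenerate text: high-reward outputs that the reference model finds very unlikely are penalized.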

Evaluation

Human Preference Evaluations

Win rate of Gemma models versus Mistral 7B v0.2 Instruct with 95% confidence intervals.
  • Gemma 7B IT and Gemma 2B IT were compared against the Mistral v0.2 7B Instruct model in human preference evaluations.
  • Human evaluation studies were conducted on a held-out collection of around 1,000 prompts covering creative writing, coding, and instruction following, and a set of around 400 prompts was used to test basic safety protocols.
  • Gemma 7B IT outperforms Mistral v0.2 7B Instruct in creative writing tasks, coding, and following instructions, as well as in testing basic safety protocols. Gemma 2B IT also performs well but slightly lower than Gemma 7B IT.

Automated Benchmarks

Academic benchmark results, compared to similarly sized, openly-available models trained on general English text data.
  • Gemma models’ performance across domains including physical reasoning, social reasoning, question answering, coding, mathematics, commonsense reasoning, language modeling, and reading comprehension is compared against open-source LLMs.
  • Gemma 7B outperforms all open-source alternatives at the same or smaller scale on the MMLU benchmark, and even several larger models, including LLaMA2 13B. However, it still falls short of the human expert performance benchmarked at 89.8%.
  • Gemma models demonstrate strong performance on mathematics and coding benchmarks, outperforming other models by at least 10 points on GSM8K and the MATH benchmark, and by at least 6 points on HumanEval.
  • Gemma 7B surpasses the performance of code-fine-tuned CodeLLaMA-7B models on the MBPP benchmark, achieving a score of 44.4% compared to CodeLLaMA’s 41.4%.

Paper

Gemma: Open Models Based on Gemini Research and Technology
