Papers Explained 119: DBRX

DBRX is an open, general-purpose Large Language Model (LLM) created by Databricks. It sets a new state of the art among open models, surpassing GPT-3.5 and competing with Gemini 1.0 Pro across a range of benchmarks. It is built on a fine-grained mixture-of-experts (MoE) architecture, which yields marked improvements in training and inference performance.
Architecture
DBRX uses a transformer-based decoder-only architecture incorporating a fine-grained mixture-of-experts (MoE) design. Specifically, it has 132B total parameters, of which 36B parameters are active on any given input. DBRX was trained on 12T tokens of text and code data.
The DBRX architecture is characterized by its use of a larger number of smaller experts: it has 16 experts and activates 4 for any given input, whereas Mixtral and Grok-1 have 8 experts and activate 2. This provides 65x more possible combinations of experts, which was found to improve model quality.
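As a quick sanity check of the 65x figure (a toy calculation, not taken from the DBRX release itself): choosing 4 of 16 experts allows C(16, 4) = 1820 combinations, versus C(8, 2) = 28 for choosing 2 of 8, and 1820 / 28 = 65.

```python
from math import comb

# Possible expert subsets per token under each routing scheme.
dbrx_combos = comb(16, 4)     # 16 experts, 4 active -> 1820
mixtral_combos = comb(8, 2)   # 8 experts, 2 active  -> 28

print(dbrx_combos, mixtral_combos, dbrx_combos // mixtral_combos)  # 1820 28 65
```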
Additionally, DBRX employs rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA), and uses the GPT-4 tokenizer from the tiktoken repository.
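As a minimal sketch of the tokenizer choice, the GPT-4 encoding can be loaded from the tiktoken library under the name cl100k_base (DBRX's production tokenizer may layer extra special tokens on top of this base vocabulary):

```python
import tiktoken

# cl100k_base is the GPT-4 encoding published in the tiktoken repository.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("DBRX is a fine-grained mixture-of-experts LLM.")
print(len(tokens), tokens[:5])
print(enc.decode(tokens))
```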
The MoE architecture allows DBRX to achieve marked improvements in both training and inference performance. Inference is up to 2x faster than LLaMA2-70B, and DBRX is about 40% of the size of Grok-1 in terms of both total and active parameter counts. Training MoEs is also about 2x more FLOP-efficient than training dense models to the same final model quality.
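To make the total-versus-active parameter distinction concrete, here is a toy top-k MoE layer in PyTorch. This is an illustrative sketch, not DBRX's implementation (DBRX uses GLU expert MLPs and efficient batched expert dispatch); the point is that each token's router selects 4 of 16 experts, so only those experts' weights participate in that token's forward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy fine-grained MoE layer: route each token to top_k of n_experts."""

    def __init__(self, d_model=64, d_ff=128, n_experts=16, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                         # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # mix only the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = idx[:, slot] == e          # tokens routing to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TopKMoE()
out = layer(torch.randn(10, 64))  # each token only touches 4 of the 16 experts
print(out.shape)                  # torch.Size([10, 64])
```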
Evaluation
Benchmark Performance
DBRX's performance was evaluated across standard benchmarks in language understanding, programming, and mathematics, including MMLU, HumanEval, and GSM8K.

- DBRX sets new state-of-the-art records on several benchmarks, outperforming both open and closed models in various domains.
- Demonstrates exceptional strength in programming and mathematics, surpassing specialized models like CodeLLaMA-70B.
- Scores higher than all other models considered on the MMLU benchmark.
- Across nearly all the considered benchmarks, DBRX Instruct surpasses or at worst matches GPT-3.5.
Long-Context Task Performance
DBRX's ability to handle extended sequence lengths was assessed on long-context benchmarks such as KV-Pairs and HotpotQAXL, with comparisons against the latest versions of the GPT-3.5 Turbo and GPT-4 Turbo APIs. A toy generator for the KV-Pairs-style task is sketched after the results below.

- DBRX performs better than GPT-3.5 Turbo across all context lengths and parts of the sequence, with performance generally similar to Mixtral Instruct.
- Shows competitive performance even when compared to GPT-4 Turbo, especially in the beginning and middle thirds of the context.
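The KV-Pairs task (from the Lost in the Middle line of work) asks the model to return the value for one key from a long context of random key-value pairs, probing retrieval at different positions. Here is a toy generator for this kind of task, purely illustrative rather than the benchmark's official code:

```python
import json
import random
import string

def make_kv_sample(n_pairs=64, key_len=8, seed=None):
    """One KV-Pairs-style example: a long JSON context of random key-value
    pairs plus a query asking for a single key's value."""
    rng = random.Random(seed)
    rand_str = lambda: "".join(rng.choices(string.ascii_lowercase, k=key_len))
    pairs = {rand_str(): rand_str() for _ in range(n_pairs)}
    query_key = rng.choice(list(pairs))
    prompt = (
        "Extract the value for the requested key from the JSON object below.\n"
        f"{json.dumps(pairs)}\n"
        f"Key: {query_key}\nValue:"
    )
    return prompt, pairs[query_key]  # (model input, expected answer)

prompt, answer = make_kv_sample(seed=0)
print(prompt[:120], "...")
print("expected:", answer)
```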
Retrieval Augmented Generation (RAG) Performance
DBRX's effectiveness on RAG tasks, in which content relevant to a prompt is retrieved from a database and provided along with the prompt, was evaluated on benchmarks like Natural Questions and HotPotQA using the top 10 passages retrieved from Wikipedia. A minimal prompt-assembly sketch follows the results below.

- DBRX shows competitive performance with open models and the current version of GPT-3.5 Turbo on RAG tasks.
- Demonstrates the model’s capability to leverage external information effectively in generating responses.
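As a minimal sketch of how such an evaluation can be wired up (the function name and prompt template here are hypothetical; the exact template used in the DBRX evaluation is not specified):

```python
def build_rag_prompt(question: str, passages: list[str], k: int = 10) -> str:
    """Assemble a retrieval-augmented prompt from the top-k retrieved passages."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages[:k]))
    return (
        "Answer the question using only the passages below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Usage: pass the top 10 Wikipedia passages returned by the retriever.
prompt = build_rag_prompt(
    "When was the Eiffel Tower completed?",
    ["The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
     "It was designed and built by the company of engineer Gustave Eiffel."],
)
print(prompt)
```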
Paper
Introducing DBRX: A New State-of-the-Art Open LLM: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm