Papers Explained 176: SmolLM

Ritvik Rastogi
5 min read · Aug 6, 2024


SmolLM is a series of state-of-the-art small language models available in three sizes: 135M, 360M, and 1.7B parameters. These models are built on a meticulously curated high-quality training corpus, released as SmolLM-Corpus. The SmolLM-Corpus includes:

  • FineWeb-Edu (deduplicated): educational web samples from FineWeb (220B tokens)
  • Cosmopedia v2: A collection of synthetic textbooks and stories generated by Mixtral (28B tokens)
  • Python-Edu: educational Python samples from The Stack (4B tokens)

The models and the dataset are available on HuggingFace.
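As a minimal sketch, the released checkpoints can be loaded with the transformers library; the repository id below (HuggingFaceTB/SmolLM-135M) and the example prompt are assumptions about the Hub release, not details from this post.

```python
# Sketch: load a SmolLM checkpoint from the Hugging Face Hub and generate text.
# The repository id is an assumed Hub location for the base 135M model.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("Gravity is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```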

Recommended Reading [Cosmopedia] [FineWeb]

Data Curation

FineWeb-Edu

Comparison of FineWeb-Edu to other open web datasets.

FineWeb-Edu is a dataset released alongside FineWeb. It consists of 1.3T tokens of educational web pages filtered from the FineWeb dataset. An educational quality classifier, trained on annotations generated by Llama3-70B-Instruct, is used to retain only the most educational web pages from FineWeb.

In the SmolLM-Corpus, 220B deduplicated tokens from FineWeb-Edu are used.
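As a rough illustration of this scoring-and-filtering step, the sketch below scores documents with an educational-quality classifier and keeps only the high-scoring ones. The classifier repository id and the threshold of 3 are assumptions based on the FineWeb-Edu release, not details stated in this post.

```python
# Sketch of FineWeb-Edu-style filtering: score documents with an
# educational-quality classifier and keep only high-scoring pages.
# Model id and threshold are assumptions, not from this post.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CLASSIFIER = "HuggingFaceFW/fineweb-edu-classifier"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(CLASSIFIER)
model = AutoModelForSequenceClassification.from_pretrained(CLASSIFIER)

def edu_score(text: str) -> float:
    """Return the classifier's educational-quality score for one document."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()

docs = ["An introduction to photosynthesis ...", "Buy cheap watches now!!!"]
kept = [d for d in docs if edu_score(d) >= 3]  # retain only educational pages
```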

Cosmopedia v2

Cosmopedia v2 is an enhanced version of Cosmopedia, a synthetic dataset for pre-training, which consists of over 30 million textbooks, blog posts, and stories generated by Mixtral-8x7B-Instruct-v0.1.

To improve the prompts, two strategies were tried:

  1. Using more capable models with the same prompts, which did not yield significant improvements.
  2. Optimizing the prompts themselves.

The final dataset consists of 39 million synthetic documents, amounting to 28B tokens of textbooks, stories, articles, and code, covering a diverse range of audiences and over 34,000 topics.
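As a loose illustration of the topic- and audience-conditioned generation, the sketch below builds a textbook-style prompt and sends it to an instruct model. The template wording and the generation call are placeholders, not the actual Cosmopedia v2 prompts.

```python
# Hypothetical Cosmopedia-style generation: the prompt template, topic,
# and audience below are illustrative placeholders, not the real prompts.
from transformers import pipeline

PROMPT = (
    "Write a textbook section about '{topic}' for {audience}. "
    "Use clear explanations, concrete examples, and a didactic tone."
)

generator = pipeline("text-generation", model="mistralai/Mixtral-8x7B-Instruct-v0.1")
prompt = PROMPT.format(topic="Newton's laws of motion", audience="high school students")
document = generator(prompt, max_new_tokens=512)[0]["generated_text"]
```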

Python-Edu

Comparison of Python-Edu to unfiltered Python code.

The idea of FineWeb-Edu was applied to code: 500,000 Python samples from The Stack were annotated using Llama3 and used to train an educational classifier. This classifier was then applied to the Python subset of the StarCoder models' training corpus, and a refined dataset of 4 billion tokens was obtained by retaining only the samples scoring 4 or higher out of the available 40 billion Python tokens.
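A minimal sketch of that final selection step, assuming the classifier scores have already been attached to each sample as an edu_score column (the dataset id and column name are hypothetical placeholders):

```python
# Sketch of the Python-Edu selection step: keep only samples whose
# educational score is 4 or higher. Dataset id and column name are
# placeholders, not actual released artifacts.
from datasets import load_dataset

ds = load_dataset("my-org/python-with-edu-scores", split="train")  # hypothetical
python_edu = ds.filter(lambda sample: sample["edu_score"] >= 4)
print(f"Kept {len(python_edu)} of {len(ds)} Python samples")
```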

Training

Training mixture of SmolLM models.

SmolLM models are trained on the data mixture below:

  • The 135M and 360M models are each trained on 600B tokens from SmolLM-Corpus.
  • The 1.7B model is trained on 1T tokens from SmolLM-Corpus.

Architecture details of SmolLM models.

For the architecture of the 135M and 360M parameter models, a design similar to MobileLLM is adopted, incorporating Grouped-Query Attention (GQA) and prioritizing depth over width.

The 1.7B parameter model uses a more traditional architecture.

All three models use embedding tying and have a context length of 2048 tokens. A tokenizer trained on the SmolLM-Corpus with a vocabulary size of 49152 is used.
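To make these architecture choices concrete, the sketch below expresses them as a transformers LlamaConfig. The vocabulary size, context length, GQA, and embedding tying come from the post; the specific hidden size, layer count, and head counts are illustrative placeholders for a deep-and-thin 135M-class design.

```python
# Illustrative config for a small, deep-and-thin SmolLM-style model.
# Vocab size, context length, GQA, and embedding tying follow the post;
# hidden size, layer count, and head counts are placeholder values.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=49152,              # tokenizer trained on SmolLM-Corpus
    max_position_embeddings=2048,  # context length of 2048 tokens
    hidden_size=576,               # placeholder: narrow width
    intermediate_size=1536,        # placeholder
    num_hidden_layers=30,          # placeholder: depth prioritized over width
    num_attention_heads=9,         # placeholder
    num_key_value_heads=3,         # GQA: fewer KV heads than query heads
    tie_word_embeddings=True,      # embedding tying
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```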

The models were instruction tuned using publicly available, permissively licensed instruction datasets.

  • All three models were trained for one epoch on the permissive subset of the WebInstructSub dataset, combined with StarCoder2-Self-OSS-Instruct.
  • Direct Preference Optimization (DPO) was then performed for one epoch, using HelpSteer for the 135M and 1.7B models and argilla/dpo-mix-7k for the 360M model (a sketch of this step follows the list).
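A rough sketch of that DPO step using the TRL library follows; the starting checkpoint, beta value, and other hyperparameters are placeholders rather than the actual SmolLM recipe, and the argument names follow recent TRL versions.

```python
# Sketch of a DPO fine-tuning run with TRL; checkpoint, beta, and other
# hyperparameters are illustrative, not the actual SmolLM training recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

checkpoint = "HuggingFaceTB/SmolLM-360M"  # placeholder; in practice the SFT checkpoint is used
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# DPO expects preference pairs: a prompt with a chosen and a rejected response.
train_dataset = load_dataset("argilla/dpo-mix-7k", split="train")

args = DPOConfig(
    output_dir="smollm-dpo",
    num_train_epochs=1,  # one epoch, as described in the post
    beta=0.1,            # placeholder preference-strength coefficient
)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```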

Evaluation

Comparison of SmolLM models to other SLMs. All models are evaluated on the same setup, except for MobileLLM, which isn't publicly available.
Evaluation of SmolLM models on HumanEval.
  • SmolLM-135M outperforms the current best model with fewer than 200M parameters, MobileLLM-125M, despite being trained on only 600B tokens compared to MobileLLM's 1T tokens.
  • SmolLM-360M outperforms all models with fewer than 500M parameters, including MobileLLM-350M and Qwen2-500M, despite having fewer parameters and being trained on fewer tokens (600B).
  • SmolLM-1.7B outperforms all other models with fewer than 2B parameters, including Microsoft's Phi-1.5, MobileLLM-1.5B, and Qwen2-1.5B.
  • SmolLM-1.7B shows strong Python coding performance with 24 pass@1 on HumanEval (the standard pass@k estimator is sketched after this list). Note that the evaluation score for Qwen2-1.5B differs from the 31.1 pass@1 reported by the Qwen team, likely due to differences in the evaluation setup.
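For reference, pass@1 on HumanEval is typically computed with the unbiased pass@k estimator introduced alongside the benchmark; a minimal sketch:

```python
# Unbiased pass@k estimator from the HumanEval/Codex paper:
# pass@k = 1 - C(n - c, k) / C(n, k), averaged over problems.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 5 passing -> pass@1 estimate of 0.25.
print(pass_at_k(n=20, c=5, k=1))
```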
Evaluation of SmolLM models on different reasoning and common knowledge benchmarks.
  • SmolLM models outperform other models in their size categories across a diverse set of benchmarks, testing common sense reasoning and world knowledge.
Evaluation of SmolLM-Instruct models on IFEval.
  • While the Qwen2-1.5B-Instruct model scores the highest with 29.94, the SmolLM-Instruct models provide a good balance between model size and performance, using only publicly available permissive datasets.

SmolLM v0.2

v0.2 models are better at staying on topic and responding appropriately to standard prompts, such as greetings and questions about their role as AI assistants. SmolLM-360M-Instruct (v0.2) has a 63.3% win rate over SmolLM-360M-Instruct (v0.1) on AlpacaEval.

V0.1 models were fine-tuned on the permissive subset of the WebInstructSub dataset, combined with StarCoder2-Self-OSS-Instruct. Then, DPO was performed for one epoch on HelpSteer for the 135M and 1.7B models, and argilla/dpo-mix-7k for the 360M model.

V0.2 models are trained on everyday-conversations-llama3.1-2k, Magpie-Pro-300K-Filtered, StarCoder2-Self-OSS-Instruct, and a small subset of OpenHermes-2.5.

Paper

SmolLM - blazingly fast and remarkably powerful

Recommended Reading [Small LLMs]

Hungry for more insights?

Don’t miss out on exploring other fascinating threads in this series. Simply click here and uncover the state-of-the-art research!

Do Subscribe for weekly updates!!
