Papers Explained 150: MarianMT

Ritvik Rastogi
3 min read · Jun 13, 2024

Marian is a robust, self-contained Neural Machine Translation system. It is implemented entirely in C++ and features a built-in automatic differentiation engine based on dynamic computation graphs. Its encoder-decoder framework is designed to combine efficient training with fast translation, making it a research-friendly toolkit.

Design Outline

The deep-learning back-end is based on reverse-mode auto-differentiation with dynamic computation graphs, similar to DyNet. The back-end is optimized for machine translation and similar use cases, with efficient implementations of fused RNN cells, attention mechanisms, and atomic layer-normalization.
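To make the idea concrete, here is a minimal, purely illustrative C++ sketch of reverse-mode auto-differentiation over a dynamically built graph. All names (Node, Graph, leaf, etc.) are invented for this example and are not Marian's actual API; the real back-end operates on tensors and GPU memory.

// Minimal illustration of reverse-mode autodiff on a dynamically built graph.
// This is NOT Marian's API; types and names are invented for explanation only.
#include <cstdio>
#include <functional>
#include <memory>
#include <vector>

struct Node {
  float value = 0.f;  // forward value
  float grad = 0.f;   // accumulated gradient (d loss / d node)
  std::vector<std::function<void()>> backward_steps;  // recorded at build time
};

using NodePtr = std::shared_ptr<Node>;

struct Graph {
  std::vector<NodePtr> tape;  // nodes in creation order (topological order)

  NodePtr leaf(float v) {
    auto n = std::make_shared<Node>();
    n->value = v;
    tape.push_back(n);
    return n;
  }

  // y = a * b; the backward closure is recorded while the graph is being built.
  NodePtr mul(NodePtr a, NodePtr b) {
    auto y = leaf(a->value * b->value);
    Node *ap = a.get(), *bp = b.get(), *yp = y.get();
    yp->backward_steps.push_back([ap, bp, yp] {
      ap->grad += yp->grad * bp->value;  // d(a*b)/da = b
      bp->grad += yp->grad * ap->value;  // d(a*b)/db = a
    });
    return y;
  }

  // y = a + b
  NodePtr add(NodePtr a, NodePtr b) {
    auto y = leaf(a->value + b->value);
    Node *ap = a.get(), *bp = b.get(), *yp = y.get();
    yp->backward_steps.push_back([ap, bp, yp] {
      ap->grad += yp->grad;
      bp->grad += yp->grad;
    });
    return y;
  }

  // Reverse pass: walk the tape backwards and apply the recorded closures.
  void backward(NodePtr loss) {
    loss->grad = 1.f;
    for (auto it = tape.rbegin(); it != tape.rend(); ++it)
      for (auto& step : (*it)->backward_steps) step();
  }
};

int main() {
  Graph g;                         // a fresh graph can be built per batch
  auto x = g.leaf(3.f), w = g.leaf(2.f), b = g.leaf(1.f);
  auto y = g.add(g.mul(w, x), b);  // y = w * x + b, built dynamically
  g.backward(y);
  std::printf("dy/dw = %.1f, dy/dx = %.1f\n", w->grad, x->grad);  // 3.0 and 2.0
  return 0;
}

Marian's actual back-end additionally fuses operators (for example whole GRU cells) into single kernels, which is where much of its speed comes from.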

The encoder and decoder are implemented as classes with a simplified interface:

class Encoder {
  // Builds the encoder over the source side of a batch and stores the
  // resulting encoder context in the returned EncoderState.
  EncoderState build(Batch);
};

class Decoder {
  // Creates the initial DecoderState from a list of EncoderStates.
  DecoderState startState(EncoderState[]);
  // One inference step: consumes the target part of a batch and
  // produces the output logits of the model.
  DecoderState step(DecoderState, Batch);
};

The encoder-decoder model is implemented as a Bahdanau-style model, where the encoder is built inside `Encoder::build` and the resulting encoder context is stored in the `EncoderState` object. The decoder receives a list of `EncoderState` objects and creates the initial `DecoderState`. The `Decoder::step` function consumes the target part of a batch to produce the output logits of the model.

The framework allows different encoders and decoders to be combined freely, such as RNN-based encoders with Transformer decoders, and reduces implementation effort: only a single inference step needs to be implemented in order to train, score, and translate with a new model.
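As a purely illustrative sketch (not Marian's actual code) of how this interface ties training, scoring, and translation to one step function, the following compiles with toy stand-in types; the method bodies are placeholders only.

#include <vector>

// Toy stand-ins so the sketch compiles; in Marian these are tensor-backed
// classes, and the bodies below are placeholders, not real model code.
struct Batch { std::vector<int> source, target; };
struct EncoderState { std::vector<float> context; };
struct DecoderState { std::vector<float> logits; };

struct Encoder {
  EncoderState build(const Batch& batch) {
    // A real encoder would run e.g. a bi-directional RNN or Transformer here.
    return EncoderState{std::vector<float>(batch.source.size(), 0.f)};
  }
};

struct Decoder {
  DecoderState startState(const std::vector<EncoderState>& encoderStates) {
    // Initialise the decoder from the encoder context(s).
    return DecoderState{};
  }
  DecoderState step(const DecoderState& previous, const Batch& batch) {
    // One inference step: consume the target part of the batch, produce logits.
    return DecoderState{std::vector<float>(batch.target.size(), 0.f)};
  }
};

// The same single step is reused for training (a loss over the logits),
// scoring (summing log-probabilities), and translation (beam search).
DecoderState runModel(Encoder& encoder, Decoder& decoder, const Batch& batch) {
  EncoderState context = encoder.build(batch);
  DecoderState state = decoder.startState({context});
  return decoder.step(state, batch);
}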

Additionally, Marian includes many efficient meta-algorithms, such as:

  • Multi-device (GPU or CPU) training, scoring, and batched beam search
  • Ensembling of heterogeneous models (e.g. deep RNN models, Transformer models, or language models); a toy scoring sketch follows this list
  • Multi-node training
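To illustrate what ensembling heterogeneous models means at decode time (my own simplification, not Marian's implementation): at each decoding step, the per-model log-probability distributions over the target vocabulary are combined with equal weights before the beam is expanded.

// Toy sketch of ensembling at decode time; invented names, not Marian code.
#include <cstddef>
#include <vector>

using LogProbs = std::vector<float>;  // one log-probability per vocabulary item

// Combine the per-step predictions of several models with equal weights.
LogProbs ensembleStep(const std::vector<LogProbs>& perModel) {
  const float weight = 1.f / perModel.size();
  LogProbs combined(perModel.front().size(), 0.f);
  for (const LogProbs& scores : perModel)
    for (std::size_t i = 0; i < scores.size(); ++i)
      combined[i] += weight * scores[i];
  return combined;
}

The combined distribution is used to expand beam hypotheses exactly as with a single model, which is why RNN models, Transformers, and language models can be mixed as long as they share the target vocabulary.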

Architecture and Training

The Marian toolkit is used to implement a deep RNN-based sequence-to-sequence model (Transformer-based models are trained with the same recipe, as noted below).

The model architecture consists of:

  • A sequence-to-sequence model with single-layer RNNs in both the encoder and decoder
  • Bi-directional RNN in the encoder
  • Stacked GRU blocks in the encoder and decoder, yielding tall RNN cells at each recurrent step (deep transitions)
  • Attention mechanism between the first and second block in the decoder
  • Embedding size of 512, RNN state size of 1024
  • Layer normalization and variational dropout inside GRU-blocks and attention
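For reference, a GRU block of the kind listed above computes roughly the following (my notation, using one common GRU convention; the exact placement of layer normalization (LN) and variational dropout in Marian's implementation may differ):

\begin{aligned}
z_t &= \sigma\big(\mathrm{LN}(W_z x_t + U_z h_{t-1})\big) \\
r_t &= \sigma\big(\mathrm{LN}(W_r x_t + U_r h_{t-1})\big) \\
\tilde{h}_t &= \tanh\big(\mathrm{LN}(W_h x_t + r_t \odot U_h h_{t-1})\big) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}

Here the embeddings feeding the first block are 512-dimensional and the recurrent state h_t is 1024-dimensional, matching the sizes listed above.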

The training recipe consists of:

  • Preprocessing of training data, including tokenization, true-casing, and vocabulary reduction
  • Training of a shallow model for backtranslation on parallel WMT17 data
  • Translation of 10M German monolingual news sentences to English
  • Concatenation of the artificial (back-translated) corpus with the original data (the original data is included twice) to produce the new training data
  • Training of four left-to-right (L2R) deep models (either RNN-based or Transformer-based)
  • Training of four additional deep models with right-to-left (R2L) orientation
  • Ensemble-decoding with four L2R models resulting in an n-best list of 12 hypotheses per input sentence
  • Rescoring of the n-best list with the four R2L models, with all model scores weighted equally (see the formula after this list)
  • Evaluation on newstest-2016 (validation set) and newstest-2017 with sacreBLEU
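The rescoring step can be written compactly (my notation; the paper only states that all eight model scores are weighted equally). Each hypothesis y in the 12-best list for a source sentence x receives the score

\mathrm{score}(y \mid x) = \sum_{m=1}^{4} \log P_m^{\mathrm{L2R}}(y \mid x) + \sum_{m=1}^{4} \log P_m^{\mathrm{R2L}}(y \mid x)

and the highest-scoring hypothesis is selected as the final translation.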

Paper

Marian: Fast Neural Machine Translation in C++ (arXiv:1804.00344)

Hungry for more insights?

Don’t miss out on exploring other fascinating threads in this series. Simply click here and uncover the state-of-the-art research!

Do Subscribe for weekly updates!!
