Papers Explained 303: Reward rAnked FineTuning (RAFT)

Ritvik Rastogi
5 min read · Feb 5, 2025

Generative foundation models can inherit implicit biases from their extensive unsupervised training data, leading to suboptimal samples, skewed outcomes, and unfairness. Reinforcement Learning from Human Feedback (RLHF) has been primarily used to address this alignment challenge. However, RL algorithms can be inefficient and unstable, hindering successful alignment.

To this end, Reward rAnked FineTuning (RAFT) is designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, the approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently enhancing the model by fine-tuning on these filtered samples.

Reward rAnked FineTuning

Consider an initial generative model G0 = g(w0, ·) with model parameters w0. Given an input x, the model generates an output y according to the distribution p_G0^(1/λ), where λ is a temperature parameter controlling diversity. A reward function r(x, y) is also assumed, which returns a reward for any input-output pair (x, y) and is used to guide the model g(w, ·). Denoting by p_g(y|w, x) the conditional distribution over outputs associated with parameters w, and by D the distribution of training inputs x, the objective is to maximize the expected reward:

max_w E_{x ∼ D} E_{y ∼ p_g(·|w, x)} [r(x, y)]   (Eq. 1)

If the generative model is powerful enough to achieve the maximum at each prompt x, then the solution of Eq. (1) is the policy that returns, for each prompt, the highest-reward response:

g*(x) = argmax_y r(x, y)   (Eq. 2)

In practice, it is generally infeasible to search the entire output space to find the optimal policy. The policy can, however, be enhanced by fine-tuning the model on a high-reward dataset. One natural choice is a pre-determined high-quality dataset, but previous studies have shown that SFT on a pre-determined dataset usually yields inferior performance, since the model’s performance in offline learning heavily depends on the coverage of the offline dataset.

The idea is therefore to use the generative model itself to generate additional samples and reinforce the dataset. For each prompt, K responses are sampled from the model and the one with the highest reward is kept. The model is then fine-tuned on these best-of-K samples. This process can be iterated multiple times, as the improved generative model in turn provides a better approximation of Eq. (2), leading to further enhancements.

The learning process of RAFT can be divided into three steps. For each stage t + 1:

  1. Data collection: A batch of prompts D_t = {x_1^t, …, x_b^t} is sampled, and K responses y_1, …, y_K are generated for each prompt x in the batch.
  2. Data ranking: The reward model is used to compute {r(x, y_1), …, r(x, y_K)} for each x, and the best response y := argmax_{y_j ∈ {y_1, …, y_K}} r(x, y_j) is kept for every prompt.
  3. Model fine-tuning: The current model is fine-tuned on the selected best-of-K responses.

These three steps are iteratively alternated until the reward converges.
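A minimal sketch of this loop is shown below. The names generate, reward_fn, and fine_tune are placeholders (not from the paper) standing in for the model’s sampler, the reward model r(x, y), and a standard SFT step:

```python
import random

# --- Hypothetical stubs; a real system would wrap an LLM, a learned
# --- reward model, and a supervised fine-tuning routine.
def generate(model, prompt, k, temperature=1.0):
    """Sample K candidate responses from the current model (stub)."""
    return [f"{prompt} -> sample {i}" for i in range(k)]

def reward_fn(prompt, response):
    """Score an (input, output) pair; stands in for r(x, y)."""
    return random.random()

def fine_tune(model, dataset):
    """Supervised fine-tuning on the filtered samples (stub)."""
    return model

def raft_iteration(model, prompts, k=8, temperature=1.0):
    """One RAFT stage: collect data, rank it, fine-tune on best-of-K."""
    best_of_k = []
    for x in prompts:                                     # 1. data collection
        candidates = generate(model, x, k, temperature)
        scores = [reward_fn(x, y) for y in candidates]    # 2. data ranking
        best = candidates[max(range(k), key=scores.__getitem__)]
        best_of_k.append((x, best))
    return fine_tune(model, best_of_k)                    # 3. model fine-tuning

model = None
prompts = ["prompt A", "prompt B"]
for stage in range(3):            # in practice, iterate until reward converges
    model = raft_iteration(model, prompts)
```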

LLM Experiments

LLaMA-7B is used as the base LLM, Open-LLaMA-3B as the reward model, and GPT-Neo-2.7B in a distillation experiment. The HH-RLHF (Helpful and Harmless) dataset contains 112K training samples and 12.5K test samples, each with a prompt, a “chosen” response, and a “rejected” response. The training follows three stages: supervised fine-tuning (SFT), reward modeling, and RLHF. RAFT iteratively samples K responses from the current model, ranks them with the reward model, and fine-tunes the model on the highest-ranked response. PPO is used as the baseline for comparison.
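As an illustration of the data-ranking step, the sketch below scores candidate responses with a sequence-classification reward model via the Hugging Face API. The checkpoint path is a placeholder; the paper trains its own Open-LLaMA-3B reward model on HH-RLHF preference pairs rather than releasing a specific name assumed here:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint path; stands in for the reward model trained
# on HH-RLHF "chosen"/"rejected" pairs in the paper.
REWARD_MODEL = "path/to/hh-rlhf-reward-model"

tokenizer = AutoTokenizer.from_pretrained(REWARD_MODEL)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    REWARD_MODEL, num_labels=1
)

@torch.no_grad()
def score(prompt: str, response: str) -> float:
    """Return a scalar reward r(x, y) for one prompt/response pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits[0, 0].item()

# Rank K sampled responses for one prompt and keep the best one.
responses = ["response 1", "response 2", "response 3"]
best = max(responses, key=lambda y: score("prompt text", y))
```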

Complete table of results on HH-RLHF dataset.
  • RAFT achieves higher mean reward (2.294) compared to PPO and the SFT baseline, while maintaining reasonable perplexity (4.031).
  • RAFT shows better preservation of perplexity and diversity compared to PPO, suggesting a reduction in alignment tax.
GPT-4 and Human evaluation results on the HH-RLHF dataset.
  • GPT-4 and Human Evaluation: Both GPT-4 and human evaluations support the superiority of RAFT over PPO.
Test results on the hand-out set under different K.
  • Impact of K: Larger K in RAFT leads to higher reward but increased computational cost.
Test results on the hand-out set under different temperatures λ.
  • Impact of Temperature: Higher sampling temperature (λ) increases diversity but slightly reduces reward.
Test results on the hand-out set under different choices of the KL coefficient β.
  • Impact of KL Penalty: KL penalty helps control the divergence from the initial model but can also affect reward learning.
Test results on the hand-out set under different learning objectives.
  • RAFT allows for efficient distillation, where a smaller model (GPT-Neo-2.7B) can be aligned using data generated by a larger model (LLaMA-7B), achieving improved performance.

Diffusion Model Experiments

Stable Diffusion v1.5 (SD-1.5) is fine-tuned using RAFT, with LoRA used for efficient fine-tuning. CLIP is used as the reward function, leveraging both aesthetic scores and text-image matching. Experiments compare RAFT against DDPO (Denoising Diffusion Policy Optimization).
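A sketch of the text-image matching part of such a reward is given below, using the public openai/clip-vit-base-patch32 checkpoint as an assumed stand-in; the paper’s reward additionally incorporates an aesthetic score, which is not modeled here:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP text-image similarity as an image reward (matching term only).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(prompt: str, image: Image.Image) -> float:
    """Score how well a generated image matches its text prompt."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    return clip(**inputs).logits_per_image[0, 0].item()

# RAFT for diffusion: generate K images per prompt, keep the highest-reward
# ones, and LoRA-fine-tune SD-1.5 on that filtered set.
```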

Resolution Adaptation (RAFT-aligned models can generate proper 256 × 256 samples).
  • Resolution Adaptation (256x256): RAFT successfully restores SD-1.5’s ability to generate images at 256x256 resolution. RAFT significantly improves image quality at this resolution, both for in-domain (CIFAR-10 labels) and out-of-domain (CIFAR-100 labels) prompts. While DDPO achieves similar performance, RAFT is approximately 50x faster.
Text-Image Alignment with RAFT. (512×512 resolution).
  • RAFT improves the alignment between generated images and text prompts at 512x512 resolution, addressing the issue of SD-1.5 prioritizing style information over object representation in prompts.

Paper

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment (arXiv:2304.06767)

Hungry for more insights?

Don’t miss out on exploring other fascinating threads in this series. Simply click here and uncover the state-of-the-art research!

Do Subscribe for weekly updates!!
