Poster
BOND: Aligning LLMs with Best-of-N Distillation
Pier Giuseppe Sessa · Robert Dadashi · Léonard Hussenot-Desenonges · Johan Ferret · Nino Vieillard · Alexandre Rame · Bobak Shahriari · Sarah Perrin · Abram Friesen · Geoffrey Cideron · Sertan Girgin · Piotr Stanczyk · Andrea Michi · Danila Sinopalnikov · Sabela Ramos Garea · Amélie Héliou · Aliaksei Severyn · Matthew Hoffman · Nikola Momchev · Olivier Bachem
Hall 3 + Hall 2B #198
Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling, which selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that uses a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.
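As a rough sketch of the objective described above (not taken from the poster itself; the mixing weight β and the exact weighting convention are assumptions for illustration), the Jeffreys divergence between the policy π and the Best-of-N distribution π_BoN can be written as a weighted combination of the two KL directions:

\[
J_{\beta}\big(\pi \,\|\, \pi_{\mathrm{BoN}}\big)
\;=\; (1-\beta)\,\mathrm{KL}\big(\pi_{\mathrm{BoN}} \,\|\, \pi\big)
\;+\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{BoN}}\big),
\]

where the forward KL term pushes the policy to cover the modes of the Best-of-N distribution, the backward KL term makes it concentrate on (seek) those modes, and β trades off between the two behaviors.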