ICLR 2025 Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models Oral

Marianne Arriola · Subham Sahoo · Aaron Gokaslan · Zhihan Yang · Zhixuan Qi · Jiaqi Han · Justin Chiu · Volodymyr Kuleshov

Garnet 212-213

Oral Session 2D

Thu 24 Apr 1:18 a.m. — 1:30 a.m. PDT

Abstract:

Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and by speeding up inference through KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules that minimize this variance. Block diffusion achieves state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, model weights, and a blog post on the project page: https://mariannearriola.github.io/bd3-lms
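To make the sampling idea concrete, the following is a minimal toy sketch of block-wise generation: blocks are produced one after another (the autoregressive part), and the tokens inside each block start fully masked and are revealed in parallel over a few denoising steps (the diffusion part). It is not the authors' implementation. The names `toy_denoiser`, `sample_block`, `generate`, the tiny vocabulary, and the even-share unmasking rule are all assumptions made for illustration, and the random "denoiser" merely stands in for a trained network.

```python
import random

MASK = "<mask>"
EOS = "<eos>"
VOCAB = ["the", "cat", "sat", "on", "mat", EOS]  # toy vocabulary (assumption)


def toy_denoiser(context, block, rng):
    """Hypothetical stand-in for a trained model.

    Proposes a token for every masked position in `block`. A real model
    would condition on `context` (the earlier blocks, typically served
    from a KV cache); this toy version ignores it and samples uniformly.
    """
    return [rng.choice(VOCAB) if tok == MASK else tok for tok in block]


def sample_block(context, block_size, num_steps, rng):
    """Denoise one block: start fully masked, reveal positions in parallel."""
    block = [MASK] * block_size
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        proposal = toy_denoiser(context, block, rng)
        # Reveal an even share of the remaining masked positions per step;
        # the final step reveals everything that is still masked.
        remaining_steps = num_steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        for i in rng.sample(masked, k):
            block[i] = proposal[i]
    return block


def generate(block_size=4, num_steps=2, max_blocks=8, seed=0):
    """Generate block by block until an EOS token appears or max_blocks is hit."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_blocks):
        block = sample_block(tokens, block_size, num_steps, rng)
        if EOS in block:
            # Variable-length output: keep only the tokens before EOS and stop.
            tokens.extend(block[: block.index(EOS)])
            break
        tokens.extend(block)
    return tokens


if __name__ == "__main__":
    print(generate())
```

In this sketch, setting `block_size=1` reduces the loop to token-by-token autoregressive sampling, while a single block spanning the whole sequence corresponds to a fixed-length diffusion sampler, which is one way to read the interpolation described in the abstract.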