

Poster in Workshop: SCOPE: Scalable Optimization for Efficient and Adaptive Foundation Models

Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing

Aviv Bick · Tobias Katsch · Nimit Sohoni · Arjun Desai · Albert Gu

Keywords: [ LLM; Mamba; distillation; efficiency ]


Abstract:

We present the Llamba model series, a family of highly efficient recurrent language models distilled from the Llama-3.x family into the Mamba architecture. The series includes Llamba-1B, Llamba-4B, and Llamba-8B, delivering high inference throughput while maintaining competitive benchmark performance. Beyond its computational advantages, Llamba showcases the effectiveness of the MOHAWK distillation framework, achieving high-quality performance while being distilled with less than 0.1% of the data typically used to train models of similar size. We also provide an optimized implementation of the Llamba models for deployment on resource-constrained devices, such as smartphones and edge platforms, offering a practical and memory-efficient alternative to traditional Transformer architectures. Overall, these models set new standards for speed, memory efficiency, and accessibility of language models.
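For readers unfamiliar with Transformer-to-SSM distillation, the sketch below illustrates in PyTorch the kind of stagewise objectives a MOHAWK-style pipeline combines: aligning the student's sequence-mixing matrices and hidden states with the teacher's, followed by standard output-level distillation. The function names, tensor shapes, and temperature are illustrative assumptions, not the released Llamba or MOHAWK code.

    # Hypothetical sketch of Transformer-to-SSM distillation objectives.
    # Shapes and names are assumptions for illustration only.
    import torch
    import torch.nn.functional as F

    def matrix_alignment_loss(teacher_mixing: torch.Tensor,
                              student_mixing: torch.Tensor) -> torch.Tensor:
        """Match the teacher's attention matrices (batch, heads, seq, seq)
        against the student's SSM mixing matrices via a Frobenius norm."""
        return torch.linalg.matrix_norm(teacher_mixing - student_mixing).mean()

    def hidden_state_loss(teacher_hidden: torch.Tensor,
                          student_hidden: torch.Tensor) -> torch.Tensor:
        """Align per-layer hidden states (batch, seq, dim) with an L2 distance."""
        return F.mse_loss(student_hidden, teacher_hidden)

    def output_distillation_loss(teacher_logits: torch.Tensor,
                                 student_logits: torch.Tensor,
                                 temperature: float = 2.0) -> torch.Tensor:
        """Standard KL divergence between softened teacher and student
        next-token distributions."""
        t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
        s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        return F.kl_div(s_log_probs, t_log_probs, log_target=True,
                        reduction="batchmean") * temperature ** 2

Because only the student's mixing layers and hidden representations are retrained against a frozen teacher, this style of distillation can converge on far less data than pretraining from scratch, which is consistent with the sub-0.1% data budget reported in the abstract.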


