Oral in Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
Oral #1: Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Vimal Thilak
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, impacts the model's performance during pretraining and downstream evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.
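To make the parameters-vs-FLOPs distinction concrete, the following minimal Python sketch (illustrative only; the configuration values, field names, and top-k routing assumption are not taken from the talk) counts total versus active expert parameters in a single MoE layer and reports the resulting sparsity level, i.e., the fraction of inactive parameters.

```python
# Illustrative sketch, not from the paper: how an MoE layer decouples
# parameter count (all experts) from per-token FLOPs (only routed experts).
# Configuration values and the top-k routing assumption are hypothetical.

from dataclasses import dataclass


@dataclass
class MoEConfig:
    d_model: int = 1024    # hidden size
    d_ff: int = 4096       # expert feed-forward width
    n_experts: int = 64    # experts per MoE layer
    top_k: int = 2         # experts activated per token


def expert_params(cfg: MoEConfig) -> int:
    # One expert: two dense projections (d_model -> d_ff -> d_model); biases ignored.
    return 2 * cfg.d_model * cfg.d_ff


def moe_layer_stats(cfg: MoEConfig) -> dict:
    total = cfg.n_experts * expert_params(cfg)   # parameters stored
    active = cfg.top_k * expert_params(cfg)      # parameters used per token
    flops_per_token = 2 * active                 # ~2 FLOPs per active weight (multiply-add)
    sparsity = 1.0 - cfg.top_k / cfg.n_experts   # fraction of inactive parameters
    return {
        "total_params": total,
        "active_params": active,
        "flops_per_token": flops_per_token,
        "sparsity": sparsity,
    }


if __name__ == "__main__":
    # Total parameters grow with n_experts, while per-token FLOPs depend only on top_k.
    print(moe_layer_stats(MoEConfig()))  # e.g. sparsity = 1 - 2/64 = 0.96875
```

In this toy setting, increasing `n_experts` raises total parameters (and sparsity) while leaving per-token FLOPs fixed, which is the axis along which the talk studies optimal sparsity under parameter and training-compute constraints.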