Poster in Workshop: Modular, Collaborative and Decentralized Deep Learning
Multi-Agent Verification: Scaling Test-Time Compute with Goal Verifiers
Shalev Lifshitz · Sheila McIlraith · Yilun Du
Scaling test-time computation has recently emerged as a promising direction for improving large language model (LLM) performance. A common approach relies on external verifiers (models or programs that assess solution quality) to select among a set of generated solutions. While most approaches use trained reward models as external verifiers, we introduce Goal Verifiers (GVs): out-of-the-box LLMs prompted to verify different aspects of solution correctness by providing binary approvals. Unlike reward models, GVs require no additional training and naturally support combining multiple verification signals through simple voting mechanisms. Building on GVs, we propose Multi-Agent Verification (MAV), a framework that leverages multiple verifier agents to provide a stronger verification signal. MAV introduces a novel dimension for scaling test-time compute: increasing the number and type of verifier agents. We show that multi-agent verification improves the performance of various closed-source and open-source language models, with gains that grow with both the number and diversity of GVs. In addition, we show that MAV enables self-improvement by using the same base model as both the generator and the verifiers.
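To make the selection mechanism concrete, below is a minimal sketch of best-of-n selection with multiple goal verifiers, assuming a generic `query_model(prompt) -> str` callable standing in for any LLM API. The verifier prompts, approval parsing, and function names are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List

# Hypothetical verifier instructions, each checking a different aspect
# of solution correctness (illustrative, not the prompts from the paper).
VERIFIER_PROMPTS = [
    "Check the final answer for correctness.",
    "Check that each reasoning step follows from the previous one.",
    "Check that the solution directly addresses the question asked.",
]

def approves(query_model: Callable[[str], str],
             verifier_instruction: str,
             problem: str,
             solution: str) -> bool:
    """Ask one goal verifier for a binary approval of a candidate solution."""
    prompt = (
        f"{verifier_instruction}\n\n"
        f"Problem:\n{problem}\n\n"
        f"Proposed solution:\n{solution}\n\n"
        "Respond with exactly one word: APPROVE or REJECT."
    )
    return query_model(prompt).strip().upper().startswith("APPROVE")

def select_best(query_model: Callable[[str], str],
                problem: str,
                candidates: List[str]) -> str:
    """Best-of-n selection: return the candidate with the most verifier approvals."""
    def votes(solution: str) -> int:
        # Simple voting mechanism: count binary approvals across all verifiers.
        return sum(
            approves(query_model, instruction, problem, solution)
            for instruction in VERIFIER_PROMPTS
        )
    return max(candidates, key=votes)
```

In this sketch, scaling along the proposed dimension corresponds to adding more (and more diverse) entries to `VERIFIER_PROMPTS` or routing them to different verifier models, and self-improvement corresponds to passing the same `query_model` that generated the candidate solutions.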