HAL: Holistic Agent Leaderboard
The standardized, cost-aware, and third-party leaderboard for evaluating agents.
By the SAgE team at Princeton University

Performance Highlights
Top 3 performing agents shown for each benchmark:
- AssistantBench (Web Assistance)
- CORE-Bench Hard (Scientific Programming)
- GAIA (Web Assistance)
- Online Mind2Web (Web Assistance)
- SWE-bench Verified Mini (Software Engineering)
- Scicode (Scientific Programming)
- ScienceAgentBench (Scientific Programming)
- TAU-bench Airline (Customer Service)
- USACO (Programming)
Who is it for?
HAL serves four key user groups in the AI ecosystem
Downstream Users & Procurers
- Discover useful but lesser-known benchmarks related to the tasks you care about
- Find out who is building strong agents on these benchmarks
- Identify the state of the art for both cost and accuracy on these tasks
Benchmark Developers
- Gain improved visibility for your benchmark
- Incentivize agent developers to build agents for your benchmark
- Enable cost-controlled evaluations by default without extra effort
Agent Developers
- Easily reproduce existing agents and perform unbiased comparisons
- Compete on a leaderboard in a straightforward way
- Use the HAL harness for framework-agnostic agent evaluation
Safety Researchers
- Understand agent capabilities on real-world safety threats and their associated costs
Cost-Controlled Agent Evaluations
Understanding the cost-performance trade-off
Why looking at the Pareto frontier matters
- Agents can be 100x more expensive while being only 1% more accurate
- Downstream developers can't tell the difference from a one-dimensional leaderboard (see the Pareto sketch below)
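To make the trade-off concrete, here is a minimal sketch of how a cost-accuracy Pareto frontier can be computed from per-agent results. The agent names, costs, and accuracies are made-up placeholders rather than HAL leaderboard numbers, and the helper is illustrative, not HAL's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    name: str
    cost_usd: float   # total USD spent on model calls for the benchmark run
    accuracy: float   # fraction of tasks solved

def pareto_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """Return the agents that are not dominated: no other agent is both
    cheaper and at least as accurate."""
    frontier = []
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.accuracy)):
        # Sweeping in order of increasing cost, an agent is on the frontier
        # iff it improves on the best accuracy seen so far.
        if not frontier or r.accuracy > frontier[-1].accuracy:
            frontier.append(r)
    return frontier

# Hypothetical numbers for illustration only, not real leaderboard results.
results = [
    AgentResult("agent-a", cost_usd=2.0, accuracy=0.41),
    AgentResult("agent-b", cost_usd=200.0, accuracy=0.42),  # 100x cost, ~1% better
    AgentResult("agent-c", cost_usd=5.0, accuracy=0.40),    # dominated by agent-a
]
for r in pareto_frontier(results):
    print(f"{r.name}: ${r.cost_usd:.2f}, {r.accuracy:.0%}")
```

Agents off the frontier (like agent-c above) are dominated: another agent is both cheaper and at least as accurate, which a single accuracy column would never reveal.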
The HAL Evaluation Harness
A unified framework for reproducible agent evaluation
Standardized Evaluation
- One-stop shop evaluation harness for all benchmarks and agents
- Flexible execution environments for running parallel evaluations locally or in the cloud
Comprehensive Logging
- Automatic logging of agent traces with W&B Weave
- Detailed cost tracking of token usage with minimal edits to agent code (illustrated below)
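As an illustration of what "minimal edits" means, the sketch below wraps a model call with W&B Weave. The project name, model, and wrapper function are placeholder assumptions; the HAL harness normally handles this wiring for you.

```python
import weave
from openai import OpenAI

# Initialize a Weave project; decorated functions and supported LLM client
# calls are traced automatically, including inputs, outputs, and token usage.
weave.init("hal-demo")  # placeholder project name

client = OpenAI()

@weave.op()
def answer(question: str) -> str:
    # Weave records this call as a named step in the agent trace.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

print(answer("What is 2 + 2?"))
```

After `weave.init`, Weave also auto-traces supported LLM clients such as the OpenAI SDK, so decorating your own functions mainly serves to group calls under named steps in the trace.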
Developer Friendly
- Easy agent integration that does not require a specific agent framework (see the sketch below)
- Modular architecture that allows for easy extensions with new benchmarks and agents
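Below is a minimal sketch of the kind of framework-agnostic agent the harness is designed around: a single plain-Python entry point that maps benchmark tasks to answers. The function name, signature, and task fields here are assumptions for illustration; consult the hal-harness repository for the actual interface.

```python
from openai import OpenAI

# A minimal, framework-free agent. The name `run`, the dict-of-dicts
# signature, and the "prompt" field are illustrative assumptions, not the
# harness's exact API.
def run(tasks: dict[str, dict], model_name: str = "gpt-4o-mini", **kwargs) -> dict[str, str]:
    client = OpenAI()
    answers = {}
    for task_id, task in tasks.items():
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": task["prompt"]}],
        )
        answers[task_id] = response.choices[0].message.content
    return answers
```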
Agent Traces
Enabling rapid development and debugging while protecting benchmark integrity
Complete Agent Traces
We make available the full traces of agent evaluations, including every single model call as logged by W&B Weave.
Encrypted Distribution
All agent traces are encrypted to prevent benchmark contamination through automated scraping.
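For legitimate users, opening a downloaded trace archive is straightforward once the key is obtained. The snippet below is only an illustration under the assumption of a standard password-protected ZIP; the file name, password, and the actual archive format HAL uses are not specified here, so refer to the hal-harness repository for the real instructions.

```python
import zipfile

# Hypothetical example: extract a password-protected trace archive.
# File name and password are placeholders, and standard ZIP encryption
# is assumed; the actual HAL trace format may differ.
ARCHIVE = "agent_traces.zip"
PASSWORD = b"obtained-from-the-hal-maintainers"

with zipfile.ZipFile(ARCHIVE) as zf:
    zf.extractall(path="traces", pwd=PASSWORD)
```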
Meet the Team
The people behind HAL
Core Team
Contributors
We're grateful to the network of contributors to HAL.
Want to Contribute?
HAL is an open-source project and we welcome contributions from the community.
Cite HAL
@Misc{hal,
  title        = {HAL: A Holistic Agent Leaderboard for Centralized and Reproducible Agent Evaluation},
  author       = {Sayash Kapoor and Benedikt Stroebl and Peter Kirgis and Franck Stéphane Ndzomga and Kangheng Liu and Arvind Narayanan},
  howpublished = {\url{https://github.com/princeton-pli/hal-harness/}},
  year         = {2025}
}
Funding
HAL is funded by Open Philanthropy, Schmidt Sciences, the Princeton AI Lab and the Princeton Language and Intelligence Initiative. We are grateful to OpenAI for providing API credits to evaluate their models.