Holistic Agent Leaderboard
The standardized, cost-aware, and third-party leaderboard for evaluating agents.
By the SAgE team at Princeton University

The One-Stop Shop for Agent Benchmarking
Simple, standardized and reproducible agent evaluations
The Problem with Agent Evaluations
Current agent evaluations suffer from:
- Inconsistent harnesses across evaluations make results hard to compare and reproduce, and are prone to bugs
- Running new agents is time-consuming because a simple evaluation framework has been missing
- Lack of cost tracking prevents cost-controlled evaluations
- Agent evaluations differ from LLM evaluations. Existing efforts like HELM and the LM Evaluation Harness provide standardized LLM evaluations; HAL is inspired by them and closes the gap for agents.
For a detailed analysis of these issues, see AI Agents That Matter (Kapoor et al., 2024).
HAL's Solution
HAL consists of two independent components:
1. HAL Leaderboards
- Central platform for assessing agent capabilities
- Cost-controlled evaluations by default
- Detailed failure analysis and monitoring
2. HAL Evaluation Harness
- Standalone evaluation harness for a wide range of benchmarks
- Easy integration of custom agents and benchmarks, independent of agent framework
- Built-in logging and cost tracking
HAL Leaderboards
SWE-bench Verified
Evaluating agents on resolving real-world GitHub issues
SWE-bench Verified Mini
A compact version of SWE-bench for quicker agent evaluation
USACO
Programming problems from the USA Computing Olympiad
AppWorld Normal
Evaluating interactive coding capabilities in a controlled world of apps and people
AppWorld Challenge
More challenging tasks in the AppWorld environment
CORE-Bench Easy
Computational reproducibility of scientific papers from experimental outputs
CORE-Bench Medium
Computational reproducibility of scientific papers given pre-configured environments
CORE-Bench Hard
Computational reproducibility of scientific papers given code and data
GAIA
General AI assistant benchmark
Cybench
Cybersecurity capabilities and risks of agents
AgentHarm
Agent safety and harmfulness
Who is it for?
HAL serves four key user groups in the AI ecosystem
Downstream Users & Procurers
- Discover useful but lesser-known benchmarks related to tasks you care about
- Find out who is building strong agents on these benchmarks
- Identify the state of the art for both cost and accuracy on these tasks
Benchmark Developers
- Gain improved visibility for your benchmark
- Incentivize agent developers to build agents for your benchmark
- Enable cost-controlled evaluations by default without extra effort
Agent Developers
- Easily reproduce existing agents and perform unbiased comparisons
- Compete on a leaderboard in a straightforward way
- Use the HAL harness for framework-agnostic agent evaluation
Safety Researchers
- Understand agent capabilities on real-world safety threats and their associated costs
- For example, Cybench evaluations show how capable, and how affordable, agents are for potential adversaries
Cost-Controlled Agent Evaluations
Understanding the cost-performance trade-off
Why looking at the Pareto frontier matters
- Agents can be 100x more expensive while being only 1% better
- Downstream developers can't tell the difference on a one-dimensional, accuracy-only leaderboard; plotting the cost-accuracy Pareto frontier makes it visible (see the sketch below)
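To make this concrete, here is a minimal Python sketch (not part of the HAL codebase) of how a cost-accuracy Pareto frontier can be computed from per-agent results; the agent names, costs, and accuracies are hypothetical.

```python
def pareto_frontier(results):
    """Return agents not dominated by a cheaper, at-least-as-accurate agent."""
    frontier = []
    for a in results:
        dominated = any(
            b["cost"] <= a["cost"]
            and b["accuracy"] >= a["accuracy"]
            and (b["cost"] < a["cost"] or b["accuracy"] > a["accuracy"])
            for b in results
        )
        if not dominated:
            frontier.append(a)
    # Sort by cost so the frontier reads left to right on a cost-accuracy plot.
    return sorted(frontier, key=lambda r: r["cost"])


# Hypothetical results: the expensive agent is only one point more accurate.
results = [
    {"name": "agent-cheap", "cost": 0.50, "accuracy": 0.62},
    {"name": "agent-mid", "cost": 5.00, "accuracy": 0.55},
    {"name": "agent-expensive", "cost": 50.00, "accuracy": 0.63},
]
for r in pareto_frontier(results):
    print(f"{r['name']}: ${r['cost']:.2f} per task, {r['accuracy']:.0%} accuracy")
```

On an accuracy-only leaderboard, agent-expensive ranks first; the frontier shows that it costs 100x more than agent-cheap for a one-point gain in accuracy.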
The HAL Evaluation Harness
A unified framework for reproducible agent evaluation
Standardized Evaluation
- One-stop shop evaluation harness for all benchmarks and agents
- Flexible execution environments for running parallel evaluations locally or in the cloud
Comprehensive Logging
- Automatic logging of agent traces with W&B Weave
- Detailed cost tracking of token usage with minimal edits to agent code (see the sketch below)
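As a rough illustration, the sketch below shows how tracing with W&B Weave and token-based cost accounting can be added to an agent with a decorator and a few extra lines. This is not HAL's internal implementation; the project name, model choice, and per-token prices are illustrative assumptions.

```python
import weave
from openai import OpenAI

# Hypothetical per-token prices (USD); real prices depend on model and provider.
PRICE_PER_INPUT_TOKEN = 0.15 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 0.60 / 1_000_000

client = OpenAI()
weave.init("hal-demo")  # illustrative project name; traces appear in W&B Weave


@weave.op()  # logs inputs, outputs, and nested LLM calls for this function
def solve_task(task_prompt: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": task_prompt}],
    )
    usage = response.usage
    cost = (
        usage.prompt_tokens * PRICE_PER_INPUT_TOKEN
        + usage.completion_tokens * PRICE_PER_OUTPUT_TOKEN
    )
    return {"answer": response.choices[0].message.content, "cost_usd": cost}


if __name__ == "__main__":
    print(solve_task("Write a one-line summary of the SWE-bench benchmark."))
```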
Developer Friendly
- Easy agent integration that does not require a specific agent framework (see the sketch below)
- Modular architecture that allows for easy extensions with new benchmarks and agents
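The sketch below illustrates what framework-agnostic integration can look like: the agent is a plain Python function that maps task inputs to submissions. The function name `run`, the dict shapes, and the `model_name` kwarg are assumptions for illustration; consult the hal-harness documentation for the exact interface.

```python
# main.py -- a framework-agnostic agent as a plain Python function.
from openai import OpenAI

client = OpenAI()


def run(tasks: dict[str, dict], **kwargs) -> dict[str, str]:
    """Map each task id to the agent's submission string."""
    model = kwargs.get("model_name", "gpt-4o-mini")  # hypothetical kwarg
    results: dict[str, str] = {}
    for task_id, task in tasks.items():
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": task.get("prompt", "")}],
        )
        results[task_id] = response.choices[0].message.content
    return results
```

Because the harness only needs a callable like this, the same agent can be built with any framework underneath, or none at all.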
Agent Monitoring (Experimental)
Visibility into agent behavior and failure modes
Automated Failure Analysis
Our LLM-based automated failure analysis tool identifies recurring failure modes, helping you understand where and why agents struggle.
View failure reports →
Monitoring Reports
Get detailed breakdowns of failure categories, their frequencies, and descriptions to guide debugging and agent development.
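As a rough illustration of the idea (not the actual HAL analysis tool), the sketch below asks an LLM to assign a failure category to each failed run's trace summary; the category list, model, and prompt are assumptions.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical failure taxonomy; HAL's actual categories may differ.
CATEGORIES = ["tool misuse", "wrong plan", "gave up early", "environment error", "other"]


def classify_failure(trace_summary: str) -> str:
    """Ask an LLM to pick the single best-fitting failure category for one trace."""
    prompt = (
        "You are analyzing an AI agent's failed task attempt.\n"
        f"Trace summary:\n{trace_summary}\n\n"
        f"Answer with exactly one of: {', '.join(CATEGORIES)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in CATEGORIES else "other"


# Aggregating these labels over many failed runs yields the frequency
# breakdowns described above.
```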
Agent Traces
Enabling rapid development and debugging while protecting benchmark integrity
Complete Agent Traces
We make available the full traces of agent evaluations, including every single model call as logged by W&B Weave.
Encrypted Distribution
All agent traces are encrypted to prevent benchmark contamination through automated scraping.
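The sketch below shows one way such encryption can work, using symmetric Fernet encryption from the `cryptography` package. It illustrates the idea only; it is not HAL's actual distribution scheme, and the example trace and key handling are hypothetical.

```python
from cryptography.fernet import Fernet

# In practice the key is shared out of band (e.g., on request),
# not published next to the traces.
key = Fernet.generate_key()
fernet = Fernet(key)

# A trace would normally be read from a JSON file; a short stand-in is used here.
trace = b'{"task_id": "example", "model_calls": ["..."]}'

encrypted = fernet.encrypt(trace)      # what gets distributed publicly
decrypted = fernet.decrypt(encrypted)  # what a researcher with the key recovers

assert decrypted == trace
# Automated scrapers that only see `encrypted` cannot ingest the traces as
# plain text, which limits benchmark contamination.
```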
Meet the Team
The people behind HAL
Core Team
Contributors
We're grateful to the network of contributors to HAL.
Want to Contribute?
HAL is an open-source project and we welcome contributions from the community.
Cite HAL
@misc{hal,
  title        = {HAL: A Holistic Agent Leaderboard for Centralized and Reproducible Agent Evaluation},
  author       = {Benedikt Stroebl and Sayash Kapoor and Arvind Narayanan},
  howpublished = {\url{https://github.com/princeton-pli/hal-harness/}},
  year         = {2025}
}
Funding
HAL is funded by the Princeton AI Lab and the Princeton Language and Intelligence Initiative.