Holistic Agent Leaderboard

The standardized, cost-aware, and third-party leaderboard for evaluating agents.

By the SAgE team at Princeton University

95 Agents · 11 Benchmarks

The One-Stop Shop for Agent Benchmarking

Simple, standardized, and reproducible agent evaluations

The Problem with Agent Evaluations

Current agent evaluations suffer from:

  • Inconsistent harnesses across evaluations make results hard to compare and evaluations hard to reproduce, and are prone to bugs
  • Running new agents is time-consuming because simple evaluation frameworks are missing
  • Lack of cost tracking prevents cost-controlled evaluations
  • Agent evaluations differ from LLM evaluations. Existing tools like HELM and the LM Evaluation Harness standardize LLM evaluations; HAL is inspired by these efforts and closes the gap for agents.

For a detailed analysis of these issues, see AI agents that matter (Kapoor et al., 2024)

HAL's Solution

HAL consists of two independent components:

1. HAL Leaderboards

  • Central platform for assessing agent capabilities
  • Cost-controlled evaluations as default
  • Detailed failure analysis and monitoring
View Leaderboards

2. HAL Harness

  • Standalone evaluation harness for various benchmarks
  • Easy integration of custom agents and benchmarks independent of agent framework
  • Built-in logging and cost tracking
View on GitHub

Who is it for?

HAL serves four key user groups in the AI ecosystem

Downstream Users & Procurers

  • Discover useful but lesser-known benchmarks related to the tasks you care about
  • Find out who is building strong agents on these benchmarks
  • Identify the state of the art for both cost and accuracy on these tasks

Benchmark Developers

  • Gain improved visibility for your benchmark
  • Incentivize agent developers to build agents for your benchmark
  • Enable cost-controlled evaluations by default without extra effort

Agent Developers

  • Easily reproduce existing agents and perform unbiased comparisons
  • Compete on a leaderboard in a straightforward way
  • Use the HAL harness for framework-agnostic agent evaluation

Safety Researchers

  • Understand agent capabilities on real-world safety threats and their associated costs
  • For example, Cybench evaluations provide insight into agent performance and affordability for adversaries

Cost-Controlled Agent Evaluations

Understanding the cost-performance trade-off

Why looking at the Pareto frontier matters

  • Agents can be 100x more expensive while only being 1% better
  • Downstream developers can't tell the difference from 1D leaderboards
[Figure: cost-performance frontier showing Agents A, B, and C plotted by cost ($) against performance (%)]
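
To make the idea concrete, here is a minimal sketch (not HAL code) that extracts the cost-performance Pareto frontier from a set of hypothetical (cost, accuracy) results:

# Minimal sketch (not HAL code): find the cost-performance Pareto frontier
# for a set of hypothetical agent results.
results = {
    "Agent A": (0.50, 0.62),   # (cost in $, accuracy)
    "Agent B": (5.00, 0.60),
    "Agent C": (50.00, 0.63),
}

def pareto_frontier(results):
    """Return agents that no other agent beats on both cost and accuracy."""
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            other != name
            and other_cost <= cost
            and other_acc >= acc
            and (other_cost < cost or other_acc > acc)
            for other, (other_cost, other_acc) in results.items()
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: results[n][0])

print(pareto_frontier(results))  # prints ['Agent A', 'Agent C']; Agent B is dominated

In this hypothetical example, Agent C is 100x more expensive than Agent A for one extra point of accuracy, and Agent B is dominated outright; a one-dimensional leaderboard would rank Agent C first and hide all of this.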

The HAL Evaluation Harness

A unified framework for reproducible agent evaluation

Standardized Evaluation

  • One-stop shop evaluation harness for all benchmarks and agents
  • Flexible execution environments for running parallel evaluations locally or in the cloud

Comprehensive Logging

  • Automatic logging of agent traces with W&B Weave
  • Detailed cost tracking of token usage with minimal edits to agent code (see the sketch below)
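
As an illustration of how lightweight the instrumentation can be, the sketch below wraps a hypothetical agent function with the real W&B Weave APIs (weave.init and the @weave.op decorator); the project name, model, and function are placeholders, not HAL's own integration code:

# Sketch of trace logging with W&B Weave. weave.init() and the @weave.op
# decorator are real Weave APIs; the project name, model, and agent function
# below are placeholders, not HAL's own integration code.
import weave
from openai import OpenAI

weave.init("my-agent-eval")   # hypothetical Weave project
client = OpenAI()             # requires OPENAI_API_KEY in the environment

@weave.op()                   # every call to this function is logged as a trace
def solve_task(task_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": task_prompt}],
    )
    return response.choices[0].message.content

print(solve_task("Name two prime numbers."))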

Developer Friendly

  • Easy agent integration that does not require a specific agent framework (see the sketch below)
  • Modular architecture that allows for easy extensions with new benchmarks and agents
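
For illustration, a custom agent integration might look like the sketch below; the run() signature and keyword arguments are assumptions made for this example, not a verbatim copy of the HAL harness interface:

# Hypothetical sketch of a framework-agnostic agent entry point.
# The run() signature and keyword arguments are illustrative assumptions.
from typing import Any

def run(tasks: dict[str, Any], **kwargs: Any) -> dict[str, str]:
    """Map each task id in `tasks` to the agent's submission string."""
    model_name = kwargs.get("model_name", "gpt-4o-mini")  # e.g. passed in by the harness
    answers: dict[str, str] = {}
    for task_id, task in tasks.items():
        # Any agent logic can go here: a bare API call, LangChain, DSPy, ...
        answers[task_id] = solve(task, model_name)
    return answers

def solve(task: Any, model_name: str) -> str:
    # Placeholder agent logic; replace with real tool use, planning, etc.
    return f"proposed answer to {task!r} using {model_name}"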

Agent Monitoring (Experimental)

Visibility into agent behavior and failure modes

Automated Failure Analysis

Our LLM-based automated failure analysis tool identifies recurring failure modes, helping you understand where and why agents struggle.

View failure reports →
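
The sketch below shows the general pattern as an illustration only (the categories, prompt, and model are assumptions, not HAL's actual implementation): prompt an LLM to assign each failed run's trace summary to a failure category, then aggregate the labels across runs.

# Sketch of LLM-based failure categorization in the spirit of HAL's
# monitoring tool. Categories, prompt, and model are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

CATEGORIES = ["tool misuse", "incorrect reasoning", "gave up early", "context overflow", "other"]

def classify_failure(trace_summary: str) -> str:
    """Ask an LLM to pick the single best-fitting failure category."""
    prompt = (
        "You are analyzing a failed agent run. Pick the single best-fitting "
        f"failure category from this list: {CATEGORIES}.\n\n"
        f"Trace summary:\n{trace_summary}\n\n"
        "Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Counting the labels across many failed runs yields a report of recurring
# failure modes and their frequencies.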

Monitoring Reports

Get detailed breakdowns of failure categories, their frequencies, and descriptions to guide debugging and agent development.

Agent Traces

Enabling rapid development and debugging while protecting benchmark integrity

Complete Agent Traces

We make the full traces of agent evaluations available, including every model call as logged by W&B Weave.

Encrypted Distribution

All agent traces are encrypted to prevent benchmark contamination through automated scraping.

Meet the Team

The people behind HAL

Core Team

Benedikt Stroebl

PhD Student, Princeton University

Website →

Sayash Kapoor

PhD Student, Princeton University

Website →

Arvind Narayanan

Professor of Computer Science, Princeton University

Website →

Contributors

We're grateful to the network of contributors to HAL:

Amit Arora
Amazon
Aymeric Roucher
Hugging Face
Hailey Schoelkopf
Anthropic
Harsh Trivedi
Stony Brook
Iason Gabriel
Google DeepMind
Jelena Luketina
UK AISI
JJ Allaire
UK AISI
Laura Weidinger
Google DeepMind
Madhur Prashant
Amazon
Marius Hobbhahn
Apollo Research
Maximillian Kaufmann
UK AISI
Morgan McGuire
Weights & Biases
Omar Khattab
MIT
Parth Asawa
UC Berkeley
Rishi Bommasani
Stanford
Shreya Shankar
UC Berkeley
Shayne Longpre
MIT
William Isaac
Google DeepMind
Yifan Mai
Stanford
Zachary Siegel
Princeton

Want to Contribute?

HAL is an open-source project and we welcome contributions from the community.

Cite HAL

@misc{hal,
  title        = {HAL: A Holistic Agent Leaderboard for Centralized and Reproducible Agent Evaluation},
  author       = {Benedikt Stroebl and Sayash Kapoor and Arvind Narayanan},
  howpublished = {\url{https://github.com/princeton-pli/hal-harness/}},
  year         = {2025}
}

Funding

HAL is funded by the Princeton AI Lab and the Princeton Language and Intelligence Initiative.