HAL: Holistic Agent Leaderboard

The standardized, cost-aware, and third-party leaderboard for evaluating agents. Read the paper here.

By the SAgE team at Princeton University

23,426 Rollouts · 9 Benchmarks

Performance Highlights

Top-performing agents across the benchmarks

AssistantBench (Web Assistance)

Top 3 performing agents:
1. Browser-Use, o3 Medium (April 2025): 38.8% ($15.15)
2. Browser-Use, GPT-5 Medium (August 2025): 35.2% ($41.69)
3. Browser-Use, o4-mini Low (April 2025): 28.1% ($9.22)

View Full Leaderboard

CORE-Bench Hard (Scientific Programming)

Top 3 performing agents:
1. CORE-Agent, Claude Opus 4.1 (August 2025): 51.1% ($412.42)
2. CORE-Agent, Claude Sonnet 4.5 High (September 2025): 44.4% ($92.34)
3. CORE-Agent, Claude Opus 4.1 High (August 2025): 42.2% ($509.95)

View Full Leaderboard

GAIA (Web Assistance)

Top 3 performing agents:
1. HAL Generalist Agent, Claude Sonnet 4.5 (September 2025): 74.5% ($187.37)
2. HAL Generalist Agent, Claude Sonnet 4.5 High (September 2025): 70.9% ($179.86)
3. HAL Generalist Agent, Claude Opus 4.1 High (August 2025): 68.5% ($562.24)

View Full Leaderboard

Online Mind2Web (Web Assistance)

Top 3 performing agents:
1. SeeAct, GPT-5 Medium (August 2025): 42.3% ($171.07)
2. Browser-Use, Claude Sonnet 4 (May 2025): 40.0% ($1,577.26)
3. Browser-Use, Claude Sonnet 4 High (May 2025): 39.3% ($1,609.92)

View Full Leaderboard

SWE-bench Verified Mini (Software Engineering)

Top 3 performing agents:
1. SWE-Agent, Claude Opus 4.1 (August 2025): 54.0% ($1,789.67)
2. SWE-Agent, Claude Opus 4.1 High (August 2025): 54.0% ($1,599.90)
3. SWE-Agent, Claude 3.7 Sonnet High (February 2025): 54.0% ($388.88)

View Full Leaderboard

SciCode (Scientific Programming)

Top 3 performing agents:
1. SciCode Tool Calling Agent, o3 Medium (April 2025): 9.2% ($111.11)
2. SciCode Zero Shot Agent, o4-mini Low (April 2025): 9.2% ($1.74)
3. SciCode Tool Calling Agent, Claude Opus 4.1 (August 2025): 7.7% ($625.13)

View Full Leaderboard

ScienceAgentBench (Scientific Programming)

Top 3 performing agents:
1. SAB Self-Debug, o3 Medium (April 2025): 33.3% ($11.69)
2. SAB Self-Debug, Claude 3.7 Sonnet High (February 2025): 30.4% ($11.74)
3. SAB Self-Debug, GPT-5 Medium (August 2025): 30.4% ($18.26)

View Full Leaderboard

TAU-bench Airline (Customer Service)

Top 3 performing agents:
1. HAL Generalist Agent, Claude 3.7 Sonnet (February 2025): 56.0% ($42.11)
2. HAL Generalist Agent, Claude Opus 4.1 (August 2025): 54.0% ($180.49)
3. HAL Generalist Agent, Claude Opus 4 (May 2025): 44.0% ($150.15)

View Full Leaderboard

USACO (Programming)

Top 3 performing agents:
1. USACO Episodic + Semantic, GPT-5 Medium (August 2025): 69.7% ($64.13)
2. USACO Episodic + Semantic, o4-mini High (April 2025): 58.0% ($44.04)
3. USACO Episodic + Semantic, Claude Opus 4.1 High (August 2025): 51.5% ($267.72)

View Full Leaderboard

Who is it for?

HAL serves four key user groups in the AI ecosystem

Downstream Users & Procurers

  • Discover useful but lesser-known benchmarks related to tasks you care about
  • Find out who is building strong agents on these benchmarks
  • Identify the state of the art for both cost and accuracy on these tasks

Benchmark Developers

  • Gain improved visibility for your benchmark
  • Incentivize agent developers to build agents for your benchmark
  • Enable cost-controlled evaluations by default without extra effort

Agent Developers

  • Easily reproduce existing agents and perform unbiased comparisons
  • Compete on a leaderboard in a straightforward way
  • Use the HAL harness for framework-agnostic agent evaluation

Safety Researchers

  • Understand agent capabilities on real-world safety threats and their associated costs

Cost-Controlled Agent Evaluations

Understanding the cost-performance trade-off

Why looking at the Pareto frontier matters

  • Agents can be 100x more expensive while only being 1% better
  • Downstream developers can't tell the difference on one-dimensional, accuracy-only leaderboards (see the sketch below)
[Figure: the cost-performance frontier, plotting cost ($) against performance (%); Agents A, B, and C illustrate points on and off the frontier]
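
To make the frontier concrete, here is a minimal Python sketch (illustrative only, not part of the HAL harness) that recovers the cost-performance frontier from a list of leaderboard entries:

def pareto_frontier(entries):
    """Return the cost-performance Pareto frontier.

    `entries` is a list of (name, cost_usd, score_pct) tuples. An entry
    survives if no entry that costs the same or less achieves a score at
    least as high (ties keep a single representative).
    """
    # Sort by cost ascending; break cost ties by higher score first.
    ordered = sorted(entries, key=lambda e: (e[1], -e[2]))
    frontier, best_score = [], float("-inf")
    for name, cost, score in ordered:
        if score > best_score:  # strictly improves on everything cheaper
            frontier.append((name, cost, score))
            best_score = score
    return frontier

# The AssistantBench top 3 from above:
entries = [
    ("Browser-Use / o3 Medium", 15.15, 38.8),
    ("Browser-Use / GPT-5 Medium", 41.69, 35.2),
    ("Browser-Use / o4-mini Low", 9.22, 28.1),
]
print(pareto_frontier(entries))
# The o4-mini Low and o3 Medium runs sit on the frontier; the GPT-5 Medium
# run is dominated (more expensive than o3 Medium, yet lower-scoring).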

The HAL Evaluation Harness

A unified framework for reproducible agent evaluation

Standardized Evaluation

  • A single evaluation harness covering all benchmarks and agents
  • Flexible execution environments for running parallel evaluations locally or in the cloud

Comprehensive Logging

  • Automatic logging of agent traces with W&B Weave
  • Detailed cost tracking of token usage with minimal edits to agent code (see the sketch below)
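
As a concrete illustration, here is a minimal sketch of Weave-style instrumentation. The function and project names are hypothetical, and this is not HAL's actual integration; it only shows the kind of tracing the harness builds on.

import weave

weave.init("hal-demo")  # start logging traces to this W&B project

@weave.op()  # records inputs, outputs, and nested calls on every invocation
def solve_task(task: str) -> str:
    # An agent's model calls and tool use would go here; any calls made
    # inside a weave.op show up as children of this trace.
    return f"answer to: {task}"

solve_task("example task")  # the call now appears as a trace in Weave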

Developer Friendly

  • Easy agent integration that does not require a specific agent framework
  • Modular architecture that allows for easy extension with new benchmarks and agents

Agent Traces

Enabling rapid development and debugging while protecting benchmark integrity

Complete Agent Traces

We release the full traces of agent evaluations, including every model call, as logged by W&B Weave.

Encrypted Distribution

All agent traces are encrypted to prevent benchmark contamination through automated scraping.
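
This page does not specify the exact scheme, so the following is only an illustrative sketch of the idea: encrypt the trace payload symmetrically so that published files are unreadable to scrapers, here using the Fernet recipe from the Python cryptography package as a stand-in.

from cryptography.fernet import Fernet

key = Fernet.generate_key()        # held by the trace distributor
cipher = Fernet(key)

trace = b'{"calls": ["..."]}'      # placeholder for a logged agent trace
published = cipher.encrypt(trace)  # ciphertext that is safe to host publicly

# A researcher who obtains the key recovers the original trace:
assert cipher.decrypt(published) == trace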

Meet the Team

The people behind HAL

Authors

Sayash Kapoor
Princeton University
Benedikt Stroebl
Princeton University
Peter Kirgis
Princeton University
Nitya Nadgir
Work done while at Princeton University
Zachary S Siegel
Princeton University
Boyi Wei
Princeton University
Tianci Xue
Ohio State University
Ziru Chen
Ohio State University
Felix Chen
Princeton University
Saiteja Utpala
Microsoft Research
Franck Ndzomga
Independent Researcher
Dheeraj Oruganty
Amazon
Sophie Luskin
Princeton University
Kangheng Liu
Georgetown University
Botao Yu
Ohio State University
Amit Arora
Georgetown University
Dongyoon Hahm
KAIST
Harsh Trivedi
Stony Brook University
Huan Sun
Ohio State University
Juyong Lee
KAIST
Tengjun Jin
University of Illinois Urbana-Champaign
Yifan Mai
Stanford University
Yifei Zhou
xAI
Yuxuan Zhu
University of Illinois Urbana-Champaign
Rishi Bommasani
Stanford University
Daniel Kang
University of Illinois Urbana-Champaign
Dawn Song
University of California, Berkeley
Peter Henderson
Princeton University
Yu Su
Ohio State University
Percy Liang
Stanford University
Arvind Narayanan
Princeton University

Acknowledgments

Aymeric Roucher
Hugging Face
Ayush Thakur
Weights & Biases
Hailey Schoelkopf
Anthropic
Iason Gabriel
Google DeepMind
Jelena Luketina
UK AISI
JJ Allaire
UK AISI
Laura Weidinger
Google DeepMind
Madhur Prashant
Amazon
Marius Hobbhahn
Apollo Research
Maximillian Kaufmann
UK AISI
Morgan McGuire
Weights & Biases
Omar Khattab
MIT
Parth Asawa
UC Berkeley
Shreya Shankar
UC Berkeley
Shayne Longpre
MIT
Veniamin Veselovsky
Princeton University
William Isaac
Google DeepMind
Charles Teague
UK AISI
Clémentine Fourrier
Hugging Face
Kevin Meng
Transluce

Want to Contribute?

HAL is an open-source project, and we welcome contributions from the community.

Cite HAL

@Misc{hal,
title =        {Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation},
author =       {Sayash Kapoor and Benedikt Stroebl and Peter Kirgis and Nitya Nadgir and Zachary S Siegel and Boyi Wei and Tianci Xue and Ziru Chen and Felix Chen and Saiteja Utpala and Franck Ndzomga and Dheeraj Oruganty and Sophie Luskin and Kangheng Liu and Botao Yu and Amit Arora and Dongyoon Hahm and Harsh Trivedi and Huan Sun and Juyong Lee and Tengjun Jin and Yifan Mai and Yifei Zhou and Yuxuan Zhu and Rishi Bommasani and Daniel Kang and Dawn Song and Peter Henderson and Yu Su and Percy Liang and Arvind Narayanan},
howpublished = {\url{https://github.com/princeton-pli/hal-harness/}},
year =         {2025}}

Funding

HAL is funded by Open Philanthropy, Schmidt Sciences, the Princeton AI Lab and the Princeton Language and Intelligence Initiative. We are grateful to OpenAI for providing API credits to evaluate their models.