HAL: Holistic Agent Leaderboard

The standardized, cost-aware, and third-party leaderboard for evaluating agents.

By the SAgE team at Princeton University

197 Agents · 9 Benchmarks

Performance Highlights

Top performing agents across different benchmarks

AssistantBench

Web Assistance

Top 3 performing agents

Browser-Use
o3 Medium (April 2025)
38.8%
$15.15
Browser-Use
GPT-5 Medium (August 2025)
35.2%
$41.69
Browser-Use
o4-mini Low (April 2025)
28.1%
$9.22
View Full Leaderboard

CORE-Bench Hard

Scientific Programming

Top 3 performing agents

CORE-Agent
Claude Opus 4.1 (August 2025)
51.1%
$412.42
CORE-Agent
Claude Opus 4.1 High (August 2025)
42.2%
$509.95
HAL Generalist Agent
Claude-3.7 Sonnet High (February 2025)
37.8%
$66.15
View Full Leaderboard

GAIA

Web Assistance

Top 3 performing agents

HAL Generalist Agent
Claude Opus 4 High (May 2025)
64.8%
$665.89
HAL Generalist Agent
Claude-3.7 Sonnet High (February 2025)
64.2%
$122.49
HF Open Deep Research
GPT-5 Medium (August 2025)
62.8%
$359.83
View Full Leaderboard

Online Mind2Web

Web Assistance

Top 3 performing agents

SeeAct
GPT-5 Medium (August 2025)
42.3%
$171.07
Browser-Use
Claude Sonnet 4 (May 2025)
40.0%
$1577.26
Browser-Use
Claude Sonnet 4 High (May 2025)
39.3%
$1609.92
View Full Leaderboard

SWE-bench Verified Mini

Software Engineering

Top 3 performing agents

SWE-Agent
Claude Opus 4.1 (August 2025)
54.0%
$1789.67
SWE-Agent
Claude Opus 4.1 High (August 2025)
54.0%
$1599.90
SWE-Agent
Claude-3.7 Sonnet High (February 2025)
54.0%
$388.88
View Full Leaderboard

Scicode

Scientific Programming

Top 3 performing agents

Scicode Tool Calling Agent
o3 Medium (April 2025)
9.2%
$111.11
Scicode Zero Shot Agent
o4-mini Low (April 2025)
9.2%
$1.74
Scicode Tool Calling Agent
Claude Opus 4.1 (August 2025)
7.7%
$625.13
View Full Leaderboard

ScienceAgentBench

Scientific Programming

Top 3 performing agents

SAB Self-Debug
o3 Medium (April 2025)
33.3%
$11.69
SAB Self-Debug
Claude-3.7 Sonnet High (February 2025)
30.4%
$11.74
SAB Self-Debug
GPT-5 Medium (August 2025)
30.4%
$18.26
View Full Leaderboard

TAU-bench Airline

Customer Service

Top 3 performing agents

TAU-bench Few Shot
Claude Opus 4 High (May 2025)
66.0%
$313.83
TAU-bench Few Shot
Claude Opus 4.1 High (August 2025)
62.0%
$298.58
TAU-bench Few Shot
Claude-3.7 Sonnet High (February 2025)
60.0%
$37.23
View Full Leaderboard

USACO

Programming

Top 3 performing agents

USACO Episodic + Semantic
GPT-5 Medium (August 2025)
69.7%
$64.13
USACO Episodic + Semantic
o4-mini High (April 2025)
58.0%
$44.04
USACO Episodic + Semantic
Claude Opus 4.1 High (August 2025)
51.5%
$267.72
View Full Leaderboard

Who is it for?

HAL serves four key user groups in the AI ecosystem

Downstream Users & Procurers

  • Discover useful but lesser-known benchmarks related to tasks you care about
  • Find out who is building strong agents on these benchmarks
  • Identify the state of the art for both cost and accuracy on these tasks

Benchmark Developers

  • Gain improved visibility for your benchmark
  • Incentivize agent developers to build agents for your benchmark
  • Enable cost-controlled evaluations by default, with no extra effort

Agent Developers

  • Easily reproduce existing agents and perform unbiased comparisons
  • Compete on a leaderboard in a straightforward way
  • Use the HAL harness for framework-agnostic agent evaluation

Safety Researchers

  • Understand agent capabilities on real-world safety threats, and the costs associated with them

Cost-Controlled Agent Evaluations

Understanding the cost-performance trade-off

Why looking at the Pareto frontier matters

  • An agent can be 100x more expensive than another while being only 1% more accurate
  • One-dimensional, accuracy-only leaderboards hide this difference from downstream developers
[Figure: illustrative cost ($, x-axis) vs. performance (%, y-axis) plot of Agents A, B, and C, with the cost-performance frontier highlighted]
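The frontier idea above can be made concrete with a few lines of code: an agent belongs on the cost-performance (Pareto) frontier unless some other agent is at least as accurate and strictly cheaper, or at least as cheap and strictly more accurate. The sketch below uses made-up numbers, not leaderboard data.

```python
def pareto_frontier(agents):
    """Return the agents not dominated on (cost, accuracy).

    An agent is dominated if another agent is no worse on both axes
    and strictly better on at least one. `agents` is a list of
    (name, cost, accuracy) tuples.
    """
    frontier = []
    for name, cost, acc in agents:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for other, c, a in agents
            if other != name
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda t: t[1])  # cheapest first

# Illustrative numbers only: a 100x more expensive agent that is
# 1 point more accurate still sits on the frontier, which is exactly
# the trade-off a single accuracy ranking cannot show.
agents = [
    ("Agent A", 10.0, 54.0),
    ("Agent B", 1000.0, 55.0),
    ("Agent C", 500.0, 40.0),  # dominated by Agent A: costlier and less accurate
]
print(pareto_frontier(agents))
```

Reporting the whole frontier, rather than a single winner, lets a procurer pick the cheapest agent that clears their accuracy bar.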

The HAL Evaluation Harness

A unified framework for reproducible agent evaluation

Standardized Evaluation

  • One-stop shop evaluation harness for all benchmarks and agents
  • Flexible execution environments for running parallel evaluations locally or in the cloud

Comprehensive Logging

  • Automatic logging of agent traces with W&B Weave
  • Detailed cost tracking of token usage with minimal edits to agent code

Developer Friendly

  • Easy agent integration that does not require a specific agent framework
  • Modular architecture that allows easy extension with new benchmarks and agents
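To illustrate what a framework-agnostic integration can look like, here is a minimal stub of an agent entry point the harness might invoke. The function name, signature, and task format below are assumptions for illustration only; consult the hal-harness repository for the actual interface.

```python
# Hypothetical sketch of a framework-agnostic agent entry point.
# Names and shapes are illustrative assumptions, not the real HAL API.
def run(tasks: dict[str, dict], **kwargs) -> dict[str, str]:
    """Map each task id to the agent's answer string.

    `tasks` maps a task id to its inputs (here, a "prompt" field);
    the harness would handle sandboxing, parallel execution, trace
    logging, and cost tracking around this call.
    """
    answers = {}
    for task_id, task in tasks.items():
        prompt = task.get("prompt", "")
        # A real agent would call a model or tool loop here;
        # this stub just echoes the prompt back.
        answers[task_id] = f"stub answer for: {prompt}"
    return answers

print(run({"t1": {"prompt": "2+2?"}}))
```

Because the contract is just "tasks in, answers out," an agent built with any framework (or none) can plug in without rewriting its internals.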

Agent Traces

Enabling rapid development and debugging while protecting benchmark integrity

Complete Agent Traces

We release the full traces of agent evaluations, including every individual model call, as logged by W&B Weave.

Encrypted Distribution

All agent traces are encrypted to prevent benchmark contamination through automated scraping.

Meet the Team

The people behind HAL

Core Team

Sayash Kapoor

PhD Student, Princeton University
Senior Fellow, Mozilla

Website →

Benedikt Stroebl

PhD Student, Princeton University

Website →

Peter Kirgis

HAL Team

Franck Stéphane Ndzomga

HAL Team

Kangheng Liu

HAL Team

Arvind Narayanan

Professor of Computer Science, Princeton University

Website →

Contributors

We're grateful to the network of contributors to HAL:

Amit Arora
Amazon
Aymeric Roucher
Hugging Face
Ayush Thakur
Weights & Biases
Boyi Wei
Princeton
Daniel Kang
UIUC
Hailey Schoelkopf
Anthropic
Harsh Trivedi
Stony Brook
Huan Sun
OSU
Iason Gabriel
Google DeepMind
Jelena Luketina
UK AISI
JJ Allaire
UK AISI
Laura Weidinger
Google DeepMind
Madhur Prashant
Amazon
Marius Hobbhahn
Apollo Research
Maximillian Kaufmann
UK AISI
Morgan McGuire
Weights & Biases
Nitya Nadgir
Brookings
Omar Khattab
MIT
Parth Asawa
UC Berkeley
Percy Liang
Stanford
Rishi Bommasani
Stanford
Shreya Shankar
UC Berkeley
Shayne Longpre
MIT
Tianci Xue
OSU
Veniamin Veselovsky
Princeton
William Isaac
Google DeepMind
Yifan Mai
Stanford
Yu Su
OSU
Zachary Siegel
Princeton
Ziru (Ron) Chen
OSU

Want to Contribute?

HAL is an open-source project and we welcome contributions from the community.

Cite HAL

@misc{hal,
  title        = {HAL: A Holistic Agent Leaderboard for Centralized and Reproducible Agent Evaluation},
  author       = {Sayash Kapoor and Benedikt Stroebl and Peter Kirgis and Franck Stéphane Ndzomga and Kangheng Liu and Arvind Narayanan},
  howpublished = {\url{https://github.com/princeton-pli/hal-harness/}},
  year         = {2025}
}

Funding

HAL is funded by Open Philanthropy, Schmidt Sciences, the Princeton AI Lab and the Princeton Language and Intelligence Initiative. We are grateful to OpenAI for providing API credits to evaluate their models.