Holistic Agent Leaderboard
The standardized, cost-aware, and third-party leaderboard for evaluating agents.
By the SAgE team at Princeton University

The One-Stop Shop for Agent Benchmarking
Simple, standardized and reproducible agent evaluations
The Problem with Agent Evaluations
Current agent evaluations suffer from:
- Inconsistent harnesses across evaluations make results hard to compare and reproduce, and are prone to bugs
- Running new agents is time-consuming because a simple evaluation framework has been missing
- Lack of cost tracking prevents cost-controlled evaluations
- Agent evaluations differ from LLM evaluations. Existing efforts like HELM and the LM Evaluation Harness provide standardized LLM evaluations; HAL is inspired by them and closes the gap for agents.
For a detailed analysis of these issues, see AI Agents That Matter (Kapoor et al., 2024).
HAL's Solution
HAL consists of two independent components:
1. HAL Leaderboards
- Central platform for assessing agent capabilities
- Cost-controlled evaluations by default
- Detailed failure analysis and monitoring
2. HAL Evaluation Harness
- Standalone evaluation harness for a wide range of benchmarks
- Easy integration of custom agents and benchmarks, independent of agent framework
- Built-in logging and cost tracking
HAL Leaderboards
SWE-bench Verified
Evaluating agents on resolving real-world GitHub issues
SWE-bench Verified Mini
A compact version of SWE-bench for quicker agent evaluation
USACO
Programming problems from the USA Computing Olympiad
AppWorld Normal
Evaluating interactive coding capabilities in a controlled world of apps and people
AppWorld Challenge
More challenging tasks in the AppWorld environment
CORE-Bench Easy
Computational reproducibility of scientific papers from experimental outputs
CORE-Bench Medium
Computational reproducibility of scientific papers given pre-configured environments
CORE-Bench Hard
Computational reproducibility of scientific papers given code and data
GAIA
General AI assistant benchmark
Cybench
Cybersecurity capabilities and risks of agents
AgentHarm
Agent safety and harmfulness
Who is it for?
HAL serves four key user groups in the AI ecosystem
Downstream Users & Procurers
- Discover useful but lesser-known benchmarks related to tasks you care about
- Find out who is building strong agents on these benchmarks
- Identify the state of the art for both cost and accuracy on these tasks
Benchmark Developers
- Gain improved visibility for your benchmark
- Incentivize agent developers to build agents for your benchmark
- Enable cost-controlled evaluations by default without extra effort
Agent Developers
- Easily reproduce existing agents and perform unbiased comparisons
- Compete on a leaderboard in a straightforward way
- Use the HAL harness for framework-agnostic agent evaluation
Safety Researchers
- Understand agent capabilities on real-world safety threats and their associated costs
- For example, Cybench evaluations show how capable, and how affordable, agents are for potential adversaries
Cost-Controlled Agent Evaluations
Understanding the cost-performance trade-off
Why looking at the Pareto frontier matters
- Agents can be 100x more expensive while being only 1% better
- Downstream developers can't tell the difference on a one-dimensional, accuracy-only leaderboard; plotting the cost-accuracy Pareto frontier makes it visible (see the sketch below)
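To make this concrete, here is a minimal Python sketch (not part of the HAL codebase) of how a cost-accuracy Pareto frontier can be computed from per-agent results; the agent names, costs, and accuracies are hypothetical.

```python
def pareto_frontier(results):
    """Return agents not dominated by a cheaper, at-least-as-accurate agent."""
    frontier = []
    for a in results:
        dominated = any(
            b["cost"] <= a["cost"]
            and b["accuracy"] >= a["accuracy"]
            and (b["cost"] < a["cost"] or b["accuracy"] > a["accuracy"])
            for b in results
        )
        if not dominated:
            frontier.append(a)
    # Sort by cost so the frontier reads left to right on a cost-accuracy plot.
    return sorted(frontier, key=lambda r: r["cost"])


# Hypothetical results: the expensive agent is only one point more accurate.
results = [
    {"name": "agent-cheap", "cost": 0.50, "accuracy": 0.62},
    {"name": "agent-mid", "cost": 5.00, "accuracy": 0.55},
    {"name": "agent-expensive", "cost": 50.00, "accuracy": 0.63},
]
for r in pareto_frontier(results):
    print(f"{r['name']}: ${r['cost']:.2f} per task, {r['accuracy']:.0%} accuracy")
```

On an accuracy-only leaderboard, agent-expensive ranks first; the frontier shows that it costs 100x more than agent-cheap for a one-point gain in accuracy.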
The HAL Evaluation Harness
A unified framework for reproducible agent evaluation
Standardized Evaluation
- One-stop shop evaluation harness for all benchmarks and agents
- Flexible execution environments for running parallel evaluations locally or in the cloud
Comprehensive Logging
- Automatic logging of agent traces with W&B Weave
- Detailed cost tracking of token usage with minimal edits to agent code (see the sketch below)
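As a rough illustration, the sketch below shows how tracing with W&B Weave and token-based cost accounting can be added to an agent with a decorator and a few extra lines. This is not HAL's internal implementation; the project name, model choice, and per-token prices are illustrative assumptions.

```python
import weave
from openai import OpenAI

# Hypothetical per-token prices (USD); real prices depend on model and provider.
PRICE_PER_INPUT_TOKEN = 0.15 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 0.60 / 1_000_000

client = OpenAI()
weave.init("hal-demo")  # illustrative project name; traces appear in W&B Weave


@weave.op()  # logs inputs, outputs, and nested LLM calls for this function
def solve_task(task_prompt: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": task_prompt}],
    )
    usage = response.usage
    cost = (
        usage.prompt_tokens * PRICE_PER_INPUT_TOKEN
        + usage.completion_tokens * PRICE_PER_OUTPUT_TOKEN
    )
    return {"answer": response.choices[0].message.content, "cost_usd": cost}


if __name__ == "__main__":
    print(solve_task("Write a one-line summary of the SWE-bench benchmark."))
```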
Developer Friendly
- Easy agent integration that does not require a specific agent framework (see the sketch below)
- Modular architecture that allows for easy extensions with new benchmarks and agents
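The sketch below illustrates what framework-agnostic integration can look like: the agent is a plain Python function that maps task inputs to submissions. The function name `run`, the dict shapes, and the `model_name` kwarg are assumptions for illustration; consult the hal-harness documentation for the exact interface.

```python
# main.py -- a framework-agnostic agent as a plain Python function.
from openai import OpenAI

client = OpenAI()


def run(tasks: dict[str, dict], **kwargs) -> dict[str, str]:
    """Map each task id to the agent's submission string."""
    model = kwargs.get("model_name", "gpt-4o-mini")  # hypothetical kwarg
    results: dict[str, str] = {}
    for task_id, task in tasks.items():
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": task.get("prompt", "")}],
        )
        results[task_id] = response.choices[0].message.content
    return results
```

Because the harness only needs a callable like this, the same agent can be built with any framework underneath, or none at all.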
Agent Monitoring (Experimental)
Visibility into agent behavior and failure modes
Automated Failure Analysis
Our LLM-based automated failure analysis tool identifies recurring failure modes, helping you understand where and why agents struggle.
View failure reports →
Monitoring Reports
Get detailed breakdowns of failure categories, their frequencies, and descriptions to guide debugging and agent development.
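As a rough illustration of the idea (not the actual HAL analysis tool), the sketch below asks an LLM to assign a failure category to each failed run's trace summary; the category list, model, and prompt are assumptions.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical failure taxonomy; HAL's actual categories may differ.
CATEGORIES = ["tool misuse", "wrong plan", "gave up early", "environment error", "other"]


def classify_failure(trace_summary: str) -> str:
    """Ask an LLM to pick the single best-fitting failure category for one trace."""
    prompt = (
        "You are analyzing an AI agent's failed task attempt.\n"
        f"Trace summary:\n{trace_summary}\n\n"
        f"Answer with exactly one of: {', '.join(CATEGORIES)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in CATEGORIES else "other"


# Aggregating these labels over many failed runs yields the frequency
# breakdowns described above.
```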
Agent Traces
Enabling rapid development and debugging while protecting benchmark integrity
Complete Agent Traces
We make available the full traces of agent evaluations, including every single model call as logged by W&B Weave.
Encrypted Distribution
All agent traces are encrypted to prevent benchmark contamination through automated scraping.
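The sketch below shows one way such encryption can work, using symmetric Fernet encryption from the `cryptography` package. It illustrates the idea only; it is not HAL's actual distribution scheme, and the example trace and key handling are hypothetical.

```python
from cryptography.fernet import Fernet

# In practice the key is shared out of band (e.g., on request),
# not published next to the traces.
key = Fernet.generate_key()
fernet = Fernet(key)

# A trace would normally be read from a JSON file; a short stand-in is used here.
trace = b'{"task_id": "example", "model_calls": ["..."]}'

encrypted = fernet.encrypt(trace)      # what gets distributed publicly
decrypted = fernet.decrypt(encrypted)  # what a researcher with the key recovers

assert decrypted == trace
# Automated scrapers that only see `encrypted` cannot ingest the traces as
# plain text, which limits benchmark contamination.
```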
Meet the Team
The people behind HAL
Core Team
Contributors
We're grateful to the network of contributors to HAL.
Want to Contribute?
HAL is an open-source project and we welcome contributions from the community.
Cite HAL
@misc{hal,
  title        = {HAL: A Holistic Agent Leaderboard for Centralized and Reproducible Agent Evaluation},
  author       = {Benedikt Stroebl and Sayash Kapoor and Arvind Narayanan},
  howpublished = {\url{https://github.com/princeton-pli/hal-harness/}},
  year         = {2025}
}
Funding
HAL is funded by the Princeton AI Lab and the Princeton Language and Intelligence Initiative.