HAL: Holistic Agent Leaderboard

The standardized, cost-aware, and third-party leaderboard for evaluating agents.

By the SAgE team at Princeton University

197 Agents · 9 Benchmarks

Performance Highlights

Top performing agents across different benchmarks

AssistantBench

Web Assistance

Top 3 performing agents

Browser-Use
o3 Medium (April 2025)
38.8%
$15.15
Browser-Use
GPT-5 Medium (August 2025)
35.2%
$41.69
Browser-Use
o4-mini Low (April 2025)
28.1%
$9.22
View Full Leaderboard

CORE-Bench Hard

Scientific Programming

Top 3 performing agents

CORE-Agent
Claude Opus 4.1 (August 2025)
51.1%
$412.42
CORE-Agent
Claude Opus 4.1 High (August 2025)
42.2%
$509.95
HAL Generalist Agent
Claude-3.7 Sonnet High (February 2025)
37.8%
$66.15
View Full Leaderboard

GAIA

Web Assistance

Top 3 performing agents

HAL Generalist Agent
Claude Opus 4 High (May 2025)
64.8%
$665.89
HAL Generalist Agent
Claude-3.7 Sonnet High (February 2025)
64.2%
$122.49
HF Open Deep Research
GPT-5 Medium (August 2025)
62.8%
$359.83
View Full Leaderboard

Online Mind2Web

Web Assistance

Top 3 performing agents

SeeAct
GPT-5 Medium (August 2025)
42.3%
$171.07
Browser-Use
Claude Sonnet 4 (May 2025)
40.0%
$1577.26
Browser-Use
Claude Sonnet 4 High (May 2025)
39.3%
$1609.92
View Full Leaderboard

SWE-bench Verified Mini

Software Engineering

Top 3 performing agents

SWE-Agent
Claude Opus 4.1 (August 2025)
54.0%
$1789.67
SWE-Agent
Claude Opus 4.1 High (August 2025)
54.0%
$1599.90
SWE-Agent
Claude-3.7 Sonnet High (February 2025)
54.0%
$388.88
View Full Leaderboard

Scicode

Scientific Programming

Top 3 performing agents

Scicode Tool Calling Agent
o3 Medium (April 2025)
9.2%
$111.11
Scicode Zero Shot Agent
o4-mini Low (April 2025)
9.2%
$1.74
Scicode Tool Calling Agent
Claude Opus 4.1 (August 2025)
7.7%
$625.13
View Full Leaderboard

ScienceAgentBench

Scientific Programming

Top 3 performing agents

SAB Self-Debug
o3 Medium (April 2025)
33.3%
$11.69
SAB Self-Debug
Claude-3.7 Sonnet High (February 2025)
30.4%
$11.74
SAB Self-Debug
GPT-5 Medium (August 2025)
30.4%
$18.26
View Full Leaderboard

TAU-bench Airline

Customer Service

Top 3 performing agents

TAU-bench Few Shot
Claude Opus 4 High (May 2025)
66.0%
$313.83
TAU-bench Few Shot
Claude Opus 4.1 High (August 2025)
62.0%
$298.58
TAU-bench Few Shot
Claude-3.7 Sonnet High (February 2025)
60.0%
$37.23
View Full Leaderboard

USACO

Programming

Top 3 performing agents

USACO Episodic + Semantic
GPT-5 Medium (August 2025)
69.7%
$64.13
USACO Episodic + Semantic
o4-mini High (April 2025)
58.0%
$44.04
USACO Episodic + Semantic
Claude Opus 4.1 High (August 2025)
51.5%
$267.72
View Full Leaderboard

Who is it for?

HAL serves four key user groups in the AI ecosystem

Downstream Users & Procurers

  • Discover useful but lesser-known benchmarks related to tasks you care about
  • Find out who is building strong agents on these benchmarks
  • Identify the state of the art for both cost and accuracy on these tasks

Benchmark Developers

  • Gain improved visibility for your benchmark
  • Incentivize agent developers to build agents for your benchmark
  • Enable cost-controlled evaluations by default, with no extra effort

Agent Developers

  • Easily reproduce existing agents and perform unbiased comparisons
  • Compete on a leaderboard in a straightforward way
  • Use the HAL harness for framework-agnostic agent evaluation

Safety Researchers

  • Understand agent capabilities on real-world safety threats, and the costs associated with them

Cost-Controlled Agent Evaluations

Understanding the cost-performance trade-off

Why looking at the Pareto frontier matters

  • An agent can be 100x more expensive than another while being only 1% more accurate
  • One-dimensional, accuracy-only leaderboards hide this difference from downstream developers
[Figure: illustrative cost ($, x-axis) vs. performance (%, y-axis) plot of Agents A, B, and C, with the cost-performance frontier highlighted]
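The frontier idea above can be made concrete with a few lines of code: an agent belongs on the cost-performance (Pareto) frontier unless some other agent is at least as accurate and strictly cheaper, or at least as cheap and strictly more accurate. The sketch below uses made-up numbers, not leaderboard data.

```python
def pareto_frontier(agents):
    """Return the agents not dominated on (cost, accuracy).

    An agent is dominated if another agent is no worse on both axes
    and strictly better on at least one. `agents` is a list of
    (name, cost, accuracy) tuples.
    """
    frontier = []
    for name, cost, acc in agents:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for other, c, a in agents
            if other != name
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda t: t[1])  # cheapest first

# Illustrative numbers only: a 100x more expensive agent that is
# 1 point more accurate still sits on the frontier, which is exactly
# the trade-off a single accuracy ranking cannot show.
agents = [
    ("Agent A", 10.0, 54.0),
    ("Agent B", 1000.0, 55.0),
    ("Agent C", 500.0, 40.0),  # dominated by Agent A: costlier and less accurate
]
print(pareto_frontier(agents))
```

Reporting the whole frontier, rather than a single winner, lets a procurer pick the cheapest agent that clears their accuracy bar.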

The HAL Evaluation Harness

A unified framework for reproducible agent evaluation

Standardized Evaluation

  • One-stop shop evaluation harness for all benchmarks and agents
  • Flexible execution environments for running parallel evaluations locally or in the cloud

Comprehensive Logging

  • Automatic logging of agent traces with W&B Weave
  • Detailed cost tracking of token usage with minimal edits to agent code

Developer Friendly

  • Easy agent integration that does not require a specific agent framework
  • Modular architecture that allows easy extension with new benchmarks and agents
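To illustrate what a framework-agnostic integration can look like, here is a minimal stub of an agent entry point the harness might invoke. The function name, signature, and task format below are assumptions for illustration only; consult the hal-harness repository for the actual interface.

```python
# Hypothetical sketch of a framework-agnostic agent entry point.
# Names and shapes are illustrative assumptions, not the real HAL API.
def run(tasks: dict[str, dict], **kwargs) -> dict[str, str]:
    """Map each task id to the agent's answer string.

    `tasks` maps a task id to its inputs (here, a "prompt" field);
    the harness would handle sandboxing, parallel execution, trace
    logging, and cost tracking around this call.
    """
    answers = {}
    for task_id, task in tasks.items():
        prompt = task.get("prompt", "")
        # A real agent would call a model or tool loop here;
        # this stub just echoes the prompt back.
        answers[task_id] = f"stub answer for: {prompt}"
    return answers

print(run({"t1": {"prompt": "2+2?"}}))
```

Because the contract is just "tasks in, answers out," an agent built with any framework (or none) can plug in without rewriting its internals.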

Agent Traces

Enabling rapid development and debugging while protecting benchmark integrity

Complete Agent Traces

We release the full traces of agent evaluations, including every individual model call, as logged by W&B Weave.

Encrypted Distribution

All agent traces are encrypted to prevent benchmark contamination through automated scraping.

Meet the Team

The people behind HAL

Core Team

Sayash Kapoor

PhD Student, Princeton University
Senior Fellow, Mozilla

Website →

Benedikt Stroebl

PhD Student, Princeton University

Website →

Peter Kirgis

HAL Team

Franck Stéphane Ndzomga

HAL Team

Kangheng Liu

HAL Team

Arvind Narayanan

Professor of Computer Science, Princeton University

Website →

Contributors

We're grateful to the network of contributors to HAL:

Amit Arora
Amazon
Aymeric Roucher
Hugging Face
Ayush Thakur
Weights & Biases
Boyi Wei
Princeton
Daniel Kang
UIUC
Hailey Schoelkopf
Anthropic
Harsh Trivedi
Stony Brook
Huan Sun
OSU
Iason Gabriel
Google DeepMind
Jelena Luketina
UK AISI
JJ Allaire
UK AISI
Laura Weidinger
Google DeepMind
Madhur Prashant
Amazon
Marius Hobbhahn
Apollo Research
Maximillian Kaufmann
UK AISI
Morgan McGuire
Weights & Biases
Nitya Nadgir
Brookings
Omar Khattab
MIT
Parth Asawa
UC Berkeley
Percy Liang
Stanford
Rishi Bommasani
Stanford
Shreya Shankar
UC Berkeley
Shayne Longpre
MIT
Tianci Xue
OSU
Veniamin Veselovsky
Princeton
William Isaac
Google DeepMind
Yifan Mai
Stanford
Yu Su
OSU
Zachary Siegel
Princeton
Ziru (Ron) Chen
OSU

Want to Contribute?

HAL is an open-source project and we welcome contributions from the community.

Cite HAL

@misc{hal,
  title        = {HAL: A Holistic Agent Leaderboard for Centralized and Reproducible Agent Evaluation},
  author       = {Sayash Kapoor and Benedikt Stroebl and Peter Kirgis and Franck Stéphane Ndzomga and Kangheng Liu and Arvind Narayanan},
  howpublished = {\url{https://github.com/princeton-pli/hal-harness/}},
  year         = {2025}
}

Funding

HAL is funded by Open Philanthropy, Schmidt Sciences, the Princeton AI Lab and the Princeton Language and Intelligence Initiative. We are grateful to OpenAI for providing API credits to evaluate their models.