Insights

Latest updates and insights from the HAL community on Twitter

CORE-Bench is solved (using Opus 4.5 with Claude Code)

TL;DR: Last week, we released results for Opus 4.5 on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks. Earlier this week, Nicholas Carlini reached out to share that an updated scaffold that uses… pic.twitter.com/vF2y4Fbe40
— Sayash Kapoor (@sayashk) December 3, 2025

We spent the last year evaluating agents for HAL.

My biggest learning: We live in the Windows 95 era of agent evaluation. pic.twitter.com/DeIzWm1f0c
— Sayash Kapoor (@sayashk) September 16, 2025

OpenAI claims hallucinations persist because evaluations reward guessing and that GPT-5 is better calibrated. Do results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3! pic.twitter.com/HxGgVLkIyN
— Peter Kirgis (@PKirgis) September 12, 2025

We have added ScienceAgentBench to HAL and evaluated it with leading models (GPT-5, o3, Opus 4.1).

o3 tops the leaderboard at a lower cost than GPT-5, Opus 4.1, and Sonnet 3.7 High. o4-mini Low is much cheaper than the crowd, but with similar accuracy.

Grateful to so many… https://t.co/Xw6iWhqDoe pic.twitter.com/1ZNXaLK7kz
— Sayash Kapoor (@sayashk) September 11, 2025

Can AI agents reliably navigate the web? Does the choice of agent scaffold affect web browsing ability? To answer these questions, we added Online Mind2Web, a web browsing benchmark, to the Holistic Agent Leaderboard (HAL).

We evaluated 9 models (including GPT-5 and Sonnet 4)… pic.twitter.com/jwS2iFG27E
— Sayash Kapoor (@sayashk) September 3, 2025

GPT-OSS underperforms even on benchmarks that require raw tool calling. For example, CORE-Bench requires agents to run bash commands to reproduce scientific papers.

DeepSeek V3 scores 18%.
GPT-OSS scores 11%. https://t.co/EVxxSqKFMe pic.twitter.com/tx8rTygUWw
— Sayash Kapoor (@sayashk) August 12, 2025

How does GPT-5 compare against Claude Opus 4.1 on agentic tasks?

Since their release, we have been evaluating these models on challenging science, web, service, and code tasks.

Headline result: While cost-effective, so far GPT-5 never tops agentic leaderboards. More evals 🧵 pic.twitter.com/KhVcXZN3fc
— Sayash Kapoor (@sayashk) August 8, 2025

Stay Connected

Follow @halevals

View Leaderboards