Insights
Latest updates and insights from the HAL community on Twitter
We spent the last year evaluating agents for HAL.
My biggest learning: We live in the Windows 95 era of agent evaluation. pic.twitter.com/DeIzWm1f0c
— Sayash Kapoor (@sayashk) September 16, 2025
OpenAI claims hallucinations persist because evaluations reward guessing and that GPT-5 is better calibrated. Do results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3! pic.twitter.com/HxGgVLkIyN
— Peter Kirgis (@PKirgis) September 12, 2025
We have added ScienceAgentBench to HAL and evaluated it with leading models (GPT-5, o3, Opus 4.1).
o3 tops the leaderboard at a lower cost than GPT-5, Opus 4.1, and Sonnet 3.7 High. o4-mini Low is much cheaper than the crowd, but with similar accuracy.
Grateful to so many… https://t.co/Xw6iWhqDoe pic.twitter.com/1ZNXaLK7kz
— Sayash Kapoor (@sayashk) September 11, 2025
Can AI agents reliably navigate the web? Does the choice of agent scaffold affect web browsing ability? To answer these questions, we added Online Mind2Web, a web browsing benchmark, to the Holistic Agent Leaderboard (HAL).
We evaluated 9 models (including GPT-5 and Sonnet 4)… pic.twitter.com/jwS2iFG27E
— Sayash Kapoor (@sayashk) September 3, 2025
GPT-OSS underperforms even on benchmarks that require raw tool calling. For example, CORE-Bench requires agents to run bash commands to reproduce scientific papers.
DeepSeek V3 scores 18%.
GPT-OSS scores 11%. https://t.co/EVxxSqKFMe pic.twitter.com/tx8rTygUWw
— Sayash Kapoor (@sayashk) August 12, 2025
How does GPT-5 compare against Claude Opus 4.1 on agentic tasks?
Since their release, we have been evaluating these models on challenging science, web, service, and code tasks.
Headline result: While cost-effective, so far GPT-5 never tops agentic leaderboards. More evals 🧵 pic.twitter.com/KhVcXZN3fc
— Sayash Kapoor (@sayashk) August 8, 2025
Stay Connected
Follow us on Twitter for the latest updates on agent evaluations, new benchmarks, and community insights.