HF Open Deep Research

Agent performance overview across all HAL benchmarks

Benchmarks: 1
Models Used: 13
Pareto Optimal Runs: 0

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. An agent is on the Pareto frontier when no other agent on that benchmark is both cheaper and more accurate; such agents represent the current state-of-the-art efficiency for their performance level.

Benchmark  Model                                   Accuracy  Cost      On the Pareto Frontier?
Gaia       GPT-5 Medium (August 2025)              62.80%    $359.83   No
Gaia       Claude Opus 4 (May 2025)                57.58%    $1686.07  No
Gaia       o4-mini High (April 2025)               55.76%    $184.87   No
Gaia       GPT-4.1 (April 2025)                    50.30%    $109.88   No
Gaia       o4-mini Low (April 2025)                47.88%    $80.80    No
Gaia       Claude-3.7 Sonnet (February 2025)       36.97%    $415.15   No
Gaia       Claude-3.7 Sonnet High (February 2025)  35.76%    $113.65   No
Gaia       o3 Medium (April 2025)                  32.73%    $136.39   No
Gaia       Claude Opus 4.1 (August 2025)           28.48%    $1306.85  No
Gaia       DeepSeek V3                             28.48%    $13.19    No
Gaia       Claude Opus 4.1 High (August 2025)      25.45%    $1473.64  No
Gaia       DeepSeek R1                             24.85%    $11.10    No
Gaia       Gemini 2.0 Flash                        19.39%    $18.82    No
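
The Pareto-frontier criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's own implementation: the `Run` type and the `dominates` and `pareto_frontier` helpers are hypothetical names, and the input is only four rows copied from the table. The page's own column presumably compares each run against every agent on the benchmark, which would explain why it reports "No" for runs that are undominated within this small subset.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent, higher is better
    cost: float      # USD, lower is better


def dominates(a: Run, b: Run) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep the runs that no other run dominates (input order is preserved)."""
    return [r for r in runs if not any(dominates(other, r) for other in runs)]


# Four rows taken from the table; other agents on the benchmark are not included.
runs = [
    Run("GPT-5 Medium", 62.80, 359.83),
    Run("Claude Opus 4", 57.58, 1686.07),
    Run("o4-mini High", 55.76, 184.87),
    Run("DeepSeek R1", 24.85, 11.10),
]

frontier_models = [r.model for r in pareto_frontier(runs)]
print(frontier_models)
```

Within this subset, Claude Opus 4 drops out because GPT-5 Medium is both more accurate and cheaper; the other three runs each trade accuracy for cost and so remain undominated.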