HF Open Deep Research
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 13
Pareto Optimal Runs: 0
Models Used
GPT-5 Medium (August 2025)
Claude Opus 4 (May 2025)
o4-mini High (April 2025)
GPT-4.1 (April 2025)
o4-mini Low (April 2025)
Claude-3.7 Sonnet (February 2025)
Claude-3.7 Sonnet High (February 2025)
o3 Medium (April 2025)
Claude Opus 4.1 (August 2025)
DeepSeek V3
Claude Opus 4.1 High (August 2025)
DeepSeek R1
Gemini 2.0 Flash
Benchmark Performance
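The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, meaning no other run is at least as accurate and at least as cheap while being strictly better on one of the two. Agents on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.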
Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
Gaia | GPT-5 Medium (August 2025) | 62.80% | $359.83 | No |
Gaia | Claude Opus 4 (May 2025) | 57.58% | $1686.07 | No |
Gaia | o4-mini High (April 2025) | 55.76% | $184.87 | No |
Gaia | GPT-4.1 (April 2025) | 50.30% | $109.88 | No |
Gaia | o4-mini Low (April 2025) | 47.88% | $80.80 | No |
Gaia | Claude-3.7 Sonnet (February 2025) | 36.97% | $415.15 | No |
Gaia | Claude-3.7 Sonnet High (February 2025) | 35.76% | $113.65 | No |
Gaia | o3 Medium (April 2025) | 32.73% | $136.39 | No |
Gaia | Claude Opus 4.1 (August 2025) | 28.48% | $1306.85 | No |
Gaia | DeepSeek V3 | 28.48% | $13.19 | No |
Gaia | Claude Opus 4.1 High (August 2025) | 25.45% | $1473.64 | No |
Gaia | DeepSeek R1 | 24.85% | $11.10 | No |
Gaia | Gemini 2.0 Flash | 19.39% | $18.82 | No |
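
The Pareto-frontier check described above the table can be illustrated with a minimal sketch. It assumes the usual dominance rule (a run is dominated if another run is at least as accurate and at least as cheap, and strictly better on one of the two); the names `Run`, `dominates`, and `pareto_frontier` are hypothetical and not taken from HAL's code. The sample uses five rows from the table. Because it compares only this agent's own runs, some of them come out as non-dominated; the "No" flags in the table presumably reflect a comparison against every agent on the benchmark, not just these rows.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent
    cost: float      # USD


def dominates(a: Run, b: Run) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep every run that no other run dominates (O(n^2), fine for small tables)."""
    return [r for r in runs if not any(dominates(other, r) for other in runs)]


# Five rows copied from the table; the comparison set is an assumption
# (this agent's runs only, not the full benchmark leaderboard).
runs = [
    Run("GPT-5 Medium", 62.80, 359.83),
    Run("Claude Opus 4", 57.58, 1686.07),
    Run("GPT-4.1", 50.30, 109.88),
    Run("DeepSeek V3", 28.48, 13.19),
    Run("Gemini 2.0 Flash", 19.39, 18.82),
]

frontier_models = sorted(r.model for r in pareto_frontier(runs))
print(frontier_models)
```

Within this subset, Claude Opus 4 is dominated by GPT-5 Medium (higher accuracy at lower cost) and Gemini 2.0 Flash by DeepSeek V3, so only the remaining three runs are kept.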