CORE-Agent
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 20
Pareto Optimal Runs: 3
Models Used
Claude Opus 4.1 (August 2025)
Claude Sonnet 4.5 High (September 2025)
Claude Opus 4.1 High (August 2025)
Claude Sonnet 4.5 (September 2025)
Claude-3.7 Sonnet (February 2025)
Claude Sonnet 4 High (May 2025)
GPT-4.1 (April 2025)
Claude Sonnet 4 (May 2025)
GPT-5 Medium (August 2025)
o4-mini High (April 2025)
Claude-3.7 Sonnet High (February 2025)
o3 Medium (April 2025)
Gemini 2.5 Pro Preview (March 2025)
DeepSeek V3.1 (August 2025)
DeepSeek V3 (March 2025)
o4-mini Low (April 2025)
GPT-OSS-120B (August 2025)
GPT-OSS-120B High (August 2025)
Gemini 2.0 Flash (February 2025)
DeepSeek R1 (January 2025)
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this agent, run with the listed model, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the current state of the art in efficiency at their performance level.
| Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| Corebench Hard | Claude Opus 4.1 (August 2025) | 51.11% | $412.42 | Yes |
| Corebench Hard | Claude Sonnet 4.5 High (September 2025) | 44.44% | $92.34 | Yes |
| Corebench Hard | Claude Opus 4.1 High (August 2025) | 42.22% | $509.95 | No |
| Corebench Hard | Claude Sonnet 4.5 (September 2025) | 37.78% | $97.15 | No |
| Corebench Hard | Claude-3.7 Sonnet (February 2025) | 35.56% | $73.04 | No |
| Corebench Hard | Claude Sonnet 4 High (May 2025) | 33.33% | $100.48 | No |
| Corebench Hard | GPT-4.1 (April 2025) | 33.33% | $107.36 | No |
| Corebench Hard | Claude Sonnet 4 (May 2025) | 28.89% | $50.27 | No |
| Corebench Hard | GPT-5 Medium (August 2025) | 26.67% | $31.76 | No |
| Corebench Hard | o4-mini High (April 2025) | 26.67% | $61.35 | No |
| Corebench Hard | Claude-3.7 Sonnet High (February 2025) | 24.44% | $72.47 | No |
| Corebench Hard | o3 Medium (April 2025) | 24.44% | $120.47 | No |
| Corebench Hard | Gemini 2.5 Pro Preview (March 2025) | 22.22% | $182.34 | No |
| Corebench Hard | DeepSeek V3.1 (August 2025) | 20.00% | $12.55 | Yes |
| Corebench Hard | DeepSeek V3 (March 2025) | 17.78% | $25.26 | No |
| Corebench Hard | o4-mini Low (April 2025) | 17.78% | $31.79 | No |
| Corebench Hard | GPT-OSS-120B (August 2025) | 11.11% | $4.21 | No |
| Corebench Hard | GPT-OSS-120B High (August 2025) | 11.11% | $4.21 | No |
| Corebench Hard | Gemini 2.0 Flash (February 2025) | 11.11% | $12.46 | No |
| Corebench Hard | DeepSeek R1 (January 2025) | 6.67% | $81.11 | No |
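
To make the accuracy/cost trade-off concrete, here is a minimal sketch of the textbook dominance test, applied to four rows copied from the table. The names `Run`, `dominates`, and `pareto_frontier` are hypothetical helpers introduced for illustration. The leaderboard's exact rule is not stated on this page and may be stricter than plain dominance, so this sketch should not be expected to reproduce the "On the Pareto Frontier?" column for the full table.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One (model, accuracy, cost) row; a hypothetical helper type."""
    model: str
    accuracy: float  # percent
    cost: float      # US dollars


def dominates(a: Run, b: Run) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep every run that no other run dominates (textbook definition, an assumption here)."""
    return [r for r in runs if not any(dominates(other, r) for other in runs)]


# Four rows copied from the table above; this is a subset, not the full leaderboard.
runs = [
    Run("Claude Opus 4.1", 51.11, 412.42),
    Run("Claude Sonnet 4.5 High", 44.44, 92.34),
    Run("Claude Opus 4.1 High", 42.22, 509.95),
    Run("Claude Sonnet 4.5", 37.78, 97.15),
]

frontier = pareto_frontier(runs)
print([r.model for r in frontier])
```

Within this subset, Claude Opus 4.1 High is dominated by Claude Opus 4.1 (higher accuracy at lower cost), and Claude Sonnet 4.5 is dominated by Claude Sonnet 4.5 High, so only the two dominating runs remain.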