CORE-Agent
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 24
Pareto Optimal Runs: 3
Models Used
Claude Opus 4.1 (August 2025)
Claude Sonnet 4.5 High (September 2025)
Claude Opus 4.5 High (November 2025)
Claude Opus 4.5 (November 2025)
Claude Opus 4.1 High (August 2025)
Gemini 3 Pro Preview High (November 2025)
Claude Sonnet 4.5 (September 2025)
Claude-3.7 Sonnet (February 2025)
Claude Sonnet 4 High (May 2025)
GPT-4.1 (April 2025)
Claude Sonnet 4 (May 2025)
GPT-5 Medium (August 2025)
o4-mini High (April 2025)
Claude-3.7 Sonnet High (February 2025)
o3 Medium (April 2025)
Gemini 2.5 Pro Preview (March 2025)
DeepSeek V3.1 (August 2025)
DeepSeek V3 (March 2025)
o4-mini Low (April 2025)
GPT-OSS-120B (August 2025)
GPT-OSS-120B High (August 2025)
Gemini 2.0 Flash (February 2025)
Claude Haiku 4.5 (October 2025)
DeepSeek R1 (January 2025)
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this agent, run with the listed model, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the current state of the art in cost efficiency for their accuracy level.
| Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| Corebench Hard | Claude Opus 4.1 (August 2025) | 51.11% | $412.42 | Yes |
| Corebench Hard | Claude Sonnet 4.5 High (September 2025) | 44.44% | $92.34 | Yes |
| Corebench Hard | Claude Opus 4.5 High (November 2025) | 42.22% | $152.66 | No |
| Corebench Hard | Claude Opus 4.5 (November 2025) | 42.22% | $168.99 | No |
| Corebench Hard | Claude Opus 4.1 High (August 2025) | 42.22% | $509.95 | No |
| Corebench Hard | Gemini 3 Pro Preview High (November 2025) | 40.00% | $86.60 | No |
| Corebench Hard | Claude Sonnet 4.5 (September 2025) | 37.78% | $97.15 | No |
| Corebench Hard | Claude-3.7 Sonnet (February 2025) | 35.56% | $73.04 | No |
| Corebench Hard | Claude Sonnet 4 High (May 2025) | 33.33% | $100.48 | No |
| Corebench Hard | GPT-4.1 (April 2025) | 33.33% | $107.36 | No |
| Corebench Hard | Claude Sonnet 4 (May 2025) | 28.89% | $50.27 | No |
| Corebench Hard | GPT-5 Medium (August 2025) | 26.67% | $31.76 | No |
| Corebench Hard | o4-mini High (April 2025) | 26.67% | $61.35 | No |
| Corebench Hard | Claude-3.7 Sonnet High (February 2025) | 24.44% | $72.47 | No |
| Corebench Hard | o3 Medium (April 2025) | 24.44% | $120.47 | No |
| Corebench Hard | Gemini 2.5 Pro Preview (March 2025) | 22.22% | $182.34 | No |
| Corebench Hard | DeepSeek V3.1 (August 2025) | 20.00% | $12.55 | Yes |
| Corebench Hard | DeepSeek V3 (March 2025) | 17.78% | $25.26 | No |
| Corebench Hard | o4-mini Low (April 2025) | 17.78% | $31.79 | No |
| Corebench Hard | GPT-OSS-120B (August 2025) | 11.11% | $4.21 | No |
| Corebench Hard | GPT-OSS-120B High (August 2025) | 11.11% | $4.21 | No |
| Corebench Hard | Gemini 2.0 Flash (February 2025) | 11.11% | $12.46 | No |
| Corebench Hard | Claude Haiku 4.5 (October 2025) | 11.11% | $43.93 | No |
| Corebench Hard | DeepSeek R1 (January 2025) | 6.67% | $81.11 | No |
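
The idea of a Pareto-optimal accuracy/cost trade-off can be illustrated with a minimal sketch. It assumes the textbook dominance rule (a run is on the frontier if no other run is both cheaper and at least as accurate) and applies it to six rows copied from the table. The `Run` type and `pareto_frontier` function are hypothetical names introduced here for illustration; the leaderboard computes its own flag over all runs and may use a different rule, so this sketch is not a reproduction of that column.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent
    cost: float      # US dollars


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return runs not dominated on (higher accuracy, lower cost).

    Assumption: plain dominance. Exact duplicates of an already kept
    (cost, accuracy) pair are dropped.
    """
    frontier: list[Run] = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, try the
    # more accurate run first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        # Keep a run only if it beats every cheaper run on accuracy.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Six rows taken from the table (a subset, not the full leaderboard).
runs = [
    Run("Claude Opus 4.1", 51.11, 412.42),
    Run("Claude Sonnet 4.5 High", 44.44, 92.34),
    Run("Claude Opus 4.1 High", 42.22, 509.95),
    Run("DeepSeek V3.1", 20.00, 12.55),
    Run("DeepSeek V3", 17.78, 25.26),
    Run("DeepSeek R1", 6.67, 81.11),
]

frontier = pareto_frontier(runs)
for run in frontier:
    print(f"{run.model}: {run.accuracy:.2f}% at ${run.cost:.2f}")
```

On this subset the sketch keeps DeepSeek V3.1, Claude Sonnet 4.5 High, and Claude Opus 4.1. Sorting by cost first means each run only has to be compared against the best accuracy seen so far, rather than against every other run.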