CORE-Agent

Agent performance overview across all HAL benchmarks

Benchmarks: 1
Models Used: 20
Pareto Optimal Runs: 3

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this agent, paired with the listed model, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
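To make the idea of a Pareto-optimal trade-off concrete, here is a minimal sketch using plain dominance: a run is kept if no other run is at least as accurate and at least as cheap while being strictly better on one of the two. The page does not state the exact rule the leaderboard applies, so the `Run` type and the `dominates` and `pareto_frontier` helpers are illustrative assumptions, not the leaderboard's implementation. The data is a six-row subset of the table on this page. On the full table, plain dominance can flag a different set of runs than the column reports.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent
    cost: float      # US dollars


def dominates(a: Run, b: Run) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep the runs that no other run dominates (assumed rule, see lead-in)."""
    return [r for r in runs if not any(dominates(other, r) for other in runs)]


# Subset of the Corebench Hard rows from the table on this page.
RUNS = [
    Run("Claude Opus 4.1", 51.11, 412.42),
    Run("Claude Sonnet 4.5 High", 44.44, 92.34),
    Run("Claude Opus 4.1 High", 42.22, 509.95),
    Run("Claude Sonnet 4.5", 37.78, 97.15),
    Run("DeepSeek V3.1", 20.00, 12.55),
    Run("DeepSeek V3", 17.78, 25.26),
]

if __name__ == "__main__":
    for run in sorted(pareto_frontier(RUNS), key=lambda r: r.cost):
        print(f"{run.model}: {run.accuracy:.2f}% at ${run.cost:.2f}")
```

For this subset the sketch keeps Claude Opus 4.1, Claude Sonnet 4.5 High, and DeepSeek V3.1, which are the three runs marked "Yes" in the table.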

Benchmark       Model                                    Accuracy  Cost     On the Pareto Frontier?
Corebench Hard  Claude Opus 4.1 (August 2025)            51.11%    $412.42  Yes
Corebench Hard  Claude Sonnet 4.5 High (September 2025)  44.44%    $92.34   Yes
Corebench Hard  Claude Opus 4.1 High (August 2025)       42.22%    $509.95  No
Corebench Hard  Claude Sonnet 4.5 (September 2025)       37.78%    $97.15   No
Corebench Hard  Claude-3.7 Sonnet (February 2025)        35.56%    $73.04   No
Corebench Hard  Claude Sonnet 4 High (May 2025)          33.33%    $100.48  No
Corebench Hard  GPT-4.1 (April 2025)                     33.33%    $107.36  No
Corebench Hard  Claude Sonnet 4 (May 2025)               28.89%    $50.27   No
Corebench Hard  GPT-5 Medium (August 2025)               26.67%    $31.76   No
Corebench Hard  o4-mini High (April 2025)                26.67%    $61.35   No
Corebench Hard  Claude-3.7 Sonnet High (February 2025)   24.44%    $72.47   No
Corebench Hard  o3 Medium (April 2025)                   24.44%    $120.47  No
Corebench Hard  Gemini 2.5 Pro Preview (March 2025)      22.22%    $182.34  No
Corebench Hard  DeepSeek V3.1 (August 2025)              20.00%    $12.55   Yes
Corebench Hard  DeepSeek V3 (March 2025)                 17.78%    $25.26   No
Corebench Hard  o4-mini Low (April 2025)                 17.78%    $31.79   No
Corebench Hard  GPT-OSS-120B (August 2025)               11.11%    $4.21    No
Corebench Hard  GPT-OSS-120B High (August 2025)          11.11%    $4.21    No
Corebench Hard  Gemini 2.0 Flash (February 2025)         11.11%    $12.46   No
Corebench Hard  DeepSeek R1 (January 2025)               6.67%     $81.11   No