Claude Code
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 4
Pareto Optimal Runs: 1
Benchmark Performance
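The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, that is, whether no other run is at least as accurate and at least as cheap while being strictly better on one of the two. Agents on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.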
| Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| Corebench Hard | Claude Opus 4.5 | 77.78% | $87.16 | Yes |
| Corebench Hard | Claude Sonnet 4.5 (September 2025) | 62.22% | $68.33 | No |
| Corebench Hard | Claude Sonnet 4 (May 2025) | 46.67% | $65.58 | No |
| Corebench Hard | Claude Opus 4.1 | 42.22% | $331.79 | No |
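
The accuracy/cost trade-off described above can be checked with a short dominance test. The following is a minimal sketch, not the leaderboard's actual implementation: the `Run` type and the `pareto_frontier` function are hypothetical names introduced for illustration, and the input is limited to the four runs listed in the table.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent, higher is better
    cost: float      # USD, lower is better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that no other run dominates.

    A run is dominated when another run is at least as accurate and at
    least as cheap, and strictly better on one of the two.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return frontier


# Only the four Claude Code runs from the table above (assumption: no
# other agents' runs are included in this comparison).
runs = [
    Run("Claude Opus 4.5", 77.78, 87.16),
    Run("Claude Sonnet 4.5 (September 2025)", 62.22, 68.33),
    Run("Claude Sonnet 4 (May 2025)", 46.67, 65.58),
    Run("Claude Opus 4.1", 42.22, 331.79),
]
frontier_models = [r.model for r in pareto_frontier(runs)]
```

Restricted to these four runs, only Claude Opus 4.1 is dominated, since every other run is both cheaper and more accurate. The table marks just Claude Opus 4.5 as Pareto-optimal, presumably because the leaderboard compares against runs from all agents on the benchmark rather than only the ones listed on this page.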