Claude Code

Agent performance overview across all HAL benchmarks

Models Used

Claude Opus 4.5, Claude Sonnet 4.5 (September 2025), Claude Sonnet 4 (May 2025), Claude Opus 4.1

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark; broadly, that no other run on the benchmark delivers higher accuracy at lower cost. Agents on the Pareto frontier represent the most cost-efficient results currently available at their accuracy level.

Benchmark	Model	Accuracy	Cost	On the Pareto Frontier?
Corebench Hard	Claude Opus 4.5	77.78%	$87.16	Yes
Corebench Hard	Claude Sonnet 4.5 (September 2025)	62.22%	$68.33	No
Corebench Hard	Claude Sonnet 4 (May 2025)	46.67%	$65.58	No
Corebench Hard	Claude Opus 4.1	42.22%	$331.79	No
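
To make the Pareto idea concrete, here is a minimal sketch that applies the textbook dominance test to the four rows above. The `Run` type and `pareto_frontier` function are hypothetical names chosen for illustration, and the dominance rule is an assumption: this page does not state the exact procedure HAL uses, nor which other agents the comparison includes.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One benchmark run (hypothetical structure, for illustration only)."""
    model: str
    accuracy: float  # percent
    cost: float      # US dollars


def pareto_frontier(runs):
    """Return the runs that no other run dominates.

    Assumed rule: run `o` dominates run `r` if `o` is at least as accurate
    and at most as expensive as `r`, and strictly better on one of the two.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return frontier


# Rows copied from the Corebench Hard table above.
runs = [
    Run("Claude Opus 4.5", 77.78, 87.16),
    Run("Claude Sonnet 4.5 (September 2025)", 62.22, 68.33),
    Run("Claude Sonnet 4 (May 2025)", 46.67, 65.58),
    Run("Claude Opus 4.1", 42.22, 331.79),
]

frontier_models = [r.model for r in pareto_frontier(runs)]
print(frontier_models)
```

Restricted to these four rows, only Claude Opus 4.1 is dominated (Claude Opus 4.5 is both more accurate and cheaper), so the sketch reports three non-dominated runs. The table marks only Claude Opus 4.5 as on the frontier, which suggests the leaderboard's flag is evaluated against every agent on the benchmark rather than only the Claude Code runs listed here.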