SAB Self-Debug

Agent performance overview across all HAL benchmarks

Benchmarks: 1
Models Used: 12
Pareto Optimal Runs: 3

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the current state-of-the-art efficiency for their accuracy level: no other run on the benchmark is both cheaper and more accurate.

| Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| ScienceAgentBench | o3 Medium (April 2025) | 33.33% | $11.69 | Yes |
| ScienceAgentBench | Claude-3.7 Sonnet High (February 2025) | 30.39% | $11.74 | No |
| ScienceAgentBench | GPT-5 Medium (August 2025) | 30.39% | $18.26 | No |
| ScienceAgentBench | o4-mini Low (April 2025) | 27.45% | $3.95 | Yes |
| ScienceAgentBench | o4-mini High (April 2025) | 27.45% | $11.18 | No |
| ScienceAgentBench | Claude Opus 4.1 (August 2025) | 27.45% | $33.37 | No |
| ScienceAgentBench | Claude Opus 4.1 High (August 2025) | 26.47% | $33.75 | No |
| ScienceAgentBench | GPT-4.1 (April 2025) | 24.51% | $7.42 | No |
| ScienceAgentBench | DeepSeek R1 (January 2025) | 23.53% | $18.24 | No |
| ScienceAgentBench | Claude-3.7 Sonnet (February 2025) | 22.55% | $7.12 | No |
| ScienceAgentBench | DeepSeek V3 (March 2025) | 15.69% | $2.09 | No |
| ScienceAgentBench | Gemini 2.0 Flash (February 2025) | 12.75% | $0.19 | Yes |
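
For reference, Pareto dominance on an (accuracy, cost) plane is straightforward to compute. The sketch below is illustrative, not HAL's actual implementation; the `Run` type and `pareto_frontier` helper are hypothetical names, and the leaderboard's frontier is presumably computed over all agents' runs on a benchmark, so applying this to one agent's rows alone need not reproduce the column above.

```python
from typing import NamedTuple

class Run(NamedTuple):
    model: str
    accuracy: float  # percent correct on the benchmark
    cost: float      # total inference cost in USD

def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep runs that no other run dominates.

    A run is dominated when another run is at least as accurate and
    at most as expensive, and strictly better on one of the two axes.
    """
    return [
        r for r in runs
        if not any(
            o.accuracy >= r.accuracy and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in runs
        )
    ]

# Three rows from the table above, as an illustration.
runs = [
    Run("o3 Medium (April 2025)", 33.33, 11.69),
    Run("o4-mini Low (April 2025)", 27.45, 3.95),
    Run("o4-mini High (April 2025)", 27.45, 11.18),
]
for r in pareto_frontier(runs):
    print(r.model)  # o3 Medium and o4-mini Low; o4-mini High is dominated
```

On this subset the helper flags the same two runs the table marks "Yes": o4-mini High is dominated by o4-mini Low, which is equally accurate at lower cost.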