SAB Self-Debug
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 16
Pareto Optimal Runs: 4
Models Used
o3 Medium (April 2025)
Claude Sonnet 4.5 High (September 2025)
Claude-3.7 Sonnet High (February 2025)
GPT-5 Medium (August 2025)
Claude Sonnet 4.5 (September 2025)
o4-mini Low (April 2025)
o4-mini High (April 2025)
Claude Opus 4.1 (August 2025)
Claude Opus 4.1 High (August 2025)
GPT-4.1 (April 2025)
Claude Haiku 4.5 High (October 2025)
DeepSeek R1 (January 2025)
Claude-3.7 Sonnet (February 2025)
Claude Haiku 4.5 (October 2025)
DeepSeek V3 (March 2025)
Gemini 2.0 Flash (February 2025)
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this agent, paired with the listed model, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the current state-of-the-art cost efficiency for their accuracy level.
| Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| ScienceAgentBench | o3 Medium (April 2025) | 33.33% | $11.69 | Yes |
| ScienceAgentBench | Claude Sonnet 4.5 High (September 2025) | 30.39% | $7.47 | Yes |
| ScienceAgentBench | Claude-3.7 Sonnet High (February 2025) | 30.39% | $11.74 | No |
| ScienceAgentBench | GPT-5 Medium (August 2025) | 30.39% | $18.26 | No |
| ScienceAgentBench | Claude Sonnet 4.5 (September 2025) | 29.41% | $7.39 | No |
| ScienceAgentBench | o4-mini Low (April 2025) | 27.45% | $3.95 | Yes |
| ScienceAgentBench | o4-mini High (April 2025) | 27.45% | $11.18 | No |
| ScienceAgentBench | Claude Opus 4.1 (August 2025) | 27.45% | $33.37 | No |
| ScienceAgentBench | Claude Opus 4.1 High (August 2025) | 26.47% | $33.75 | No |
| ScienceAgentBench | GPT-4.1 (April 2025) | 24.51% | $7.42 | No |
| ScienceAgentBench | Claude Haiku 4.5 High (October 2025) | 23.53% | $3.41 | No |
| ScienceAgentBench | DeepSeek R1 (January 2025) | 23.53% | $18.24 | No |
| ScienceAgentBench | Claude-3.7 Sonnet (February 2025) | 22.55% | $7.12 | No |
| ScienceAgentBench | Claude Haiku 4.5 (October 2025) | 18.63% | $2.66 | No |
| ScienceAgentBench | DeepSeek V3 (March 2025) | 15.69% | $2.09 | No |
| ScienceAgentBench | Gemini 2.0 Flash (February 2025) | 12.75% | $0.19 | Yes |
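
The frontier flags can be checked against the accuracy and cost figures with a short script. The sketch below is illustrative only: the names `Run`, `RUNS`, `non_dominated`, and `upper_hull` are hypothetical, and it is not the leaderboard's own code. It computes two candidate frontiers: the plain non-dominated set (no other run is both at least as cheap and at least as accurate) and the subset that also lies on the upper convex hull in the (cost, accuracy) plane. On these 16 runs, the plain non-dominated set contains 8 runs, while the convex-hull variant yields exactly the 4 runs marked "Yes". That agreement is an inference from the numbers, not a confirmed description of how the flags are produced.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str       # model name as listed in the table (release dates omitted)
    accuracy: float  # percent
    cost: float      # USD


# Figures copied from the Benchmark Performance table.
RUNS = [
    Run("o3 Medium", 33.33, 11.69),
    Run("Claude Sonnet 4.5 High", 30.39, 7.47),
    Run("Claude-3.7 Sonnet High", 30.39, 11.74),
    Run("GPT-5 Medium", 30.39, 18.26),
    Run("Claude Sonnet 4.5", 29.41, 7.39),
    Run("o4-mini Low", 27.45, 3.95),
    Run("o4-mini High", 27.45, 11.18),
    Run("Claude Opus 4.1", 27.45, 33.37),
    Run("Claude Opus 4.1 High", 26.47, 33.75),
    Run("GPT-4.1", 24.51, 7.42),
    Run("Claude Haiku 4.5 High", 23.53, 3.41),
    Run("DeepSeek R1", 23.53, 18.24),
    Run("Claude-3.7 Sonnet", 22.55, 7.12),
    Run("Claude Haiku 4.5", 18.63, 2.66),
    Run("DeepSeek V3", 15.69, 2.09),
    Run("Gemini 2.0 Flash", 12.75, 0.19),
]


def non_dominated(runs):
    """Runs that no other run beats on cost and accuracy simultaneously."""
    keep = []
    for r in runs:
        dominated = any(
            o.cost <= r.cost
            and o.accuracy >= r.accuracy
            and (o.cost < r.cost or o.accuracy > r.accuracy)
            for o in runs
        )
        if not dominated:
            keep.append(r)
    return sorted(keep, key=lambda r: r.cost)


def _cross(a, b, c):
    """Cross product of (b - a) and (c - a) in the (cost, accuracy) plane."""
    return (b.cost - a.cost) * (c.accuracy - a.accuracy) - (
        b.accuracy - a.accuracy
    ) * (c.cost - a.cost)


def upper_hull(runs):
    """Non-dominated runs that also lie on the upper convex hull."""
    hull = []
    for r in non_dominated(runs):
        # Drop the previous point while it sits on or below the line
        # joining its predecessor to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], r) >= 0:
            hull.pop()
        hull.append(r)
    return hull


if __name__ == "__main__":
    print("Non-dominated:", [r.model for r in non_dominated(RUNS)])
    print("Upper hull:   ", [r.model for r in upper_hull(RUNS)])
```

Under the convex-hull reading, a run such as DeepSeek V3 (15.69%, $2.09) is excluded even though nothing cheaper is more accurate, because it falls below the straight line between Gemini 2.0 Flash and o4-mini Low.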