Scicode Tool Calling Agent
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 12
Pareto Optimal Runs: 0
Models Used
o3 Medium (April 2025)
Claude Opus 4.1 (August 2025)
Claude Opus 4.1 High (August 2025)
GPT-5 Medium (August 2025)
o4-mini Low (April 2025)
o4-mini High (April 2025)
Claude-3.7 Sonnet High (February 2025)
Claude-3.7 Sonnet (February 2025)
Gemini 2.0 Flash (February 2025)
GPT-4.1 (April 2025)
DeepSeek V3 (March 2025)
DeepSeek R1 (May 2025)
Benchmark Performance
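The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, meaning that no other run was at least as accurate and at least as cheap while being strictly better on one of the two. Agents on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.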
Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
Scicode | o3 Medium (April 2025) | 9.23% | $111.11 | No |
Scicode | Claude Opus 4.1 (August 2025) | 7.69% | $625.13 | No |
Scicode | Claude Opus 4.1 High (August 2025) | 6.92% | $550.54 | No |
Scicode | GPT-5 Medium (August 2025) | 6.15% | $193.52 | No |
Scicode | o4-mini Low (April 2025) | 4.62% | $46.30 | No |
Scicode | o4-mini High (April 2025) | 4.62% | $66.20 | No |
Scicode | Claude-3.7 Sonnet High (February 2025) | 4.62% | $204.37 | No |
Scicode | Claude-3.7 Sonnet (February 2025) | 3.08% | $191.41 | No |
Scicode | Gemini 2.0 Flash (February 2025) | 1.54% | $5.23 | No |
Scicode | GPT-4.1 (April 2025) | 1.54% | $69.39 | No |
Scicode | DeepSeek V3 (March 2025) | 0.00% | $52.11 | No |
Scicode | DeepSeek R1 (May 2025) | 0.00% | $57.62 | No |
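
The Pareto-frontier idea described above can be illustrated with a minimal sketch. It assumes the usual definition of dominance (another run is at least as accurate and at least as cheap, and strictly better on one of the two); the `Run` type and `pareto_frontier` function are hypothetical names introduced here, not part of the leaderboard's own code. The sketch compares only the twelve runs of this agent, whereas the table's column is presumably computed against every agent evaluated on the benchmark, which would explain why all rows read "No" even though some runs are non-dominated within this agent.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One evaluation run (hypothetical container for a table row)."""
    model: str
    accuracy: float  # percent
    cost: float      # USD


def pareto_frontier(runs):
    """Return the runs that no other run dominates, sorted by cost.

    A run is dominated when some other run is at least as accurate and
    at least as cheap, and strictly better on accuracy or on cost.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost)


# Accuracy and cost values copied from the table above.
runs = [
    Run("o3 Medium", 9.23, 111.11),
    Run("Claude Opus 4.1", 7.69, 625.13),
    Run("Claude Opus 4.1 High", 6.92, 550.54),
    Run("GPT-5 Medium", 6.15, 193.52),
    Run("o4-mini Low", 4.62, 46.30),
    Run("o4-mini High", 4.62, 66.20),
    Run("Claude-3.7 Sonnet High", 4.62, 204.37),
    Run("Claude-3.7 Sonnet", 3.08, 191.41),
    Run("Gemini 2.0 Flash", 1.54, 5.23),
    Run("GPT-4.1", 1.54, 69.39),
    Run("DeepSeek V3", 0.00, 52.11),
    Run("DeepSeek R1", 0.00, 57.62),
]

for run in pareto_frontier(runs):
    print(f"{run.model}: {run.accuracy:.2f}% at ${run.cost:.2f}")
```

Restricted to this agent's own runs, the non-dominated set is Gemini 2.0 Flash, o4-mini Low, and o3 Medium. Adding runs from other agents to `runs` can only shrink that set, never grow it, which is consistent with none of these runs appearing on the benchmark-wide frontier.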