Scicode Tool Calling Agent

Agent performance overview across all HAL benchmarks

Benchmarks: 1
Models Used: 12
Pareto Optimal Runs: 0

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Agents on the Pareto frontier offer the best available cost efficiency at their accuracy level.
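As an illustration of the Pareto-optimality idea, here is a minimal Python sketch. It assumes the usual dominance rule: a run is dominated when another run is at least as accurate and at least as cheap, and strictly better on one of the two. The leaderboard's exact rule is not stated on this page, so the `Run` type and the `pareto_frontier` helper are hypothetical names used only for this example. The data is a subset of this agent's own Scicode runs from the table on this page.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent of tasks solved
    cost: float      # total cost in USD


def pareto_frontier(runs):
    """Return the runs that no other run dominates, sorted by cost.

    Assumed rule: run `o` dominates run `r` when `o` is at least as
    accurate and at least as cheap, and strictly better on one axis.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost)


# Subset of this agent's runs, copied from the table on this page.
runs = [
    Run("o3 Medium", 9.23, 111.11),
    Run("Claude Opus 4.1", 7.69, 625.13),
    Run("GPT-5 Medium", 6.15, 193.52),
    Run("o4-mini Low", 4.62, 46.30),
    Run("o4-mini High", 4.62, 66.20),
    Run("Gemini 2.0 Flash", 1.54, 5.23),
    Run("GPT-4.1", 1.54, 69.39),
    Run("DeepSeek V3", 0.00, 52.11),
]

frontier = pareto_frontier(runs)
for r in frontier:
    print(f"{r.model}: {r.accuracy:.2f}% at ${r.cost:.2f}")
```

Restricted to these runs alone, some of them are undominated. The page nevertheless reports "No" for every run, which suggests the leaderboard compares each run against all agents evaluated on the benchmark rather than only against this agent's other runs.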

Benchmark  Model                                   Accuracy  Cost     On the Pareto Frontier?
Scicode    o3 Medium (April 2025)                  9.23%     $111.11  No
Scicode    Claude Opus 4.1 (August 2025)           7.69%     $625.13  No
Scicode    Claude Opus 4.1 High (August 2025)      6.92%     $550.54  No
Scicode    GPT-5 Medium (August 2025)              6.15%     $193.52  No
Scicode    o4-mini Low (April 2025)                4.62%     $46.30   No
Scicode    o4-mini High (April 2025)               4.62%     $66.20   No
Scicode    Claude-3.7 Sonnet High (February 2025)  4.62%     $204.37  No
Scicode    Claude-3.7 Sonnet (February 2025)       3.08%     $191.41  No
Scicode    Gemini 2.0 Flash (February 2025)        1.54%     $5.23    No
Scicode    GPT-4.1 (April 2025)                    1.54%     $69.39   No
Scicode    DeepSeek V3 (March 2025)                0.00%     $52.11   No
Scicode    DeepSeek R1 (May 2025)                  0.00%     $57.62   No