Scicode Zero Shot Agent
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 9
Pareto Optimal Runs: 2
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this agent, run with the given model, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
| Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| Scicode | o4-mini Low (April 2025) | 9.23% | $1.74 | Yes |
| Scicode | GPT-4.1 (April 2025) | 6.15% | $2.82 | No |
| Scicode | o4-mini High (April 2025) | 6.15% | $5.37 | No |
| Scicode | o3 Medium (April 2025) | 4.62% | $6.03 | No |
| Scicode | DeepSeek V3 (March 2025) | 3.08% | $0.79 | No |
| Scicode | Claude-3.7 Sonnet High (February 2025) | 3.08% | $4.99 | No |
| Scicode | Gemini 2.0 Flash (February 2025) | 1.54% | $0.12 | Yes |
| Scicode | DeepSeek R1 (May 2025) | 0.00% | $2.19 | No |
| Scicode | Claude-3.7 Sonnet (February 2025) | 0.00% | $5.10 | No |
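
The Pareto column can be reproduced from the accuracy and cost figures alone. The sketch below is one possible reading, not the leaderboard's confirmed method: it assumes the frontier is the upper convex hull of the (cost, accuracy) points. That assumption is motivated by the table itself, where DeepSeek V3 is marked "No" even though no single run is both cheaper and more accurate. The names `Run`, `RUNS`, `cross`, and `pareto_frontier` are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent, as listed in the table
    cost: float      # US dollars, as listed in the table


# Values copied from the table above.
RUNS = [
    Run("o4-mini Low (April 2025)", 9.23, 1.74),
    Run("GPT-4.1 (April 2025)", 6.15, 2.82),
    Run("o4-mini High (April 2025)", 6.15, 5.37),
    Run("o3 Medium (April 2025)", 4.62, 6.03),
    Run("DeepSeek V3 (March 2025)", 3.08, 0.79),
    Run("Claude-3.7 Sonnet High (February 2025)", 3.08, 4.99),
    Run("Gemini 2.0 Flash (February 2025)", 1.54, 0.12),
    Run("DeepSeek R1 (May 2025)", 0.00, 2.19),
    Run("Claude-3.7 Sonnet (February 2025)", 0.00, 5.10),
]


def cross(o: Run, a: Run, b: Run) -> float:
    """2D cross product of (a - o) and (b - o) in (cost, accuracy) space."""
    return (a.cost - o.cost) * (b.accuracy - o.accuracy) - (
        a.accuracy - o.accuracy
    ) * (b.cost - o.cost)


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs on the upper convex hull, ordered by increasing cost.

    ASSUMPTION: the frontier is a convex hull, not plain dominance filtering.
    """
    # Step 1: keep only runs that beat every cheaper run on accuracy.
    ordered = sorted(runs, key=lambda r: (r.cost, -r.accuracy))
    candidates: List[Run] = []
    best_accuracy = float("-inf")
    for run in ordered:
        if run.accuracy > best_accuracy:
            candidates.append(run)
            best_accuracy = run.accuracy

    # Step 2: drop candidates lying on or below the line between neighbours.
    hull: List[Run] = []
    for run in candidates:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], run) >= 0:
            hull.pop()
        hull.append(run)
    return hull


if __name__ == "__main__":
    for run in pareto_frontier(RUNS):
        print(f"{run.model}: {run.accuracy:.2f}% at ${run.cost:.2f}")
```

Under this assumption the function returns Gemini 2.0 Flash and o4-mini Low, the two runs marked "Yes" in the table. Stopping after step 1 (plain dominance filtering) would also keep DeepSeek V3, which does not match the table.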