USACO Episodic + Semantic
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 12
Pareto Optimal Runs: 3
Models Used
GPT-5 Medium (August 2025)
o4-mini High (April 2025)
Claude Opus 4.1 High (August 2025)
Claude Opus 4.1 (August 2025)
o3 Medium (April 2025)
GPT-4.1 (April 2025)
DeepSeek V3
DeepSeek R1
o4-mini Low (April 2025)
Claude-3.7 Sonnet (February 2025)
Gemini 2.0 Flash
Claude-3.7 Sonnet High (February 2025)
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this agent, paired with the listed model, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
USACO | GPT-5 Medium (August 2025) | 69.71% | $64.13 | Yes |
USACO | o4-mini High (April 2025) | 57.98% | $44.04 | No |
USACO | Claude Opus 4.1 High (August 2025) | 51.47% | $267.72 | No |
USACO | Claude Opus 4.1 (August 2025) | 48.21% | $276.19 | No |
USACO | o3 Medium (April 2025) | 46.25% | $57.30 | No |
USACO | GPT-4.1 (April 2025) | 44.95% | $28.10 | No |
USACO | DeepSeek V3 | 39.09% | $2.78 | Yes |
USACO | DeepSeek R1 | 38.11% | $8.18 | No |
USACO | o4-mini Low (April 2025) | 30.94% | $21.14 | No |
USACO | Claude-3.7 Sonnet (February 2025) | 29.32% | $38.70 | No |
USACO | Gemini 2.0 Flash | 27.04% | $1.46 | Yes |
USACO | Claude-3.7 Sonnet High (February 2025) | 26.71% | $56.43 | No |
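
To make the accuracy/cost trade-off concrete, here is a minimal Python sketch that recomputes a frontier from the figures in the table. The page does not say how the frontier is computed, so everything here is an assumption: the names `Run`, `non_dominated`, and `convex_frontier` are hypothetical, and the convex-hull step is only an inference. It is used because plain dominance filtering leaves five runs, while keeping only the upper convex hull of those runs gives the three rows marked "Yes".

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent, copied from the table
    cost: float      # USD, copied from the table


RUNS = [
    Run("GPT-5 Medium", 69.71, 64.13),
    Run("o4-mini High", 57.98, 44.04),
    Run("Claude Opus 4.1 High", 51.47, 267.72),
    Run("Claude Opus 4.1", 48.21, 276.19),
    Run("o3 Medium", 46.25, 57.30),
    Run("GPT-4.1", 44.95, 28.10),
    Run("DeepSeek V3", 39.09, 2.78),
    Run("DeepSeek R1", 38.11, 8.18),
    Run("o4-mini Low", 30.94, 21.14),
    Run("Claude-3.7 Sonnet", 29.32, 38.70),
    Run("Gemini 2.0 Flash", 27.04, 1.46),
    Run("Claude-3.7 Sonnet High", 26.71, 56.43),
]


def dominates(a: Run, b: Run) -> bool:
    """True if a is at least as accurate and as cheap as b, and strictly better in one."""
    return (a.accuracy >= b.accuracy and a.cost <= b.cost
            and (a.accuracy > b.accuracy or a.cost < b.cost))


def non_dominated(runs: list[Run]) -> list[Run]:
    """Plain Pareto filter: keep runs that no other run dominates."""
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


def cross(o: Run, a: Run, b: Run) -> float:
    """Cross product of o->a and o->b in the (cost, accuracy) plane."""
    return ((a.cost - o.cost) * (b.accuracy - o.accuracy)
            - (a.accuracy - o.accuracy) * (b.cost - o.cost))


def convex_frontier(runs: list[Run]) -> list[Run]:
    """Assumed rule: upper convex hull of the non-dominated runs, ordered by cost."""
    hull: list[Run] = []
    for r in sorted(non_dominated(runs), key=lambda run: run.cost):
        # Drop the previous point if it lies on or below the segment to r.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], r) >= 0:
            hull.pop()
        hull.append(r)
    return hull


if __name__ == "__main__":
    for run in convex_frontier(RUNS):
        print(f"{run.model}: {run.accuracy}% at ${run.cost:.2f}")
```

Under plain dominance, o4-mini High and GPT-4.1 would also count as Pareto-optimal, since no run is both cheaper and more accurate than either. Both fall below the straight line from DeepSeek V3 to GPT-5 Medium, which is why the hull step removes them.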