HAL Generalist Agent
Agent performance overview across all HAL benchmarks
Benchmarks: 6
Models Used: 16
Pareto Runs: 9
Models Used
Claude-3.7 Sonnet High (February 2025)
Claude Opus 4.1 (August 2025)
o4-mini High (April 2025)
Claude Opus 4.1 High (August 2025)
Claude-3.7 Sonnet (February 2025)
GPT-4.1 (April 2025)
o3 Medium (April 2025)
o4-mini Low (April 2025)
GPT-5 Medium (August 2025)
DeepSeek R1
DeepSeek V3
GPT-OSS-120B High
GPT-OSS-120B
Gemini 2.0 Flash
Claude Opus 4 High (May 2025)
Claude Opus 4 (May 2025)
Benchmark Performance
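The "On the Pareto Frontier?" column indicates whether this agent's run achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, meaning no competing run is both at least as accurate and at least as cheap while being strictly better on one of the two. Runs on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.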
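The Pareto check described above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual implementation: the `Run` type and the `dominates` and `pareto_frontier` helpers are hypothetical names introduced here, and the sample data is this agent's Gaia rows from the table. The sketch assumes higher accuracy and lower cost are better, and that a run is dominated when another run is at least as good on both axes and strictly better on one.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One (model, accuracy, cost) row; a hypothetical type for illustration."""
    model: str
    accuracy: float  # percent, higher is better
    cost: float      # USD, lower is better


def dominates(a: Run, b: Run) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    return (
        a.accuracy >= b.accuracy
        and a.cost <= b.cost
        and (a.accuracy > b.accuracy or a.cost < b.cost)
    )


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that no other run in `runs` dominates (input order kept)."""
    return [r for r in runs if not any(dominates(other, r) for other in runs)]


# This agent's Gaia rows, copied from the table.
gaia_runs = [
    Run("Claude Opus 4 High (May 2025)", 64.85, 665.89),
    Run("Claude-3.7 Sonnet High (February 2025)", 64.24, 122.49),
    Run("o4-mini Low (April 2025)", 58.18, 73.26),
    Run("Claude-3.7 Sonnet (February 2025)", 56.36, 130.68),
    Run("o4-mini High (April 2025)", 54.55, 59.39),
    Run("GPT-4.1 (April 2025)", 49.70, 74.19),
    Run("DeepSeek V3", 36.36, 29.27),
    Run("Gemini 2.0 Flash", 32.73, 7.80),
    Run("Claude Opus 4 (May 2025)", 30.30, 272.76),
    Run("DeepSeek R1", 30.30, 73.19),
]

frontier = pareto_frontier(gaia_runs)
for run in frontier:
    print(f"{run.model}: {run.accuracy:.2f}% at ${run.cost:.2f}")
```

Restricted to this agent's own Gaia runs, the sketch also keeps DeepSeek V3, which the table marks "No". A likely explanation is that the leaderboard computes the frontier over a wider pool of runs on each benchmark (presumably including other agents), so a run that is undominated within this agent can still be dominated elsewhere.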
Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
Corebench Hard | Claude-3.7 Sonnet High (February 2025) | 37.78% | $66.15 | Yes |
Corebench Hard | Claude Opus 4.1 (August 2025) | 35.56% | $375.11 | No |
Corebench Hard | o4-mini High (April 2025) | 35.56% | $45.37 | Yes |
Corebench Hard | Claude Opus 4.1 High (August 2025) | 33.33% | $358.47 | No |
Corebench Hard | Claude-3.7 Sonnet (February 2025) | 31.11% | $56.64 | No |
Corebench Hard | GPT-4.1 (April 2025) | 22.22% | $58.32 | No |
Corebench Hard | o3 Medium (April 2025) | 22.22% | $441.70 | No |
Corebench Hard | o4-mini Low (April 2025) | 15.56% | $22.50 | No |
Corebench Hard | GPT-5 Medium (August 2025) | 11.11% | $29.75 | No |
Corebench Hard | DeepSeek R1 | 8.89% | $2.55 | No |
Corebench Hard | DeepSeek V3 | 8.89% | $0.76 | Yes |
Corebench Hard | GPT-OSS-120B High | 8.89% | $2.05 | No |
Corebench Hard | GPT-OSS-120B | 8.89% | $2.79 | No |
Corebench Hard | Gemini 2.0 Flash | 4.44% | $7.06 | No |
Gaia | Claude Opus 4 High (May 2025) | 64.85% | $665.89 | Yes |
Gaia | Claude-3.7 Sonnet High (February 2025) | 64.24% | $122.49 | Yes |
Gaia | o4-mini Low (April 2025) | 58.18% | $73.26 | Yes |
Gaia | Claude-3.7 Sonnet (February 2025) | 56.36% | $130.68 | No |
Gaia | o4-mini High (April 2025) | 54.55% | $59.39 | Yes |
Gaia | GPT-4.1 (April 2025) | 49.70% | $74.19 | No |
Gaia | DeepSeek V3 | 36.36% | $29.27 | No |
Gaia | Gemini 2.0 Flash | 32.73% | $7.80 | Yes |
Gaia | Claude Opus 4 (May 2025) | 30.30% | $272.76 | No |
Gaia | DeepSeek R1 | 30.30% | $73.19 | No |
Scienceagentbench | o4-mini High (April 2025) | 21.57% | $76.30 | No |
Scienceagentbench | o4-mini Low (April 2025) | 19.61% | $77.32 | No |
Scienceagentbench | Claude-3.7 Sonnet High (February 2025) | 17.65% | $48.28 | No |
Scienceagentbench | Claude-3.7 Sonnet (February 2025) | 10.78% | $41.22 | No |
Scienceagentbench | o3 Medium (April 2025) | 9.80% | $155.42 | No |
Scienceagentbench | GPT-4.1 (April 2025) | 6.86% | $68.95 | No |
Scienceagentbench | DeepSeek V3 | 0.98% | $55.73 | No |
Swebench Verified Mini | Claude Opus 4.1 High (August 2025) | 46.00% | $399.93 | No |
Swebench Verified Mini | Claude Opus 4.1 (August 2025) | 42.00% | $477.65 | No |
Swebench Verified Mini | Claude Opus 4 (May 2025) | 34.00% | $382.39 | No |
Swebench Verified Mini | Claude Opus 4 High (May 2025) | 30.00% | $403.42 | No |
Swebench Verified Mini | Claude-3.7 Sonnet (February 2025) | 26.00% | $117.43 | No |
Swebench Verified Mini | Claude-3.7 Sonnet High (February 2025) | 24.00% | $72.98 | No |
Swebench Verified Mini | GPT-5 Medium (August 2025) | 12.00% | $57.58 | No |
Swebench Verified Mini | DeepSeek V3 | 10.00% | $30.17 | No |
Swebench Verified Mini | o4-mini Low (April 2025) | 6.00% | $87.03 | No |
Swebench Verified Mini | DeepSeek R1 | 6.00% | $146.71 | No |
Swebench Verified Mini | GPT-4.1 (April 2025) | 2.00% | $51.80 | No |
Swebench Verified Mini | Gemini 2.0 Flash | 2.00% | $7.33 | No |
Swebench Verified Mini | o4-mini High (April 2025) | 2.00% | $32.02 | No |
Swebench Verified Mini | o3 Medium (April 2025) | 0.00% | $2928.55 | No |
Taubench Airline | Claude-3.7 Sonnet (February 2025) | 56.00% | $42.11 | No |
Taubench Airline | Claude Opus 4.1 (August 2025) | 54.00% | $180.49 | No |
Taubench Airline | Claude Opus 4 (May 2025) | 44.00% | $150.15 | No |
Taubench Airline | Claude Opus 4 High (May 2025) | 44.00% | $150.29 | No |
Taubench Airline | Claude-3.7 Sonnet High (February 2025) | 44.00% | $34.58 | No |
Taubench Airline | Claude Opus 4.1 High (August 2025) | 32.00% | $140.28 | No |
Taubench Airline | GPT-5 Medium (August 2025) | 30.00% | $52.78 | No |
Taubench Airline | Gemini 2.0 Flash | 22.00% | $2.00 | Yes |
Taubench Airline | o4-mini Low (April 2025) | 22.00% | $20.16 | No |
Taubench Airline | o3 Medium (April 2025) | 20.00% | $221.59 | No |
Taubench Airline | DeepSeek V3 | 18.00% | $10.73 | No |
Taubench Airline | o4-mini High (April 2025) | 18.00% | $20.57 | No |
Taubench Airline | GPT-4.1 (April 2025) | 16.00% | $17.85 | No |
Taubench Airline | DeepSeek R1 | 10.00% | $30.18 | No |
Usaco | GPT-4.1 (April 2025) | 25.41% | $197.33 | No |