HAL Generalist Agent
Agent performance overview across all HAL benchmarks
Benchmarks: 7
Models Used: 25
Pareto Optimal Runs: 7
Models Used
Claude-3.7 Sonnet High (February 2025)
o4-mini High (April 2025)
Gemini 3 Pro Preview High (November 2025)
Claude Opus 4.1 (August 2025)
Claude Sonnet 4.5 (September 2025)
Claude Opus 4.5 (November 2025)
Claude Opus 4.1 High (August 2025)
Claude-3.7 Sonnet (February 2025)
Claude Opus 4.5 High (November 2025)
Claude Sonnet 4.5 High (September 2025)
GPT-4.1 (April 2025)
o3 Medium (April 2025)
o4-mini Low (April 2025)
GPT-5 Medium (August 2025)
GPT-OSS-120B High (August 2025)
GPT-OSS-120B (August 2025)
DeepSeek V3 (March 2025)
DeepSeek R1 (May 2025)
DeepSeek R1 (January 2025)
Gemini 2.0 Flash (February 2025)
Gemini 2.5 Pro Preview (March 2025)
Claude Opus 4 High (May 2025)
Claude Haiku 4.5 (October 2025)
Claude Opus 4 (May 2025)
Claude Haiku 4.5 High (October 2025)
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this agent, with the given model, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark: no other run on the benchmark is both at least as accurate and cheaper. Runs on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
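The dominance check behind an accuracy/cost Pareto frontier can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's own implementation: the `Run` type and the `pareto_frontier` function are hypothetical names introduced here, and the sample rows are a subset of the Gaia rows from the table. The flags on this page may be computed against runs from other agents as well, so running this sketch on this page's rows alone will not necessarily reproduce every "Yes"/"No" value.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    """One benchmark run (hypothetical container for illustration)."""
    model: str
    accuracy: float  # percent
    cost: float      # US dollars


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs that no cheaper-or-equal-cost run matches or beats on accuracy."""
    # Scan from cheapest to most expensive; on equal cost, see the more
    # accurate run first so it is the one that gets kept.
    ordered = sorted(runs, key=lambda r: (r.cost, -r.accuracy))
    frontier: List[Run] = []
    best_accuracy = float("-inf")
    for run in ordered:
        # A run stays only if it is strictly more accurate than everything
        # that costs the same or less. Exact duplicates are collapsed to one.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# A subset of the Gaia rows from the table on this page.
gaia_runs = [
    Run("Claude Sonnet 4.5", 74.55, 178.20),
    Run("Claude Sonnet 4.5 High", 70.91, 179.86),
    Run("o4-mini Low", 58.18, 73.26),
    Run("o4-mini High", 54.55, 59.39),
    Run("GPT-4.1", 49.70, 74.19),
    Run("Gemini 2.0 Flash", 32.73, 7.80),
    Run("DeepSeek V3", 29.39, 17.40),
]

frontier_models = [run.model for run in pareto_frontier(gaia_runs)]
print(frontier_models)
```

On this subset the sketch keeps Gemini 2.0 Flash, o4-mini High, o4-mini Low, and Claude Sonnet 4.5, in order of increasing cost. Sorting once by cost and tracking the best accuracy seen so far avoids comparing every pair of runs.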
| Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| Corebench Hard | Claude-3.7 Sonnet High (February 2025) | 37.78% | $66.15 | No |
| Corebench Hard | o4-mini High (April 2025) | 35.56% | $45.37 | Yes |
| Corebench Hard | Gemini 3 Pro Preview High (November 2025) | 35.56% | $101.27 | No |
| Corebench Hard | Claude Opus 4.1 (August 2025) | 35.56% | $375.11 | No |
| Corebench Hard | Claude Sonnet 4.5 (September 2025) | 33.33% | $85.19 | No |
| Corebench Hard | Claude Opus 4.5 (November 2025) | 33.33% | $127.41 | No |
| Corebench Hard | Claude Opus 4.1 High (August 2025) | 33.33% | $358.47 | No |
| Corebench Hard | Claude-3.7 Sonnet (February 2025) | 31.11% | $56.64 | No |
| Corebench Hard | Claude Opus 4.5 High (November 2025) | 31.11% | $112.38 | No |
| Corebench Hard | Claude Sonnet 4.5 High (September 2025) | 28.89% | $87.77 | No |
| Corebench Hard | GPT-4.1 (April 2025) | 22.22% | $58.32 | No |
| Corebench Hard | o3 Medium (April 2025) | 22.22% | $88.34 | No |
| Corebench Hard | o4-mini Low (April 2025) | 15.56% | $22.50 | No |
| Corebench Hard | GPT-5 Medium (August 2025) | 11.11% | $29.75 | No |
| Corebench Hard | GPT-OSS-120B High (August 2025) | 8.89% | $2.05 | Yes |
| Corebench Hard | GPT-OSS-120B (August 2025) | 8.89% | $2.79 | No |
| Corebench Hard | DeepSeek V3 (March 2025) | 8.89% | $4.69 | No |
| Corebench Hard | DeepSeek R1 (May 2025) | 8.89% | $7.77 | No |
| Corebench Hard | DeepSeek R1 (January 2025) | 4.45% | $24.95 | No |
| Corebench Hard | Gemini 2.0 Flash (February 2025) | 4.44% | $7.06 | No |
| Corebench Hard | Gemini 2.5 Pro Preview (March 2025) | 4.44% | $30.38 | No |
| Gaia | Claude Sonnet 4.5 (September 2025) | 74.55% | $178.20 | Yes |
| Gaia | Claude Sonnet 4.5 High (September 2025) | 70.91% | $179.86 | No |
| Gaia | Claude Opus 4.1 High (August 2025) | 68.48% | $562.24 | No |
| Gaia | Claude Opus 4 High (May 2025) | 64.85% | $665.89 | No |
| Gaia | Claude-3.7 Sonnet High (February 2025) | 64.24% | $122.49 | No |
| Gaia | Claude Opus 4.1 (August 2025) | 64.24% | $641.86 | No |
| Gaia | GPT-5 Medium (August 2025) | 59.39% | $104.75 | No |
| Gaia | o4-mini Low (April 2025) | 58.18% | $73.26 | Yes |
| Gaia | Claude-3.7 Sonnet (February 2025) | 56.36% | $130.68 | No |
| Gaia | Claude Haiku 4.5 (October 2025) | 56.36% | $130.81 | No |
| Gaia | o4-mini High (April 2025) | 54.55% | $59.39 | Yes |
| Gaia | GPT-4.1 (April 2025) | 49.70% | $74.19 | No |
| Gaia | Gemini 2.0 Flash (February 2025) | 32.73% | $7.80 | Yes |
| Gaia | DeepSeek R1 (January 2025) | 30.30% | $73.19 | No |
| Gaia | Claude Opus 4 (May 2025) | 30.30% | $272.76 | No |
| Gaia | DeepSeek V3 (March 2025) | 29.39% | $17.40 | No |
| Gaia | o3 Medium (April 2025) | 28.48% | $2828.54 | No |
| Scicode | o4-mini Low (April 2025) | 6.15% | $165.90 | No |
| Scicode | Claude-3.7 Sonnet (February 2025) | 3.08% | $60.40 | No |
| Scicode | o3 Medium (April 2025) | 3.08% | $66.98 | No |
| Scicode | Claude-3.7 Sonnet High (February 2025) | 3.08% | $188.15 | No |
| Scicode | GPT-4.1 (April 2025) | 1.54% | $73.87 | No |
| Scicode | o4-mini High (April 2025) | 1.54% | $92.10 | No |
| Scicode | Gemini 2.0 Flash (February 2025) | 0.00% | $61.49 | No |
| Scicode | DeepSeek V3 (March 2025) | 0.00% | $219.36 | No |
| Scicode | DeepSeek R1 (January 2025) | 0.00% | $486.78 | No |
| Scienceagentbench | o4-mini High (April 2025) | 21.57% | $76.30 | No |
| Scienceagentbench | o4-mini Low (April 2025) | 19.61% | $77.32 | No |
| Scienceagentbench | Claude-3.7 Sonnet High (February 2025) | 17.65% | $48.28 | No |
| Scienceagentbench | Claude-3.7 Sonnet (February 2025) | 10.78% | $41.22 | No |
| Scienceagentbench | o3 Medium (April 2025) | 9.80% | $31.08 | No |
| Scienceagentbench | GPT-4.1 (April 2025) | 6.86% | $68.95 | No |
| Scienceagentbench | DeepSeek V3 (March 2025) | 0.98% | $55.73 | No |
| Swebench Verified Mini | Claude Opus 4.1 High (August 2025) | 46.00% | $399.93 | No |
| Swebench Verified Mini | Claude Haiku 4.5 High (October 2025) | 44.00% | $65.31 | Yes |
| Swebench Verified Mini | Claude Opus 4.1 (August 2025) | 42.00% | $477.65 | No |
| Swebench Verified Mini | Claude Sonnet 4.5 High (September 2025) | 40.00% | $95.97 | No |
| Swebench Verified Mini | Claude Sonnet 4.5 (September 2025) | 34.00% | $128.19 | No |
| Swebench Verified Mini | Claude Opus 4 (May 2025) | 34.00% | $382.39 | No |
| Swebench Verified Mini | Claude Opus 4 High (May 2025) | 30.00% | $403.42 | No |
| Swebench Verified Mini | Claude-3.7 Sonnet (February 2025) | 26.00% | $117.43 | No |
| Swebench Verified Mini | Claude-3.7 Sonnet High (February 2025) | 24.00% | $72.98 | No |
| Swebench Verified Mini | Claude Haiku 4.5 (October 2025) | 24.00% | $147.89 | No |
| Swebench Verified Mini | GPT-5 Medium (August 2025) | 12.00% | $57.58 | No |
| Swebench Verified Mini | DeepSeek V3 (March 2025) | 10.00% | $30.17 | No |
| Swebench Verified Mini | o4-mini Low (April 2025) | 6.00% | $87.03 | No |
| Swebench Verified Mini | DeepSeek R1 (January 2025) | 6.00% | $146.71 | No |
| Swebench Verified Mini | Gemini 2.0 Flash (February 2025) | 2.00% | $7.33 | No |
| Swebench Verified Mini | o4-mini High (April 2025) | 2.00% | $32.02 | No |
| Swebench Verified Mini | GPT-4.1 (April 2025) | 2.00% | $51.80 | No |
| Swebench Verified Mini | o3 Medium (April 2025) | 0.00% | $585.71 | No |
| Taubench Airline | Claude-3.7 Sonnet (February 2025) | 56.00% | $42.11 | No |
| Taubench Airline | Claude Opus 4.1 (August 2025) | 54.00% | $180.49 | No |
| Taubench Airline | Claude-3.7 Sonnet High (February 2025) | 44.00% | $34.58 | No |
| Taubench Airline | Claude Opus 4 (May 2025) | 44.00% | $150.15 | No |
| Taubench Airline | Claude Opus 4 High (May 2025) | 44.00% | $150.29 | No |
| Taubench Airline | Claude Opus 4.1 High (August 2025) | 32.00% | $140.28 | No |
| Taubench Airline | GPT-5 Medium (August 2025) | 30.00% | $52.78 | No |
| Taubench Airline | Gemini 2.0 Flash (February 2025) | 22.00% | $2.00 | No |
| Taubench Airline | o4-mini Low (April 2025) | 22.00% | $20.16 | No |
| Taubench Airline | o3 Medium (April 2025) | 20.00% | $45.03 | No |
| Taubench Airline | DeepSeek V3 (March 2025) | 18.00% | $10.73 | No |
| Taubench Airline | o4-mini High (April 2025) | 18.00% | $20.57 | No |
| Taubench Airline | GPT-4.1 (April 2025) | 16.00% | $17.85 | No |
| Taubench Airline | DeepSeek R1 (January 2025) | 10.00% | $30.18 | No |
| Usaco | GPT-4.1 (April 2025) | 25.41% | $197.33 | No |