HAL Generalist Agent

Agent performance overview across all HAL benchmarks

6
Benchmarks
16
Models Used
9
Pareto Runs

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, that is, whether no other run on the benchmark reaches higher accuracy at equal or lower cost. Agents on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
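As a rough illustration of the accuracy/cost trade-off described above, the following sketch computes a Pareto frontier over a list of runs. The `Run` type and `pareto_frontier` function are hypothetical names introduced here, not part of HAL. The data are this agent's Gaia rows from the table on this page. The sketch only compares the runs it is given, so it is not a reproduction of how the page derives its column.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    """One (model, accuracy, cost) result. Hypothetical helper type."""
    model: str
    accuracy: float  # percent
    cost: float      # USD


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs that are not dominated within `runs`.

    A run is dominated when another run is at least as accurate and at
    least as cheap, and strictly better on one of the two. Exact
    duplicates are reported only once.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, the more
    # accurate run comes first. A run survives only if it beats the
    # accuracy of every cheaper (or equally cheap) run seen so far.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# This agent's Gaia rows, copied from the table on this page.
gaia_runs = [
    Run("Claude Opus 4 High", 64.85, 665.89),
    Run("Claude-3.7 Sonnet High", 64.24, 122.49),
    Run("o4-mini Low", 58.18, 73.26),
    Run("Claude-3.7 Sonnet", 56.36, 130.68),
    Run("o4-mini High", 54.55, 59.39),
    Run("GPT-4.1", 49.70, 74.19),
    Run("DeepSeek V3", 36.36, 29.27),
    Run("Gemini 2.0 Flash", 32.73, 7.80),
    Run("Claude Opus 4", 30.30, 272.76),
    Run("DeepSeek R1", 30.30, 73.19),
]

frontier = pareto_frontier(gaia_runs)
for run in frontier:
    print(f"{run.model}: {run.accuracy:.2f}% at ${run.cost:.2f}")
```

On these ten rows the sketch keeps six runs. Five of them are the rows the page marks "Yes"; the sixth, DeepSeek V3, is marked "No" on the page. One plausible reading is that the page's frontier is computed over runs from all agents on the benchmark rather than this agent alone, but that is an inference, not something the page states.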

| Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| Corebench Hard | Claude-3.7 Sonnet High (February 2025) | 37.78% | $66.15 | Yes |
| Corebench Hard | Claude Opus 4.1 (August 2025) | 35.56% | $375.11 | No |
| Corebench Hard | o4-mini High (April 2025) | 35.56% | $45.37 | Yes |
| Corebench Hard | Claude Opus 4.1 High (August 2025) | 33.33% | $358.47 | No |
| Corebench Hard | Claude-3.7 Sonnet (February 2025) | 31.11% | $56.64 | No |
| Corebench Hard | GPT-4.1 (April 2025) | 22.22% | $58.32 | No |
| Corebench Hard | o3 Medium (April 2025) | 22.22% | $441.70 | No |
| Corebench Hard | o4-mini Low (April 2025) | 15.56% | $22.50 | No |
| Corebench Hard | GPT-5 Medium (August 2025) | 11.11% | $29.75 | No |
| Corebench Hard | DeepSeek R1 | 8.89% | $2.55 | No |
| Corebench Hard | DeepSeek V3 | 8.89% | $0.76 | Yes |
| Corebench Hard | GPT-OSS-120B High | 8.89% | $2.05 | No |
| Corebench Hard | GPT-OSS-120B | 8.89% | $2.79 | No |
| Corebench Hard | Gemini 2.0 Flash | 4.44% | $7.06 | No |
| Gaia | Claude Opus 4 High (May 2025) | 64.85% | $665.89 | Yes |
| Gaia | Claude-3.7 Sonnet High (February 2025) | 64.24% | $122.49 | Yes |
| Gaia | o4-mini Low (April 2025) | 58.18% | $73.26 | Yes |
| Gaia | Claude-3.7 Sonnet (February 2025) | 56.36% | $130.68 | No |
| Gaia | o4-mini High (April 2025) | 54.55% | $59.39 | Yes |
| Gaia | GPT-4.1 (April 2025) | 49.70% | $74.19 | No |
| Gaia | DeepSeek V3 | 36.36% | $29.27 | No |
| Gaia | Gemini 2.0 Flash | 32.73% | $7.80 | Yes |
| Gaia | Claude Opus 4 (May 2025) | 30.30% | $272.76 | No |
| Gaia | DeepSeek R1 | 30.30% | $73.19 | No |
| Scienceagentbench | o4-mini High (April 2025) | 21.57% | $76.30 | No |
| Scienceagentbench | o4-mini Low (April 2025) | 19.61% | $77.32 | No |
| Scienceagentbench | Claude-3.7 Sonnet High (February 2025) | 17.65% | $48.28 | No |
| Scienceagentbench | Claude-3.7 Sonnet (February 2025) | 10.78% | $41.22 | No |
| Scienceagentbench | o3 Medium (April 2025) | 9.80% | $155.42 | No |
| Scienceagentbench | GPT-4.1 (April 2025) | 6.86% | $68.95 | No |
| Scienceagentbench | DeepSeek V3 | 0.98% | $55.73 | No |
| Swebench Verified Mini | Claude Opus 4.1 High (August 2025) | 46.00% | $399.93 | No |
| Swebench Verified Mini | Claude Opus 4.1 (August 2025) | 42.00% | $477.65 | No |
| Swebench Verified Mini | Claude Opus 4 (May 2025) | 34.00% | $382.39 | No |
| Swebench Verified Mini | Claude Opus 4 High (May 2025) | 30.00% | $403.42 | No |
| Swebench Verified Mini | Claude-3.7 Sonnet (February 2025) | 26.00% | $117.43 | No |
| Swebench Verified Mini | Claude-3.7 Sonnet High (February 2025) | 24.00% | $72.98 | No |
| Swebench Verified Mini | GPT-5 Medium (August 2025) | 12.00% | $57.58 | No |
| Swebench Verified Mini | DeepSeek V3 | 10.00% | $30.17 | No |
| Swebench Verified Mini | o4-mini Low (April 2025) | 6.00% | $87.03 | No |
| Swebench Verified Mini | DeepSeek R1 | 6.00% | $146.71 | No |
| Swebench Verified Mini | GPT-4.1 (April 2025) | 2.00% | $51.80 | No |
| Swebench Verified Mini | Gemini 2.0 Flash | 2.00% | $7.33 | No |
| Swebench Verified Mini | o4-mini High (April 2025) | 2.00% | $32.02 | No |
| Swebench Verified Mini | o3 Medium (April 2025) | 0.00% | $2928.55 | No |
| Taubench Airline | Claude-3.7 Sonnet (February 2025) | 56.00% | $42.11 | No |
| Taubench Airline | Claude Opus 4.1 (August 2025) | 54.00% | $180.49 | No |
| Taubench Airline | Claude Opus 4 (May 2025) | 44.00% | $150.15 | No |
| Taubench Airline | Claude Opus 4 High (May 2025) | 44.00% | $150.29 | No |
| Taubench Airline | Claude-3.7 Sonnet High (February 2025) | 44.00% | $34.58 | No |
| Taubench Airline | Claude Opus 4.1 High (August 2025) | 32.00% | $140.28 | No |
| Taubench Airline | GPT-5 Medium (August 2025) | 30.00% | $52.78 | No |
| Taubench Airline | Gemini 2.0 Flash | 22.00% | $2.00 | Yes |
| Taubench Airline | o4-mini Low (April 2025) | 22.00% | $20.16 | No |
| Taubench Airline | o3 Medium (April 2025) | 20.00% | $221.59 | No |
| Taubench Airline | DeepSeek V3 | 18.00% | $10.73 | No |
| Taubench Airline | o4-mini High (April 2025) | 18.00% | $20.57 | No |
| Taubench Airline | GPT-4.1 (April 2025) | 16.00% | $17.85 | No |
| Taubench Airline | DeepSeek R1 | 10.00% | $30.18 | No |
| Usaco | GPT-4.1 (April 2025) | 25.41% | $197.33 | No |