SeeAct

Agent performance overview across all HAL benchmarks

Benchmarks: 1
Models Used: 10
Pareto Optimal Runs: 2

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the current state-of-the-art cost efficiency for their accuracy level.

| Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| Online Mind2Web | GPT-5 Medium (August 2025) | 42.33% | $171.07 | Yes |
| Online Mind2Web | o3 Medium (April 2025) | 39.00% | $258.74 | No |
| Online Mind2Web | Claude Sonnet 4 (May 2025) | 36.67% | $246.18 | No |
| Online Mind2Web | Claude Sonnet 4 High (May 2025) | 36.67% | $326.41 | No |
| Online Mind2Web | o4-mini High (April 2025) | 32.00% | $228.98 | No |
| Online Mind2Web | o4-mini Low (April 2025) | 31.67% | $162.36 | No |
| Online Mind2Web | Claude-3.7 Sonnet High (February 2025) | 30.33% | $367.51 | No |
| Online Mind2Web | GPT-4.1 (April 2025) | 30.33% | $271.24 | No |
| Online Mind2Web | Claude-3.7 Sonnet (February 2025) | 28.33% | $291.97 | No |
| Online Mind2Web | Gemini 2.0 Flash | 26.67% | $5.03 | Yes |
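
To make the Pareto column concrete, the following Python sketch recomputes it from the ten Online Mind2Web rows listed above. It is illustrative only: the names `Run`, `non_dominated`, and `hull_frontier` are hypothetical and not part of HAL. One observation from the data: a plain dominance check (no other run is both at least as accurate and at least as cheap) would also keep o4-mini Low, whereas the "Yes" flags in the table match the upper convex hull of the cost-accuracy points. The hull step is therefore an assumption inferred from these numbers, not a confirmed description of how the leaderboard computes its frontier.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent, as listed in the table
    cost: float      # USD, as listed in the table


RUNS = [
    Run("GPT-5 Medium (August 2025)", 42.33, 171.07),
    Run("o3 Medium (April 2025)", 39.00, 258.74),
    Run("Claude Sonnet 4 (May 2025)", 36.67, 246.18),
    Run("Claude Sonnet 4 High (May 2025)", 36.67, 326.41),
    Run("o4-mini High (April 2025)", 32.00, 228.98),
    Run("o4-mini Low (April 2025)", 31.67, 162.36),
    Run("Claude-3.7 Sonnet High (February 2025)", 30.33, 367.51),
    Run("GPT-4.1 (April 2025)", 30.33, 271.24),
    Run("Claude-3.7 Sonnet (February 2025)", 28.33, 291.97),
    Run("Gemini 2.0 Flash", 26.67, 5.03),
]


def non_dominated(runs):
    """Keep runs that no other run beats on accuracy and cost simultaneously."""
    kept = []
    for r in runs:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in runs
        )
        if not dominated:
            kept.append(r)
    return kept


def _cross(o, a, b):
    """2D cross product of vectors o->a and o->b in (cost, accuracy) space."""
    return (a.cost - o.cost) * (b.accuracy - o.accuracy) - (
        a.accuracy - o.accuracy
    ) * (b.cost - o.cost)


def hull_frontier(runs):
    """Upper convex hull of the non-dominated runs, ordered by increasing cost.

    Assumption: this hull step is inferred from the table's flags.
    """
    points = sorted(non_dominated(runs), key=lambda r: r.cost)
    hull = []
    for p in points:
        # Drop the middle point if it lies on or below the segment hull[-2] -> p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


if __name__ == "__main__":
    for run in hull_frontier(RUNS):
        print(f"{run.model}: {run.accuracy:.2f}% at ${run.cost:.2f}")
```

On this data, `non_dominated` keeps three runs (Gemini 2.0 Flash, o4-mini Low, GPT-5 Medium), while `hull_frontier` keeps only Gemini 2.0 Flash and GPT-5 Medium, which agrees with the two Pareto-optimal runs reported above.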