SeeAct
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 10
Pareto Optimal Runs: 2
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this agent, paired with the listed model, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
Online Mind2Web | GPT-5 Medium (August 2025) | 42.33% | $171.07 | Yes |
Online Mind2Web | o3 Medium (April 2025) | 39.00% | $258.74 | No |
Online Mind2Web | Claude Sonnet 4 (May 2025) | 36.67% | $246.18 | No |
Online Mind2Web | Claude Sonnet 4 High (May 2025) | 36.67% | $326.41 | No |
Online Mind2Web | o4-mini High (April 2025) | 32.00% | $228.98 | No |
Online Mind2Web | o4-mini Low (April 2025) | 31.67% | $162.36 | No |
Online Mind2Web | Claude-3.7 Sonnet High (February 2025) | 30.33% | $367.51 | No |
Online Mind2Web | GPT-4.1 (April 2025) | 30.33% | $271.24 | No |
Online Mind2Web | Claude-3.7 Sonnet (February 2025) | 28.33% | $291.97 | No |
Online Mind2Web | Gemini 2.0 Flash | 26.67% | $5.03 | Yes |
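
The Yes/No column can be cross-checked with a short sketch. The page does not state how the frontier is computed, so the code below tries two definitions and should be read as an assumption, not as HAL's implementation. Under plain dominance (no other run is at least as accurate and at least as cheap), o4-mini Low would also qualify, because it costs less than GPT-5 Medium. Only the stricter variant, which keeps runs on the upper convex hull of the cost-accuracy plane, reproduces the two runs marked "Yes" above. The names `Run`, `dominance_frontier`, and `hull_frontier` are illustrative.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent, higher is better
    cost: float      # US dollars, lower is better


# Values copied from the table above.
RUNS = [
    Run("GPT-5 Medium (August 2025)", 42.33, 171.07),
    Run("o3 Medium (April 2025)", 39.00, 258.74),
    Run("Claude Sonnet 4 (May 2025)", 36.67, 246.18),
    Run("Claude Sonnet 4 High (May 2025)", 36.67, 326.41),
    Run("o4-mini High (April 2025)", 32.00, 228.98),
    Run("o4-mini Low (April 2025)", 31.67, 162.36),
    Run("Claude-3.7 Sonnet High (February 2025)", 30.33, 367.51),
    Run("GPT-4.1 (April 2025)", 30.33, 271.24),
    Run("Claude-3.7 Sonnet (February 2025)", 28.33, 291.97),
    Run("Gemini 2.0 Flash", 26.67, 5.03),
]


def dominance_frontier(runs):
    """Runs that no other run matches or beats on both axes (one strictly)."""
    def dominated(r):
        return any(
            o.accuracy >= r.accuracy and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in runs
        )
    return [r for r in runs if not dominated(r)]


def _cross(o, a, b):
    """Z-component of (a - o) x (b - o) with cost on x and accuracy on y."""
    return ((a.cost - o.cost) * (b.accuracy - o.accuracy)
            - (a.accuracy - o.accuracy) * (b.cost - o.cost))


def hull_frontier(runs):
    """Assumed variant: non-dominated runs on the upper convex hull."""
    hull = []
    for p in sorted(dominance_frontier(runs), key=lambda r: r.cost):
        # Drop the previous point if it lies on or below the segment
        # joining its neighbours.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


if __name__ == "__main__":
    print("dominance:", [r.model for r in dominance_frontier(RUNS)])
    print("hull:     ", [r.model for r in hull_frontier(RUNS)])
```

The hull variant treats a run as inefficient when a mix of two other runs could, on average, deliver higher accuracy at the same cost, which is why o4-mini Low drops out despite being cheaper than GPT-5 Medium.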