o3 Medium (April 2025)
Performance overview across all HAL benchmarks
Benchmarks: 9
Agents: 11
Pareto Optimal Benchmarks: 2
Token Pricing
Input Tokens: $2 per 1M tokens
Output Tokens: $8 per 1M tokens
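As a rough illustration of how the per-token prices above translate into a dollar figure, the sketch below assumes simple linear pricing with no cached-token discounts or other adjustments. The function name `estimate_cost` and the token counts are hypothetical; this page does not state how the costs it reports were computed.

```python
# Prices listed on this page, in USD per 1M tokens.
INPUT_PRICE_PER_M = 2.0
OUTPUT_PRICE_PER_M = 8.0


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a run under plain linear pricing (an assumption)."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Hypothetical token counts, chosen only to show the arithmetic.
print(estimate_cost(500_000, 250_000))  # 0.5 * $2 + 0.25 * $8
```

Because output tokens cost four times as much as input tokens here, runs that generate long outputs dominate the bill even when the prompts are large.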
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Models on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
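One common way to decide Pareto optimality is a dominance check: a result is on the frontier if no other result is at least as accurate and at least as cheap while being strictly better on one of the two. The sketch below assumes that definition; the leaderboard's exact procedure is not described on this page and may differ. The `Result` type, the `pareto_frontier` function, and the data points are all hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    accuracy: float  # percent, higher is better
    cost: float      # USD, lower is better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the results that no other result dominates."""
    frontier = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return frontier


# Made-up values for illustration only; not taken from any leaderboard.
results = [
    Result("agent-a", 40.0, 15.0),
    Result("agent-b", 30.0, 50.0),   # dominated by agent-a
    Result("agent-c", 55.0, 120.0),
    Result("agent-d", 55.0, 200.0),  # dominated by agent-c
    Result("agent-e", 10.0, 5.0),
]

for r in pareto_frontier(results):
    print(r.name)
```

Note that the comparison has to run over every model and agent evaluated on a benchmark, so the flags cannot be reproduced from a single model's rows alone.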
Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
Assistantbench | Browser-Use | 38.81% | $15.15 | Yes |
Corebench Hard | CORE-Agent | 24.44% | $120.47 | No |
Corebench Hard | HAL Generalist Agent | 22.22% | $88.34 | No |
Gaia | HF Open Deep Research | 32.73% | $136.39 | No |
Online Mind2Web | SeeAct | 39.00% | $258.74 | No |
Online Mind2Web | Browser-Use | 29.00% | $371.59 | No |
Scicode | Scicode Tool Calling Agent | 9.23% | $111.11 | No |
Scicode | Scicode Zero Shot Agent | 4.62% | $6.03 | No |
Scienceagentbench | SAB Self-Debug | 33.33% | $11.69 | Yes |
Scienceagentbench | HAL Generalist Agent | 9.80% | $31.08 | No |
Swebench Verified Mini | SWE-Agent | 46.00% | $483.43 | No |
Swebench Verified Mini | HAL Generalist Agent | 0.00% | $585.71 | No |
Taubench Airline | TAU-bench Few Shot | 46.00% | $34.14 | No |
Taubench Airline | HAL Generalist Agent | 20.00% | $45.03 | No |
Usaco | USACO Episodic + Semantic | 46.25% | $57.30 | No |