o3 Medium (April 2025)

Performance overview across all HAL benchmarks

Benchmarks: 9
Agents: 11
Pareto-optimal benchmarks: 2

Token Pricing

Input tokens:  $2 per 1M tokens
Output tokens: $8 per 1M tokens
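
At these rates, a run's total cost follows directly from its token counts. A minimal sketch of the arithmetic (the token counts in the example are hypothetical, not taken from any run below):

```python
# Listed o3 Medium rates, USD per 1M tokens.
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 8.00

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Total USD cost of one run at the listed per-token rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Hypothetical run: 50M input tokens, 5M output tokens.
print(f"${run_cost(50_000_000, 5_000_000):.2f}")  # $140.00
```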

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark: no other model on the benchmark's leaderboard was at least as accurate for no more cost. Models on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
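
As a sketch of how such a frontier can be computed (a quadratic pairwise-dominance check over cost/accuracy pairs; this is not necessarily HAL's own implementation, and the field names are illustrative):

```python
def pareto_frontier(runs):
    """Return the runs not dominated on the (cost, accuracy) plane.

    A run is dominated if some other run costs no more and scores at
    least as high, with a strict improvement on at least one axis.
    """
    return [
        a for a in runs
        if not any(
            b["cost"] <= a["cost"]
            and b["accuracy"] >= a["accuracy"]
            and (b["cost"] < a["cost"] or b["accuracy"] > a["accuracy"])
            for b in runs
        )
    ]

# The two Online Mind2Web rows from the table below:
runs = [
    {"agent": "SeeAct", "accuracy": 39.00, "cost": 258.74},
    {"agent": "Browser-Use", "accuracy": 29.00, "cost": 371.59},
]
print([r["agent"] for r in pareto_frontier(runs)])  # ['SeeAct']
```

Note that SeeAct dominates Browser-Use within this pair, yet both rows read "No" in the table, because the frontier is computed across all models on the benchmark's leaderboard, not just this model's runs.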

Benchmark                  Agent                         Accuracy   Cost      On the Pareto Frontier?
AssistantBench             Browser-Use                   38.81%     $15.15    Yes
CORE-Bench Hard            CORE-Agent                    24.44%     $120.47   No
CORE-Bench Hard            HAL Generalist Agent          22.22%     $88.34    No
GAIA                       HF Open Deep Research         32.73%     $136.39   No
Online Mind2Web            SeeAct                        39.00%     $258.74   No
Online Mind2Web            Browser-Use                   29.00%     $371.59   No
SciCode                    SciCode Tool Calling Agent    9.23%      $111.11   No
SciCode                    SciCode Zero Shot Agent       4.62%      $6.03     No
ScienceAgentBench          SAB Self-Debug                33.33%     $11.69    Yes
ScienceAgentBench          HAL Generalist Agent          9.80%      $31.08    No
SWE-bench Verified Mini    SWE-Agent                     46.00%     $483.43   No
SWE-bench Verified Mini    HAL Generalist Agent          0.00%      $585.71   No
TAU-bench Airline          TAU-bench Few Shot            46.00%     $34.14    No
TAU-bench Airline          HAL Generalist Agent          20.00%     $45.03    No
USACO                      USACO Episodic + Semantic     46.25%     $57.30    No