o4-mini High (April 2025)

Performance overview across all HAL benchmarks

Benchmarks: 9
Agents: 11
Pareto benchmarks: 4

Token Pricing

Input tokens: $1.10 per 1M tokens
Output tokens: $4.40 per 1M tokens
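
Below is a minimal sketch of how these rates translate into a per-run dollar cost. The helper function and the token counts are illustrative, not taken from the leaderboard:

```python
# Hypothetical helper: estimate run cost at o4-mini High's listed rates.
INPUT_PRICE_PER_M = 1.10   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 4.40  # USD per 1M output tokens

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a run given its token counts."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# e.g. a run consuming 2.5M input and 400K output tokens:
print(f"${run_cost(2_500_000, 400_000):.2f}")  # -> $4.51
```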

Benchmark Performance

"On the Pareto Frontier?" indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, meaning no other model on the leaderboard reached higher accuracy at equal or lower cost. Models on the Pareto frontier represent the best available cost-accuracy trade-off at their performance level (see the sketch after the table).

| Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| AssistantBench | Browser-Use | 23.84% | $16.39 | No |
| CORE-Bench Hard | HAL Generalist Agent | 35.56% | $45.37 | Yes |
| CORE-Bench Hard | CORE-Agent | 26.67% | $61.35 | No |
| GAIA | HF Open Deep Research | 55.76% | $184.87 | No |
| GAIA | HAL Generalist Agent | 54.55% | $59.39 | Yes |
| Online Mind2Web | SeeAct | 32.00% | $228.98 | No |
| Online Mind2Web | Browser-Use | 20.00% | $297.93 | No |
| SciCode | Scicode Zero Shot Agent | 6.15% | $5.37 | No |
| SciCode | Scicode Tool Calling Agent | 4.62% | $66.20 | No |
| ScienceAgentBench | SAB Self-Debug | 27.45% | $11.18 | No |
| ScienceAgentBench | HAL Generalist Agent | 21.57% | $76.30 | No |
| SWE-bench Verified Mini | SWE-Agent | 50.00% | $248.46 | No |
| SWE-bench Verified Mini | HAL Generalist Agent | 2.00% | $32.02 | No |
| TAU-bench Airline | TAU-bench Few Shot | 60.00% | $18.92 | Yes |
| TAU-bench Airline | HAL Generalist Agent | 18.00% | $20.57 | No |
| USACO | USACO Episodic + Semantic | 57.98% | $44.04 | Yes |
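
The frontier labels are computed against every model on the leaderboard for each benchmark, not just the rows shown here. As a minimal sketch of the underlying check, assuming simple dominance over (accuracy, cost) pairs; the agent names and numbers below are hypothetical:

```python
def pareto_frontier(entries):
    """entries: list of (name, accuracy, cost). Returns the names of
    entries not dominated by any other entry, i.e. no rival has
    higher-or-equal accuracy at lower-or-equal cost while being
    strictly better on at least one axis."""
    frontier = []
    for name, acc, cost in entries:
        dominated = any(
            a2 >= acc and c2 <= cost and (a2 > acc or c2 < cost)
            for _, a2, c2 in entries
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Illustrative entries for one benchmark: (agent, accuracy %, cost $)
entries = [
    ("Agent A", 55.8, 184.87),
    ("Agent B", 54.6, 59.39),
    ("Agent C", 40.0, 120.00),
]
print(pareto_frontier(entries))  # -> ['Agent A', 'Agent B']
```

Because the real check runs over the full leaderboard, applying this sketch to only the rows in the table above will not reproduce the Yes/No labels exactly.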