o4-mini High (April 2025)
Performance overview across all HAL benchmarks
Benchmarks: 9
Agents: 11
Pareto Benchmarks: 4
Token Pricing
Input Tokens: $1.10 per 1M tokens
Output Tokens: $4.40 per 1M tokens
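
As a rough illustration of how per-token prices turn into a run cost, the sketch below applies the two rates listed above to a pair of token counts. The function name `run_cost` and the token counts are hypothetical, chosen only for the example. The costs reported in the table may reflect billing details that this simple formula ignores.

```python
# Prices from the Token Pricing figures above (USD per 1M tokens).
INPUT_PRICE_PER_M = 1.10
OUTPUT_PRICE_PER_M = 4.40


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a run from its token counts (hypothetical helper)."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical token counts, not taken from any benchmark run.
example = run_cost(input_tokens=2_000_000, output_tokens=500_000)
print(f"${example:.2f}")  # prints "$4.40"
```

Output tokens cost four times as much as input tokens at these rates, so runs that generate a lot of text can cost noticeably more than their input size alone would suggest.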
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this model, paired with the listed agent, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Results on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
Assistantbench | Browser-Use | 23.84% | $16.39 | No |
Corebench Hard | HAL Generalist Agent | 35.56% | $45.37 | Yes |
Corebench Hard | CORE-Agent | 26.67% | $61.35 | No |
Gaia | HF Open Deep Research | 55.76% | $184.87 | No |
Gaia | HAL Generalist Agent | 54.55% | $59.39 | Yes |
Online Mind2Web | SeeAct | 32.00% | $228.98 | No |
Online Mind2Web | Browser-Use | 20.00% | $297.93 | No |
Scicode | Scicode Zero Shot Agent | 6.15% | $5.37 | No |
Scicode | Scicode Tool Calling Agent | 4.62% | $66.20 | No |
Scienceagentbench | SAB Self-Debug | 27.45% | $11.18 | No |
Scienceagentbench | HAL Generalist Agent | 21.57% | $76.30 | No |
Swebench Verified Mini | SWE-Agent | 50.00% | $248.46 | No |
Swebench Verified Mini | HAL Generalist Agent | 2.00% | $32.02 | No |
Taubench Airline | TAU-bench Few Shot | 60.00% | $18.92 | Yes |
Taubench Airline | HAL Generalist Agent | 18.00% | $20.57 | No |
Usaco | USACO Episodic + Semantic | 57.98% | $44.04 | Yes |
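
The Pareto-frontier column can be illustrated with a minimal sketch that uses the standard non-dominance rule: a result is on the frontier if no other result is at least as accurate and at least as cheap while being strictly better on one of the two. This is an assumption about the general technique, and the leaderboard's exact procedure may differ. The `Result` type, the `pareto_frontier` function, and all data points below are hypothetical. Membership depends on every model evaluated on a benchmark, so it cannot be derived from the single-model rows in the table above.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    accuracy: float  # percent, higher is better
    cost: float      # USD, lower is better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the results that no other result dominates."""
    frontier = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return frontier


# Hypothetical results for one benchmark; not taken from the leaderboard.
results = [
    Result("model-a", 60.0, 20.0),
    Result("model-b", 55.0, 25.0),  # dominated by model-a: less accurate and pricier
    Result("model-c", 70.0, 90.0),  # most accurate, so nothing dominates it
    Result("model-d", 30.0, 5.0),   # cheapest, so nothing dominates it
]

print([r.name for r in pareto_frontier(results)])
# prints ['model-a', 'model-c', 'model-d']
```

The quadratic pairwise comparison is fine for leaderboard-sized inputs. A sort by cost followed by a running maximum of accuracy would scale better for larger sets.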