o4-mini Low (April 2025)

Performance overview across all HAL benchmarks

9 Benchmarks · 11 Agents · 5 Pareto Benchmarks

Token Pricing

Input tokens: $1.10 per 1M tokens
Output tokens: $4.40 per 1M tokens
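
As a quick illustration of how these per-token rates translate into the per-run dollar costs in the table below, here is a minimal sketch of the cost formula. The rates come from the pricing above; the token counts in the example are made up for illustration, not taken from any actual run.

```python
# Cost of a run at this model's listed rates.
INPUT_RATE = 1.10 / 1_000_000   # USD per input token ($1.10 per 1M)
OUTPUT_RATE = 4.40 / 1_000_000  # USD per output token ($4.40 per 1M)

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Total USD cost: input and output tokens billed at their own rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical run consuming 40M input tokens and 5M output tokens:
print(f"${run_cost(40_000_000, 5_000_000):.2f}")  # $66.00
```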

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, i.e. no other run scored at least as high at an equal or lower cost. Models on the Pareto frontier represent the current state-of-the-art efficiency for their performance level (a sketch of the dominance test appears after the table).

| Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| AssistantBench | Browser-Use | 28.05% | $9.22 | Yes |
| CORE-Bench Hard | CORE-Agent | 17.78% | $31.79 | No |
| CORE-Bench Hard | HAL Generalist Agent | 15.56% | $22.50 | No |
| GAIA | HAL Generalist Agent | 58.18% | $73.26 | Yes |
| GAIA | HF Open Deep Research | 47.88% | $80.80 | No |
| Online Mind2Web | SeeAct | 31.67% | $162.36 | No |
| Online Mind2Web | Browser-Use | 18.33% | $201.44 | No |
| SciCode | Scicode Zero Shot Agent | 9.23% | $1.74 | Yes |
| SciCode | Scicode Tool Calling Agent | 4.62% | $46.30 | No |
| ScienceAgentBench | SAB Self-Debug | 27.45% | $3.95 | Yes |
| ScienceAgentBench | HAL Generalist Agent | 19.61% | $77.32 | No |
| SWE-bench Verified Mini | SWE-Agent | 54.00% | $259.20 | Yes |
| SWE-bench Verified Mini | HAL Generalist Agent | 6.00% | $87.03 | No |
| TAU-bench Airline | TAU-bench Few Shot | 48.00% | $18.81 | No |
| TAU-bench Airline | HAL Generalist Agent | 22.00% | $20.16 | No |
| USACO | USACO Episodic + Semantic | 30.94% | $21.14 | No |
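
The following sketch illustrates the accuracy/cost dominance test behind the Pareto column. The two GAIA rows are taken from the table above; the `AgentRun` and `pareto_frontier` names are illustrative, not HAL's actual API, and the real frontier on the leaderboard is computed across all models and agents on each benchmark, not just this model's runs.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    agent: str
    accuracy: float  # percent
    cost: float      # USD

def pareto_frontier(runs: list[AgentRun]) -> list[AgentRun]:
    """Keep the runs not dominated by any other run.

    A run is dominated if some other run has accuracy >= and cost <=,
    with at least one of the two strictly better.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o.accuracy >= r.accuracy and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return frontier

# The two GAIA rows from the table above:
runs = [
    AgentRun("HAL Generalist Agent", 58.18, 73.26),
    AgentRun("HF Open Deep Research", 47.88, 80.80),
]
print([r.agent for r in pareto_frontier(runs)])
# ['HAL Generalist Agent'] -- more accurate *and* cheaper, so it
# dominates HF Open Deep Research, matching the Yes/No in the table.
```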