o3 Medium (April 2025)

Performance overview across all HAL benchmarks

Benchmarks: 9
Agents: 11
Pareto-optimal benchmarks: 2

Token Pricing

Input tokens:  $2 per 1M tokens
Output tokens: $8 per 1M tokens
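
At these rates, a run's total cost follows directly from its token counts. A minimal sketch of the arithmetic (the token counts in the example are hypothetical, not taken from any run below):

```python
# Listed o3 Medium rates, USD per 1M tokens.
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 8.00

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Total USD cost of one run at the listed per-token rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Hypothetical run: 50M input tokens, 5M output tokens.
print(f"${run_cost(50_000_000, 5_000_000):.2f}")  # $140.00
```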

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark: no other model on the benchmark's leaderboard was at least as accurate for no more cost. Models on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
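
As a sketch of how such a frontier can be computed (a quadratic pairwise-dominance check over cost/accuracy pairs; this is not necessarily HAL's own implementation, and the field names are illustrative):

```python
def pareto_frontier(runs):
    """Return the runs not dominated on the (cost, accuracy) plane.

    A run is dominated if some other run costs no more and scores at
    least as high, with a strict improvement on at least one axis.
    """
    return [
        a for a in runs
        if not any(
            b["cost"] <= a["cost"]
            and b["accuracy"] >= a["accuracy"]
            and (b["cost"] < a["cost"] or b["accuracy"] > a["accuracy"])
            for b in runs
        )
    ]

# The two Online Mind2Web rows from the table below:
runs = [
    {"agent": "SeeAct", "accuracy": 39.00, "cost": 258.74},
    {"agent": "Browser-Use", "accuracy": 29.00, "cost": 371.59},
]
print([r["agent"] for r in pareto_frontier(runs)])  # ['SeeAct']
```

Note that SeeAct dominates Browser-Use within this pair, yet both rows read "No" in the table, because the frontier is computed across all models on the benchmark's leaderboard, not just this model's runs.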

Benchmark                  Agent                         Accuracy   Cost      On the Pareto Frontier?
AssistantBench             Browser-Use                   38.81%     $15.15    Yes
CORE-Bench Hard            CORE-Agent                    24.44%     $120.47   No
CORE-Bench Hard            HAL Generalist Agent          22.22%     $88.34    No
GAIA                       HF Open Deep Research         32.73%     $136.39   No
Online Mind2Web            SeeAct                        39.00%     $258.74   No
Online Mind2Web            Browser-Use                   29.00%     $371.59   No
SciCode                    SciCode Tool Calling Agent    9.23%      $111.11   No
SciCode                    SciCode Zero Shot Agent       4.62%      $6.03     No
ScienceAgentBench          SAB Self-Debug                33.33%     $11.69    Yes
ScienceAgentBench          HAL Generalist Agent          9.80%      $31.08    No
SWE-bench Verified Mini    SWE-Agent                     46.00%     $483.43   No
SWE-bench Verified Mini    HAL Generalist Agent          0.00%      $585.71   No
TAU-bench Airline          TAU-bench Few Shot            46.00%     $34.14    No
TAU-bench Airline          HAL Generalist Agent          20.00%     $45.03    No
USACO                      USACO Episodic + Semantic     46.25%     $57.30    No