GPT-5 Medium (August 2025)
Performance overview across all HAL benchmarks
- Benchmarks: 9
- Agents: 10
- Pareto-optimal benchmarks: 3
Token Pricing
- Input tokens: $1.25 per 1M tokens
- Output tokens: $10.00 per 1M tokens
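As an illustration of how these rates translate into dollar amounts, here is a minimal sketch of the cost arithmetic. The per-run costs in the table below are measured by HAL across many model calls, not derived from this formula; the token counts in the example are made up.

```python
# Illustrative only: converts token counts to dollars at the listed GPT-5 Medium rates.
# Real HAL run costs aggregate many model calls; this just shows the arithmetic.

INPUT_PRICE_PER_M = 1.25    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost given total input and output token counts."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Hypothetical example: 20M input tokens and 1.5M output tokens
print(f"${run_cost(20_000_000, 1_500_000):.2f}")  # $40.00
```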
Benchmark Performance
"On the Pareto Frontier?" indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, i.e. no other model on that benchmark was both more accurate and cheaper. Models on the Pareto frontier represent the current state-of-the-art efficiency for their performance level (see the sketch after the table).
| Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| AssistantBench | Browser-Use | 35.23% | $41.69 | No |
| CORE-Bench Hard | CORE-Agent | 26.67% | $31.76 | No |
| CORE-Bench Hard | HAL Generalist Agent | 11.11% | $29.75 | No |
| GAIA | HF Open Deep Research | 62.80% | $359.83 | No |
| Online Mind2Web | SeeAct | 42.33% | $171.07 | Yes |
| Online Mind2Web | Browser-Use | 32.00% | $736.31 | No |
| SciCode | Scicode Tool Calling Agent | 6.15% | $193.52 | No |
| ScienceAgentBench | SAB Self-Debug | 30.39% | $18.26 | No |
| SWE-bench Verified Mini | SWE-Agent | 46.00% | $162.93 | Yes |
| SWE-bench Verified Mini | HAL Generalist Agent | 12.00% | $57.58 | No |
| TAU-bench Airline | TAU-bench Few Shot | 52.00% | $35.49 | No |
| TAU-bench Airline | HAL Generalist Agent | 30.00% | $52.78 | No |
| USACO | USACO Episodic + Semantic | 69.71% | $64.13 | Yes |
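For reference, a minimal sketch of how a Pareto frontier over (accuracy, cost) points can be computed. This is not HAL's exact implementation, and the leaderboard's frontier is computed across all models on a benchmark; the two Online Mind2Web rows from the table are used here only to illustrate the dominance check.

```python
# Minimal sketch (not HAL's implementation): an entry is Pareto-optimal if no other
# entry has accuracy >= it and cost <= it, with at least one of the two strict.

from typing import NamedTuple

class Entry(NamedTuple):
    agent: str
    accuracy: float  # percent
    cost: float      # USD

def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    def dominated(e: Entry) -> bool:
        return any(
            o.accuracy >= e.accuracy and o.cost <= e.cost
            and (o.accuracy > e.accuracy or o.cost < e.cost)
            for o in entries
        )
    return [e for e in entries if not dominated(e)]

# Example: the two Online Mind2Web rows from the table above.
entries = [
    Entry("SeeAct", 42.33, 171.07),
    Entry("Browser-Use", 32.00, 736.31),
]
print(pareto_frontier(entries))
# [Entry(agent='SeeAct', accuracy=42.33, cost=171.07)]
```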