GPT-5 Medium (August 2025)

Performance overview across all HAL benchmarks

Benchmarks: 9
Agents: 10
Pareto-optimal benchmarks: 3

Token Pricing

Input tokens: $1.25 per 1M tokens
Output tokens: $10.00 per 1M tokens
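The per-run Cost column in the table below is driven by these per-1M-token rates. As a minimal sketch of how such a cost is computed (the token counts in the example are made up for illustration, not taken from any actual run):

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_price_per_m: float = 1.25,
             output_price_per_m: float = 10.0) -> float:
    """Total USD cost of a run, given token counts and per-1M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A hypothetical run with 2M input tokens and 500K output tokens:
# 2.0 * $1.25 + 0.5 * $10.00 = $7.50
print(run_cost(2_000_000, 500_000))  # 7.5
```

Note that output tokens are 8x more expensive than input tokens at these rates, so agents that generate long reasoning traces or many retries can dominate the cost even when their prompts are small.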

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. A run is on the Pareto frontier when no other model achieves higher accuracy at equal or lower cost, so frontier runs represent the current state-of-the-art efficiency for their performance level.

| Benchmark             | Agent                      | Accuracy | Cost (USD) | On the Pareto Frontier? |
|-----------------------|----------------------------|----------|------------|-------------------------|
| AssistantBench        | Browser-Use                | 35.23%   | $41.69     | No                      |
| CORE-Bench Hard       | CORE-Agent                 | 26.67%   | $31.76     | No                      |
| CORE-Bench Hard       | HAL Generalist Agent       | 11.11%   | $29.75     | No                      |
| GAIA                  | HF Open Deep Research      | 62.80%   | $359.83    | No                      |
| Online Mind2Web       | SeeAct                     | 42.33%   | $171.07    | Yes                     |
| Online Mind2Web       | Browser-Use                | 32.00%   | $736.31    | No                      |
| SciCode               | Scicode Tool Calling Agent | 6.15%    | $193.52    | No                      |
| ScienceAgentBench     | SAB Self-Debug             | 30.39%   | $18.26     | No                      |
| SWE-bench Verified Mini | SWE-Agent                | 46.00%   | $162.93    | Yes                     |
| SWE-bench Verified Mini | HAL Generalist Agent     | 12.00%   | $57.58     | No                      |
| TAU-bench Airline     | TAU-bench Few Shot         | 52.00%   | $35.49     | No                      |
| TAU-bench Airline     | HAL Generalist Agent       | 30.00%   | $52.78     | No                      |
| USACO                 | USACO Episodic + Semantic  | 69.71%   | $64.13     | Yes                     |
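The frontier designation in the table is computed per benchmark across all models on the leaderboard, so it cannot be reproduced from this one model's rows alone. The underlying check is a standard Pareto-dominance test over (accuracy, cost) pairs; a minimal sketch, with made-up points rather than real leaderboard data:

```python
def pareto_frontier(runs: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the (accuracy, cost) points not dominated by any other point.

    A run is dominated if some other run has accuracy >= its accuracy and
    cost <= its cost, with at least one of the two strictly better.
    """
    frontier = []
    for acc, cost in runs:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for a, c in runs
        )
        if not dominated:
            frontier.append((acc, cost))
    return frontier

# Hypothetical runs on one benchmark: (accuracy %, cost $).
runs = [(46.0, 162.93), (12.0, 57.58), (50.0, 300.0), (10.0, 500.0)]
print(pareto_frontier(runs))
# (10.0, 500.0) is dominated by (12.0, 57.58): worse accuracy at higher cost.
```

Note that a low-accuracy run can still sit on the frontier if it is cheap enough, which is why frontier membership is a statement about efficiency rather than raw accuracy.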