GPT-5 Medium (August 2025)
Performance overview across all HAL benchmarks
- Benchmarks: 9
- Agents: 10
- Pareto-optimal benchmarks: 3
Token Pricing
- Input tokens: $1.25 per 1M tokens
- Output tokens: $10.00 per 1M tokens
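As an illustration of how these rates translate into dollar amounts, here is a minimal sketch of the cost arithmetic. The per-run costs in the table below are measured by HAL across many model calls, not derived from this formula; the token counts in the example are made up.

```python
# Illustrative only: converts token counts to dollars at the listed GPT-5 Medium rates.
# Real HAL run costs aggregate many model calls; this just shows the arithmetic.

INPUT_PRICE_PER_M = 1.25    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost given total input and output token counts."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Hypothetical example: 20M input tokens and 1.5M output tokens
print(f"${run_cost(20_000_000, 1_500_000):.2f}")  # $40.00
```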
Benchmark Performance
"On the Pareto Frontier?" indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, i.e. no other model on that benchmark was both more accurate and cheaper. Models on the Pareto frontier represent the current state-of-the-art efficiency for their performance level (see the sketch after the table).
| Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| AssistantBench | Browser-Use | 35.23% | $41.69 | No |
| CORE-Bench Hard | CORE-Agent | 26.67% | $31.76 | No |
| CORE-Bench Hard | HAL Generalist Agent | 11.11% | $29.75 | No |
| GAIA | HF Open Deep Research | 62.80% | $359.83 | No |
| Online Mind2Web | SeeAct | 42.33% | $171.07 | Yes |
| Online Mind2Web | Browser-Use | 32.00% | $736.31 | No |
| SciCode | Scicode Tool Calling Agent | 6.15% | $193.52 | No |
| ScienceAgentBench | SAB Self-Debug | 30.39% | $18.26 | No |
| SWE-bench Verified Mini | SWE-Agent | 46.00% | $162.93 | Yes |
| SWE-bench Verified Mini | HAL Generalist Agent | 12.00% | $57.58 | No |
| TAU-bench Airline | TAU-bench Few Shot | 52.00% | $35.49 | No |
| TAU-bench Airline | HAL Generalist Agent | 30.00% | $52.78 | No |
| USACO | USACO Episodic + Semantic | 69.71% | $64.13 | Yes |
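For reference, a minimal sketch of how a Pareto frontier over (accuracy, cost) points can be computed. This is not HAL's exact implementation, and the leaderboard's frontier is computed across all models on a benchmark; the two Online Mind2Web rows from the table are used here only to illustrate the dominance check.

```python
# Minimal sketch (not HAL's implementation): an entry is Pareto-optimal if no other
# entry has accuracy >= it and cost <= it, with at least one of the two strict.

from typing import NamedTuple

class Entry(NamedTuple):
    agent: str
    accuracy: float  # percent
    cost: float      # USD

def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    def dominated(e: Entry) -> bool:
        return any(
            o.accuracy >= e.accuracy and o.cost <= e.cost
            and (o.accuracy > e.accuracy or o.cost < e.cost)
            for o in entries
        )
    return [e for e in entries if not dominated(e)]

# Example: the two Online Mind2Web rows from the table above.
entries = [
    Entry("SeeAct", 42.33, 171.07),
    Entry("Browser-Use", 32.00, 736.31),
]
print(pareto_frontier(entries))
# [Entry(agent='SeeAct', accuracy=42.33, cost=171.07)]
```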