GPT-4.1 (April 2025)
Performance overview across all HAL benchmarks
Benchmarks: 9
Agents: 11
Pareto Optimal Benchmarks: 0
Token Pricing
Input Tokens: $2 per 1M tokens
Output Tokens: $8 per 1M tokens
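As a rough illustration of how these per-token rates turn into a dollar figure, the sketch below multiplies token counts by the listed prices. The function name `run_cost` and the token counts are hypothetical and chosen only for the example; the leaderboard's own cost accounting may include details not shown on this page.

```python
# Prices listed on this page for GPT-4.1, in USD per 1M tokens.
INPUT_PRICE_PER_M = 2.0
OUTPUT_PRICE_PER_M = 8.0


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a run from its token counts (illustrative helper)."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Hypothetical token counts, not taken from any benchmark run.
example_cost = run_cost(input_tokens=1_000_000, output_tokens=500_000)
print(f"${example_cost:.2f}")  # $6.00
```

Because output tokens cost four times as much as input tokens at these rates, agents that generate long outputs can cost noticeably more than their input volume alone would suggest.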
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. A model on the Pareto frontier is one that no other evaluated model beats on both accuracy and cost for that benchmark.
Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
Assistantbench | Browser-Use | 17.39% | $14.15 | No |
Corebench Hard | CORE-Agent | 33.33% | $107.36 | No |
Corebench Hard | HAL Generalist Agent | 22.22% | $58.32 | No |
Gaia | HF Open Deep Research | 50.30% | $109.88 | No |
Gaia | HAL Generalist Agent | 49.70% | $74.19 | No |
Online Mind2Web | Browser-Use | 36.33% | $236.62 | No |
Online Mind2Web | SeeAct | 30.33% | $271.24 | No |
Scicode | Scicode Zero Shot Agent | 6.15% | $2.82 | No |
Scicode | Scicode Tool Calling Agent | 1.54% | $69.39 | No |
Scienceagentbench | SAB Self-Debug | 24.51% | $7.42 | No |
Scienceagentbench | HAL Generalist Agent | 6.86% | $68.95 | No |
Swebench Verified Mini | SWE-Agent | 44.00% | $393.65 | No |
Swebench Verified Mini | HAL Generalist Agent | 2.00% | $51.80 | No |
Taubench Airline | TAU-bench Few Shot | 56.00% | $42.58 | No |
Taubench Airline | HAL Generalist Agent | 16.00% | $17.85 | No |
Usaco | USACO Episodic + Semantic | 44.95% | $28.10 | No |
Usaco | HAL Generalist Agent | 25.41% | $197.33 | No |
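
The Pareto-frontier test behind the last column can be sketched as a plain dominance check: a run is kept only if no other run on the same benchmark is at least as accurate and at least as cheap while being strictly better on one of the two. This is a minimal sketch under that assumption; the leaderboard's own frontier computation may differ. The `Run` type, the `pareto_frontier` function, and all of the runs below are hypothetical, since the table above lists only this model's results and not the competing runs needed to decide the frontier.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    """One (agent, model) result on a single benchmark (illustrative type)."""
    name: str
    accuracy: float  # percent
    cost: float      # USD


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs that no other run dominates, sorted by cost."""
    frontier = []
    for r in runs:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost)


# Hypothetical runs on one benchmark, invented for illustration only.
example_runs = [
    Run("run-a", accuracy=30.0, cost=10.0),
    Run("run-b", accuracy=50.0, cost=40.0),
    Run("run-c", accuracy=45.0, cost=60.0),  # dominated by run-b
    Run("run-d", accuracy=20.0, cost=15.0),  # dominated by run-a
]

frontier = pareto_frontier(example_runs)
print([r.name for r in frontier])  # ['run-a', 'run-b']
```

Under this check, a run can be highly accurate and still be off the frontier if some other run reaches at least the same accuracy at a lower cost.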