GPT-4.1 (April 2025)

Performance overview across all HAL benchmarks

- Benchmarks: 9
- Agents: 11
- Pareto-optimal benchmarks: 0

Token Pricing

- Input tokens: $2 per 1M tokens
- Output tokens: $8 per 1M tokens
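As an illustration of how these rates translate into run costs (the token counts below are hypothetical, not from the leaderboard), a per-run cost can be computed as:

```python
# Cost calculator for the listed GPT-4.1 rates:
# $2 per 1M input tokens, $8 per 1M output tokens.
INPUT_RATE = 2.00 / 1_000_000   # dollars per input token
OUTPUT_RATE = 8.00 / 1_000_000  # dollars per output token

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one run at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a hypothetical run using 500k input and 100k output tokens:
print(round(run_cost(500_000, 100_000), 2))  # → 1.8
```

Benchmark costs in the table below are totals across all tasks in a run, which is why they reach into the hundreds of dollars.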

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Models on the Pareto frontier represent the current state-of-the-art efficiency for their performance level: no other model delivers higher accuracy at equal or lower cost.

| Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| AssistantBench | Browser-Use | 17.39% | $14.15 | No |
| CORE-Bench Hard | CORE-Agent | 33.33% | $107.36 | No |
| CORE-Bench Hard | HAL Generalist Agent | 22.22% | $58.32 | No |
| GAIA | HF Open Deep Research | 50.30% | $109.88 | No |
| GAIA | HAL Generalist Agent | 49.70% | $74.19 | No |
| Online Mind2Web | Browser-Use | 36.33% | $236.62 | No |
| Online Mind2Web | SeeAct | 30.33% | $271.24 | No |
| SciCode | SciCode Zero Shot Agent | 6.15% | $2.82 | No |
| SciCode | SciCode Tool Calling Agent | 1.54% | $69.39 | No |
| ScienceAgentBench | SAB Self-Debug | 24.51% | $7.42 | No |
| ScienceAgentBench | HAL Generalist Agent | 6.86% | $68.95 | No |
| SWE-bench Verified Mini | SWE-Agent | 44.00% | $393.65 | No |
| SWE-bench Verified Mini | HAL Generalist Agent | 2.00% | $51.80 | No |
| TAU-bench Airline | TAU-bench Few Shot | 56.00% | $42.58 | No |
| TAU-bench Airline | HAL Generalist Agent | 16.00% | $17.85 | No |
| USACO | USACO Episodic + Semantic | 44.95% | $28.10 | No |
| USACO | HAL Generalist Agent | 25.41% | $197.33 | No |
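The Pareto check behind the last column can be sketched as follows: a run is Pareto-optimal on a benchmark if no other run has both higher-or-equal accuracy and lower-or-equal cost (strictly better in at least one). This is a minimal sketch, not the leaderboard's implementation, and note that the "No" entries above reflect comparison against runs from all models on the leaderboard, not only the runs shown on this page:

```python
from typing import NamedTuple

class Run(NamedTuple):
    agent: str
    accuracy: float  # percent
    cost: float      # dollars

def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return runs not dominated by any other run: a run is dominated
    if another run is at least as accurate AND at most as costly,
    and strictly better on one of the two."""
    def dominated(r: Run) -> bool:
        return any(
            o.accuracy >= r.accuracy and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in runs
        )
    return [r for r in runs if not dominated(r)]

# The two GAIA rows from the table above; restricted to just these two,
# neither dominates the other (one is more accurate, one is cheaper):
gaia = [
    Run("HF Open Deep Research", 50.30, 109.88),
    Run("HAL Generalist Agent", 49.70, 74.19),
]
print([r.agent for r in pareto_frontier(gaia)])
# → ['HF Open Deep Research', 'HAL Generalist Agent']
```

Adding runs from other models can dominate these points, which is how a benchmark ends up with zero Pareto-optimal entries for a given model.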