Claude-3.7 Sonnet High (February 2025)

Performance overview across all HAL benchmarks

Benchmarks: 9
Agents: 11
Pareto-optimal benchmarks: 2

Token Pricing

Input tokens: $3 per 1M tokens
Output tokens: $15 per 1M tokens
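
As a rough illustration of how the per-run costs in the table below follow from these rates, here is a minimal sketch; the token counts in the example are hypothetical.

```python
# Claude-3.7 Sonnet pricing: $3 per 1M input tokens, $15 per 1M output tokens.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a run given its total input and output token counts."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Hypothetical example: 200k input tokens and 20k output tokens.
print(f"${run_cost(200_000, 20_000):.2f}")  # $0.90
```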

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, i.e. whether no other model on the leaderboard was both more accurate and cheaper. Models on the Pareto frontier represent the current state-of-the-art efficiency for their performance level; a sketch of the computation follows the table.

| Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| AssistantBench | Browser-Use | 13.08% | $16.13 | No |
| CORE-Bench Hard | HAL Generalist Agent | 37.78% | $66.15 | Yes |
| CORE-Bench Hard | CORE-Agent | 24.44% | $72.47 | No |
| GAIA | HAL Generalist Agent | 64.24% | $122.49 | Yes |
| GAIA | HF Open Deep Research | 35.76% | $113.65 | No |
| Online Mind2Web | Browser-Use | 39.33% | $1,151.88 | No |
| Online Mind2Web | SeeAct | 30.33% | $367.51 | No |
| SciCode | Scicode Tool Calling Agent | 4.62% | $204.37 | No |
| SciCode | Scicode Zero Shot Agent | 3.08% | $4.99 | No |
| ScienceAgentBench | SAB Self-Debug | 30.39% | $11.74 | No |
| ScienceAgentBench | HAL Generalist Agent | 17.65% | $48.28 | No |
| SWE-bench Verified Mini | SWE-Agent | 54.00% | $388.88 | No |
| SWE-bench Verified Mini | HAL Generalist Agent | 24.00% | $72.98 | No |
| TAU-bench Airline | TAU-bench Few Shot | 60.00% | $37.23 | No |
| TAU-bench Airline | HAL Generalist Agent | 44.00% | $34.58 | No |
| USACO | USACO Episodic + Semantic | 26.71% | $56.43 | No |
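
Below is a minimal sketch of how Pareto optimality can be determined from (accuracy, cost) pairs. The HAL leaderboard compares all models on a given benchmark, and the points in this example are hypothetical, so this illustrates the idea rather than reproducing the leaderboard's code.

```python
from typing import List, Tuple

def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the (accuracy, cost) points not dominated by any other point.

    A point is dominated if some other point has accuracy >= its accuracy
    and cost <= its cost, with at least one of the two strictly better.
    """
    frontier = []
    for acc, cost in points:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for a, c in points
        )
        if not dominated:
            frontier.append((acc, cost))
    return frontier

# Hypothetical (accuracy %, cost $) entries for one benchmark.
# The third point is dominated: the first is more accurate AND cheaper.
runs = [(64.2, 122.5), (35.8, 113.7), (50.0, 150.0)]
print(pareto_frontier(runs))  # [(64.2, 122.5), (35.8, 113.7)]
```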