Claude 3.7 Sonnet (February 2025)

Performance overview across all HAL benchmarks

Benchmarks: 9
Agents: 11
Pareto-optimal benchmarks: 0

Token Pricing

Input tokens: $3 per 1M tokens
Output tokens: $15 per 1M tokens
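
As a quick sanity check on how these rates turn into dollar figures like the costs in the table below, here is a minimal sketch; the function name and token counts are illustrative, not part of HAL.

```python
# Illustrative cost arithmetic at the listed rates; the token counts
# in the example are hypothetical, not taken from HAL.

INPUT_USD_PER_MTOK = 3.00    # $3 per 1M input tokens
OUTPUT_USD_PER_MTOK = 15.00  # $15 per 1M output tokens

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a run at the listed per-token prices."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000

# Example: 2,000,000 input + 400,000 output tokens
# -> 2 * $3 + 0.4 * $15 = $6.00 + $6.00 = $12.00
print(f"${run_cost(2_000_000, 400_000):.2f}")  # $12.00
```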

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, i.e., no other model reached at least the same accuracy at a lower cost. Models on the Pareto frontier represent the current state of the art in cost-efficiency at their performance level.
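
To make that definition concrete, below is a minimal sketch of the dominance test in Python, under the assumption that the frontier treats lower cost and higher accuracy as the two objectives; the function names are illustrative, not HAL's implementation.

```python
# Sketch of the Pareto-dominance test implied above (not HAL's code).
# A point is (cost_usd, accuracy_pct); lower cost and higher accuracy
# are both better.

def dominates(q: tuple[float, float], p: tuple[float, float]) -> bool:
    """True if q is at least as good as p on both axes and strictly better on one."""
    return q[0] <= p[0] and q[1] >= p[1] and q != p

def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Points not dominated by any other point in the list."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```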

| Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| AssistantBench | Browser-Use | 16.69% | $56.00 | No |
| CORE-Bench Hard | CORE-Agent | 35.56% | $73.04 | No |
| CORE-Bench Hard | HAL Generalist Agent | 31.11% | $56.64 | No |
| GAIA | HAL Generalist Agent | 56.36% | $130.68 | No |
| GAIA | HF Open Deep Research | 36.97% | $415.15 | No |
| Online Mind2Web | Browser-Use | 38.33% | $926.48 | No |
| Online Mind2Web | SeeAct | 28.33% | $291.97 | No |
| SciCode | SciCode Tool Calling Agent | 3.08% | $191.41 | No |
| SciCode | SciCode Zero Shot Agent | 0.00% | $5.10 | No |
| ScienceAgentBench | SAB Self-Debug | 22.55% | $7.12 | No |
| ScienceAgentBench | HAL Generalist Agent | 10.78% | $41.22 | No |
| SWE-bench Verified Mini | SWE-Agent | 50.00% | $402.69 | No |
| SWE-bench Verified Mini | HAL Generalist Agent | 26.00% | $117.43 | No |
| TAU-bench Airline | HAL Generalist Agent | 56.00% | $42.11 | No |
| TAU-bench Airline | TAU-bench Few Shot | 34.00% | $36.45 | No |
| USACO | USACO Episodic + Semantic | 29.32% | $38.70 | No |
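
As a worked example, applying the same dominance test to the two GAIA rows above shows how one run can dominate another. Note that the leaderboard's frontier is presumably computed across all models on a benchmark, not just this model's runs, which is why even a locally dominant run can read "No".

```python
# Worked example on the two GAIA rows above (cost USD, accuracy %).
# The leaderboard frontier spans all models, so a row can read "No"
# even when it dominates this model's other runs on the benchmark.

gaia = {
    "HAL Generalist Agent": (130.68, 56.36),
    "HF Open Deep Research": (415.15, 36.97),
}

def dominates(q, p):
    # Cheaper-or-equal, at-least-as-accurate, strictly better somewhere.
    return q[0] <= p[0] and q[1] >= p[1] and q != p

for name, p in gaia.items():
    nondom = not any(dominates(q, p) for q in gaia.values())
    print(f"{name}: locally non-dominated = {nondom}")
# -> HAL Generalist Agent: True  (cheaper and more accurate)
# -> HF Open Deep Research: False (dominated on both axes)
```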