Claude-3.7 Sonnet High (February 2025)
Performance overview across all HAL benchmarks
Benchmarks: 9
Agents: 11
Pareto Optimal Benchmarks: 2
Token Pricing
Input Tokens: $3 per 1M tokens
Output Tokens: $15 per 1M tokens
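As a rough illustration of how the per-token prices above translate into dollars, the sketch below computes the cost of a given number of input and output tokens. The function name `token_cost` and the token counts in the usage line are hypothetical, and this is not necessarily how the leaderboard's reported costs were computed.

```python
# Prices taken from the Token Pricing section above (USD per 1M tokens).
INPUT_PRICE_PER_M = 3.0
OUTPUT_PRICE_PER_M = 15.0


def token_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost for the given token counts (hypothetical helper)."""
    input_cost = input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


# Hypothetical usage: 2M input tokens and 500k output tokens.
example_cost = token_cost(2_000_000, 500_000)
print(f"${example_cost:.2f}")
```

Because output tokens cost five times as much as input tokens, agents that generate long outputs can be noticeably more expensive than their input volume alone suggests.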
Benchmark Performance
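The "On the Pareto Frontier?" column indicates whether this model, run with the listed agent, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. A result is Pareto-optimal when no other result on the same benchmark is at least as accurate and at least as cheap while being strictly better on one of the two. Models on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.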
Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
Assistantbench | Browser-Use | 13.08% | $16.13 | No |
Corebench Hard | HAL Generalist Agent | 37.78% | $66.15 | Yes |
Corebench Hard | CORE-Agent | 24.44% | $72.47 | No |
Gaia | HAL Generalist Agent | 64.24% | $122.49 | Yes |
Gaia | HF Open Deep Research | 35.76% | $113.65 | No |
Online Mind2Web | Browser-Use | 39.33% | $1151.88 | No |
Online Mind2Web | SeeAct | 30.33% | $367.51 | No |
Scicode | Scicode Tool Calling Agent | 4.62% | $204.37 | No |
Scicode | Scicode Zero Shot Agent | 3.08% | $4.99 | No |
Scienceagentbench | SAB Self-Debug | 30.39% | $11.74 | No |
Scienceagentbench | HAL Generalist Agent | 17.65% | $48.28 | No |
Swebench Verified Mini | SWE-Agent | 54.00% | $388.88 | No |
Swebench Verified Mini | HAL Generalist Agent | 24.00% | $72.98 | No |
Taubench Airline | TAU-bench Few Shot | 60.00% | $37.23 | No |
Taubench Airline | HAL Generalist Agent | 44.00% | $34.58 | No |
Usaco | USACO Episodic + Semantic | 26.71% | $56.43 | No |
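
The Pareto-frontier flag described above can be sketched in a few lines of code. The example below is a minimal illustration, not the leaderboard's own implementation: the `Run` type, the function names, and the four data points are all hypothetical. On the leaderboard the frontier for a benchmark is determined across the results of all evaluated models, not only the rows shown on this page, so the rows above cannot be used on their own to reproduce the Yes/No column.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One (agent, model) result on a benchmark. Hypothetical type."""
    name: str
    accuracy: float  # percent, higher is better
    cost: float      # USD, lower is better


def dominates(a: Run, b: Run) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (
        a.accuracy >= b.accuracy
        and a.cost <= b.cost
        and (a.accuracy > b.accuracy or a.cost < b.cost)
    )


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep only the runs that no other run dominates."""
    return [r for r in runs if not any(dominates(other, r) for other in runs)]


# Hypothetical results for a single benchmark (not taken from the table).
runs = [
    Run("run_a", 60.0, 100.0),
    Run("run_b", 40.0, 20.0),
    Run("run_c", 35.0, 50.0),   # dominated by run_b: less accurate and pricier
    Run("run_d", 60.0, 150.0),  # dominated by run_a: same accuracy, pricier
]

frontier_names = [r.name for r in pareto_frontier(runs)]
print(frontier_names)  # ['run_a', 'run_b']
```

Note that a cheap, less accurate run (`run_b`) and an expensive, more accurate run (`run_a`) can both sit on the frontier, since neither is better on both axes at once.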