Claude-3.7 Sonnet (February 2025)
Performance overview across all HAL benchmarks
- Benchmarks: 9
- Agents: 11
- Pareto-optimal benchmarks: 0
Token Pricing
- Input tokens: $3 per 1M tokens
- Output tokens: $15 per 1M tokens
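As a hypothetical illustration of how these rates translate into run costs (the token counts here are invented, not taken from HAL): a run that consumes 10M input tokens and 2M output tokens would cost 10 × $3 + 2 × $15 = $60.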
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on a given benchmark: a run is on the frontier when no other run is both at least as accurate and at least as cheap, so frontier models represent the best accuracy currently attainable at their cost level. A sketch of how such a frontier can be computed follows the table.
| Benchmark | Agent | Accuracy | Cost (USD) | On the Pareto Frontier? |
|---|---|---|---|---|
| AssistantBench | Browser-Use | 16.69% | $56.00 | No |
| CORE-Bench Hard | CORE-Agent | 35.56% | $73.04 | No |
| CORE-Bench Hard | HAL Generalist Agent | 31.11% | $56.64 | No |
| GAIA | HAL Generalist Agent | 56.36% | $130.68 | No |
| GAIA | HF Open Deep Research | 36.97% | $415.15 | No |
| Online Mind2Web | Browser-Use | 38.33% | $926.48 | No |
| Online Mind2Web | SeeAct | 28.33% | $291.97 | No |
| SciCode | Scicode Tool Calling Agent | 3.08% | $191.41 | No |
| SciCode | Scicode Zero Shot Agent | 0.00% | $5.10 | No |
| ScienceAgentBench | SAB Self-Debug | 22.55% | $7.12 | No |
| ScienceAgentBench | HAL Generalist Agent | 10.78% | $41.22 | No |
| SWE-bench Verified Mini | SWE-Agent | 50.00% | $402.69 | No |
| SWE-bench Verified Mini | HAL Generalist Agent | 26.00% | $117.43 | No |
| TAU-bench Airline | HAL Generalist Agent | 56.00% | $42.11 | No |
| TAU-bench Airline | TAU-bench Few Shot | 34.00% | $36.45 | No |
| USACO | USACO Episodic + Semantic | 29.32% | $38.70 | No |
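For readers unfamiliar with the metric, the sketch below shows one way a Pareto frontier can be computed from (accuracy, cost) pairs. It is a minimal illustration, not HAL's actual scoring code, and the agent names and numbers in the example are invented.

```python
from typing import List, Tuple

def pareto_frontier(runs: List[Tuple[str, float, float]]) -> List[str]:
    """Return names of runs that are Pareto-optimal on (accuracy, cost).

    Each run is (name, accuracy_pct, cost_usd). A run is dominated when some
    other run is at least as accurate AND at least as cheap, and strictly
    better on at least one of the two dimensions.
    """
    frontier = []
    for name, acc, cost in runs:
        dominated = any(
            o_acc >= acc and o_cost <= cost and (o_acc > acc or o_cost < cost)
            for o_name, o_acc, o_cost in runs
            if o_name != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Illustrative, made-up runs (not HAL data):
runs = [
    ("agent-a", 56.4, 130.7),  # most accurate
    ("agent-b", 37.0, 415.2),  # dominated by agent-a: less accurate and more expensive
    ("agent-c", 50.0, 90.0),   # cheapest run with reasonable accuracy
]
print(pareto_frontier(runs))  # -> ['agent-a', 'agent-c']
```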