Claude 3.7 Sonnet (February 2025)

Performance overview across all HAL benchmarks

Benchmarks: 9
Agents: 11
Pareto-optimal benchmarks: 0

Token Pricing

Input tokens: $3 per 1M tokens
Output tokens: $15 per 1M tokens
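
As a quick sanity check on how these rates turn into dollar figures like the costs in the table below, here is a minimal sketch; the function name and token counts are illustrative, not part of HAL.

```python
# Illustrative cost arithmetic at the listed rates; the token counts
# in the example are hypothetical, not taken from HAL.

INPUT_USD_PER_MTOK = 3.00    # $3 per 1M input tokens
OUTPUT_USD_PER_MTOK = 15.00  # $15 per 1M output tokens

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a run at the listed per-token prices."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000

# Example: 2,000,000 input + 400,000 output tokens
# -> 2 * $3 + 0.4 * $15 = $6.00 + $6.00 = $12.00
print(f"${run_cost(2_000_000, 400_000):.2f}")  # $12.00
```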

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, i.e., no other model reached at least the same accuracy at a lower cost. Models on the Pareto frontier represent the current state of the art in cost-efficiency at their performance level.
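
To make that definition concrete, below is a minimal sketch of the dominance test in Python, under the assumption that the frontier treats lower cost and higher accuracy as the two objectives; the function names are illustrative, not HAL's implementation.

```python
# Sketch of the Pareto-dominance test implied above (not HAL's code).
# A point is (cost_usd, accuracy_pct); lower cost and higher accuracy
# are both better.

def dominates(q: tuple[float, float], p: tuple[float, float]) -> bool:
    """True if q is at least as good as p on both axes and strictly better on one."""
    return q[0] <= p[0] and q[1] >= p[1] and q != p

def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Points not dominated by any other point in the list."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```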

| Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| AssistantBench | Browser-Use | 16.69% | $56.00 | No |
| CORE-Bench Hard | CORE-Agent | 35.56% | $73.04 | No |
| CORE-Bench Hard | HAL Generalist Agent | 31.11% | $56.64 | No |
| GAIA | HAL Generalist Agent | 56.36% | $130.68 | No |
| GAIA | HF Open Deep Research | 36.97% | $415.15 | No |
| Online Mind2Web | Browser-Use | 38.33% | $926.48 | No |
| Online Mind2Web | SeeAct | 28.33% | $291.97 | No |
| SciCode | SciCode Tool Calling Agent | 3.08% | $191.41 | No |
| SciCode | SciCode Zero Shot Agent | 0.00% | $5.10 | No |
| ScienceAgentBench | SAB Self-Debug | 22.55% | $7.12 | No |
| ScienceAgentBench | HAL Generalist Agent | 10.78% | $41.22 | No |
| SWE-bench Verified Mini | SWE-Agent | 50.00% | $402.69 | No |
| SWE-bench Verified Mini | HAL Generalist Agent | 26.00% | $117.43 | No |
| TAU-bench Airline | HAL Generalist Agent | 56.00% | $42.11 | No |
| TAU-bench Airline | TAU-bench Few Shot | 34.00% | $36.45 | No |
| USACO | USACO Episodic + Semantic | 29.32% | $38.70 | No |
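
As a worked example, applying the same dominance test to the two GAIA rows above shows how one run can dominate another. Note that the leaderboard's frontier is presumably computed across all models on a benchmark, not just this model's runs, which is why even a locally dominant run can read "No".

```python
# Worked example on the two GAIA rows above (cost USD, accuracy %).
# The leaderboard frontier spans all models, so a row can read "No"
# even when it dominates this model's other runs on the benchmark.

gaia = {
    "HAL Generalist Agent": (130.68, 56.36),
    "HF Open Deep Research": (415.15, 36.97),
}

def dominates(q, p):
    # Cheaper-or-equal, at-least-as-accurate, strictly better somewhere.
    return q[0] <= p[0] and q[1] >= p[1] and q != p

for name, p in gaia.items():
    nondom = not any(dominates(q, p) for q in gaia.values())
    print(f"{name}: locally non-dominated = {nondom}")
# -> HAL Generalist Agent: True  (cheaper and more accurate)
# -> HF Open Deep Research: False (dominated on both axes)
```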