DeepSeek R1

Performance overview across all HAL benchmarks

Benchmarks: 9
Agents: 10
Pareto-optimal benchmarks: 2

Token Pricing

Input tokens: $0.20 per 1M tokens
Output tokens: $0.80 per 1M tokens
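
At these rates, a run's dollar cost is a simple linear function of its token counts. Below is a minimal sketch of that calculation; the token counts in the example are hypothetical, chosen only for illustration:

```python
# Estimated run cost for DeepSeek R1 at the listed rates.
INPUT_PRICE_PER_M = 0.20   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.80  # USD per 1M output tokens

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a run given raw token counts."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Hypothetical run: 12M input tokens and 1.5M output tokens.
print(f"${run_cost(12_000_000, 1_500_000):.2f}")  # $3.60
```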

Benchmark Performance

"On the Pareto Frontier?" indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark: no other model on the leaderboard was both at least as accurate and cheaper. Models on the Pareto frontier represent the current state-of-the-art efficiency for their performance level. A sketch of this test follows the table.

| Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| AssistantBench | Browser-Use | 8.75% | $6.62 | No |
| CORE-Bench Hard | CORE-Agent | 8.89% | $9.05 | No |
| CORE-Bench Hard | HAL Generalist Agent | 8.89% | $2.83 | No |
| GAIA | HAL Generalist Agent | 30.30% | $5.47 | No |
| GAIA | HF Open Deep Research | 24.85% | $11.10 | No |
| Online Mind2Web | Browser-Use | 25.33% | $21.26 | No |
| SciCode | SciCode Tool Calling Agent | 0.00% | $20.98 | No |
| SciCode | SciCode Zero Shot Agent | 0.00% | $0.80 | No |
| ScienceAgentBench | SAB Self-Debug | 23.53% | $1.87 | Yes |
| SWE-bench Verified Mini | HAL Generalist Agent | 6.00% | $10.32 | No |
| SWE-bench Verified Mini | SWE-Agent | 0.00% | $0.41 | Yes |
| TAU-bench Airline | TAU-bench Few Shot | 36.00% | $5.66 | No |
| TAU-bench Airline | HAL Generalist Agent | 10.00% | $2.91 | No |
| USACO | USACO Episodic + Semantic | 38.11% | $8.18 | No |
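
To make the Pareto-frontier test above concrete, here is a minimal Python sketch. It assumes the leaderboard compares (accuracy, cost) pairs across all models on a given benchmark and keeps the non-dominated ones; the model names and numbers in the example are hypothetical:

```python
from typing import NamedTuple

class Entry(NamedTuple):
    model: str
    accuracy: float  # percent, higher is better
    cost: float      # USD, lower is better

def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Keep entries not dominated by any other entry, where 'dominated'
    means another entry is at least as accurate and at most as costly,
    with at least one strict improvement."""
    def dominated(e: Entry) -> bool:
        return any(
            o.accuracy >= e.accuracy and o.cost <= e.cost
            and (o.accuracy > e.accuracy or o.cost < e.cost)
            for o in entries
        )
    return [e for e in entries if not dominated(e)]

# Hypothetical benchmark results for illustration.
results = [
    Entry("model-a", 23.53, 1.87),
    Entry("model-b", 30.00, 12.40),
    Entry("model-c", 18.00, 3.10),  # dominated by model-a
]
for e in pareto_frontier(results):
    print(e.model, e.accuracy, e.cost)  # model-a and model-b survive
```

Note that under this test a 0.00%-accuracy run can still sit on the frontier if no other entry matches its cost, which is how SWE-Agent's $0.41 run on SWE-bench Verified Mini earns a "Yes" above.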