# DeepSeek R1 (January 2025)

Performance overview across all HAL benchmarks.
- **Benchmarks:** 7
- **Agents:** 8
- **Pareto-optimal benchmarks:** 0

**Token pricing**

- Input: $3 per 1M tokens
- Output: $7 per 1M tokens
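At these rates, a run's dollar cost follows directly from its token counts. A minimal sketch (the token counts in the example are hypothetical, not taken from any benchmark run):

```python
# Estimate run cost at DeepSeek R1's listed rates:
# $3 per 1M input tokens, $7 per 1M output tokens.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 7.00

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a run given its token usage."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# e.g. a hypothetical run with 2M input and 0.5M output tokens:
print(run_cost(2_000_000, 500_000))  # → 9.5
```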
## Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark; models on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
| Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| CORE-Bench Hard | CORE-Agent | 6.67% | $81.11 | No |
| CORE-Bench Hard | HAL Generalist Agent | 2.22% | $13.87 | No |
| GAIA | HAL Generalist Agent | 30.30% | $73.19 | No |
| GAIA | HF Open Deep Research | 24.85% | $143.08 | No |
| Online Mind2Web | Browser-Use | 25.33% | $280.93 | No |
| ScienceAgentBench | SAB Self-Debug | 23.53% | $18.24 | No |
| SWE-bench Verified Mini | HAL Generalist Agent | 6.00% | $146.71 | No |
| SWE-bench Verified Mini | SWE-Agent | 0.00% | $4.16 | No |
| TAU-bench Airline | TAU-bench Tool Calling | 36.00% | $13.30 | No |
| TAU-bench Airline | HAL Generalist Agent | 10.00% | $30.18 | No |
| USACO | USACO Episodic + Semantic | 38.11% | $80.04 | No |
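The Pareto-frontier check described above can be sketched as a dominance test over (accuracy, cost) pairs. The points below are hypothetical, since the real frontier is computed across all models evaluated on a benchmark, not just the rows in this table:

```python
def pareto_frontier(points):
    """points: list of (name, accuracy, cost) tuples.

    A point is on the Pareto frontier if no other point achieves
    strictly better on one axis (higher accuracy or lower cost)
    without being worse on the other. Returns names in input order.
    """
    frontier = []
    for name, acc, cost in points:
        dominated = any(
            (a > acc and c <= cost) or (a >= acc and c < cost)
            for n, a, c in points
            if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical agents: A is most accurate, B is cheapest,
# C is beaten by B on both axes and falls off the frontier.
print(pareto_frontier([
    ("A", 30.0, 50.0),
    ("B", 25.0, 20.0),
    ("C", 20.0, 40.0),
]))  # → ['A', 'B']
```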