DeepSeek V3
Performance overview across all HAL benchmarks
9 Benchmarks
10 Agents
2 Pareto Benchmarks
Token Pricing
Input Tokens: $0.20 per 1M tokens
Output Tokens: $0.80 per 1M tokens
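As a rough illustration of how these per-token prices translate into dollars, the sketch below multiplies token counts by the listed rates. The function name `estimate_cost` and the token counts used with it are hypothetical, and the sketch assumes cost is simply tokens times the listed price; the leaderboard's own cost accounting may differ.

```python
# Listed prices for DeepSeek V3, in USD per 1M tokens (taken from the pricing above).
INPUT_PRICE_PER_M = 0.2
OUTPUT_PRICE_PER_M = 0.8


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a run from its token counts.

    Assumption: cost is linear in tokens at the listed rates, with no
    caching discounts or other adjustments.
    """
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000
```

For example, a hypothetical run that consumes 2.5M input tokens and 0.5M output tokens would come to about $0.90 under this assumption (`estimate_cost(2_500_000, 500_000)`).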
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, meaning, in the usual sense of the term, that no other entry was both cheaper and more accurate. Entries on the Pareto frontier represent the most cost-efficient results currently available at their accuracy level.
Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
Assistantbench | Browser-Use | 2.03% | $12.66 | No |
Corebench Hard | CORE-Agent | 17.78% | $25.26 | No |
Corebench Hard | HAL Generalist Agent | 8.89% | $0.76 | Yes |
Gaia | HAL Generalist Agent | 36.36% | $29.27 | No |
Gaia | HF Open Deep Research | 28.48% | $76.64 | No |
Online Mind2Web | Browser-Use | 32.33% | $214.74 | No |
Scicode | Scicode Zero Shot Agent | 3.08% | $0.79 | No |
Scicode | Scicode Tool Calling Agent | 0.00% | $52.11 | No |
Scienceagentbench | SAB Self-Debug | 15.69% | $2.09 | No |
Scienceagentbench | HAL Generalist Agent | 0.98% | $55.73 | No |
Swebench Verified Mini | SWE-Agent | 24.00% | $11.77 | No |
Swebench Verified Mini | HAL Generalist Agent | 10.00% | $30.17 | No |
Taubench Airline | TAU-bench Few Shot | 34.00% | $30.60 | No |
Taubench Airline | HAL Generalist Agent | 18.00% | $10.73 | No |
Usaco | USACO Episodic + Semantic | 39.09% | $12.08 | Yes |
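The Pareto-frontier flag can be illustrated with a short sketch. It assumes the standard definition (an entry is on the frontier if no other entry is at least as accurate at equal or lower cost) and is not the leaderboard's actual implementation. The `Run` type, the `pareto_frontier` function, and the sample entries are hypothetical. Note that the frontier for a benchmark is determined across all evaluated models, so it cannot be recomputed from this single model's rows alone.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    agent: str       # hypothetical agent label
    accuracy: float  # accuracy in percent
    cost: float      # total cost in USD


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs not dominated on (lower cost, higher accuracy)."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, try the more
    # accurate run first so it shadows the weaker one.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        # A run is kept only if it beats every cheaper-or-equal run on accuracy.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Made-up entries for illustration only; these are not leaderboard results.
runs = [
    Run("agent-a", 10.0, 1.00),
    Run("agent-b", 25.0, 5.00),
    Run("agent-c", 20.0, 8.00),   # dominated by agent-b: costlier and less accurate
    Run("agent-d", 40.0, 20.00),
]

frontier_agents = [run.agent for run in pareto_frontier(runs)]
```

In this made-up example, `agent-c` falls off the frontier because `agent-b` is both cheaper and more accurate, while the other three each offer the best accuracy available at their cost.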