DeepSeek V3

Performance overview across all HAL benchmarks

Benchmarks: 9
Agents: 10
Pareto Benchmarks: 2

Token Pricing

Input tokens: $0.20 per 1M tokens
Output tokens: $0.80 per 1M tokens
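Given the per-token prices above, the cost of a single request is a simple weighted sum of input and output token counts. A minimal sketch (the function name and structure are illustrative, not part of any official API):

```python
# Prices from the table above: USD per 1M tokens.
INPUT_PRICE_PER_M = 0.20
OUTPUT_PRICE_PER_M = 0.80

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed token prices."""
    return (input_tokens / 1_000_000 * INPUT_PRICE_PER_M
            + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M)

# Example: 1M input tokens and 500K output tokens cost about $0.60.
print(round(request_cost(1_000_000, 500_000), 2))
```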

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this model achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, meaning no other model reached equal or higher accuracy at equal or lower cost. Models on the Pareto frontier represent the current state-of-the-art efficiency at their performance level.

| Benchmark | Agent | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| AssistantBench | Browser-Use | 2.03% | $12.66 | No |
| CoreBench Hard | CORE-Agent | 17.78% | $25.26 | No |
| CoreBench Hard | HAL Generalist Agent | 8.89% | $0.76 | Yes |
| GAIA | HAL Generalist Agent | 36.36% | $29.27 | No |
| GAIA | HF Open Deep Research | 28.48% | $76.64 | No |
| Online Mind2Web | Browser-Use | 32.33% | $214.74 | No |
| SciCode | SciCode Zero Shot Agent | 3.08% | $0.79 | No |
| SciCode | SciCode Tool Calling Agent | 0.00% | $52.11 | No |
| ScienceAgentBench | SAB Self-Debug | 15.69% | $2.09 | No |
| ScienceAgentBench | HAL Generalist Agent | 0.98% | $55.73 | No |
| SWE-bench Verified Mini | SWE-Agent | 24.00% | $11.77 | No |
| SWE-bench Verified Mini | HAL Generalist Agent | 10.00% | $30.17 | No |
| TAU-bench Airline | TAU-bench Few Shot | 34.00% | $30.60 | No |
| TAU-bench Airline | HAL Generalist Agent | 18.00% | $10.73 | No |
| USACO | USACO Episodic + Semantic | 39.09% | $12.08 | Yes |
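The Pareto check described above can be sketched in a few lines: an (accuracy, cost) entry is dominated if some other entry on the same benchmark has accuracy at least as high and cost at least as low, with at least one strict. Note that the Yes/No flags in the table are computed across all models on a benchmark, not just the rows shown here, so this sketch uses hypothetical entries for illustration:

```python
def pareto_frontier(entries):
    """Return the entries not dominated by any other entry.

    Entry b dominates entry a when b has accuracy >= a's and cost <= a's,
    and is strictly better on at least one of the two.
    """
    frontier = []
    for a in entries:
        dominated = any(
            b["accuracy"] >= a["accuracy"] and b["cost"] <= a["cost"]
            and (b["accuracy"] > a["accuracy"] or b["cost"] < a["cost"])
            for b in entries
        )
        if not dominated:
            frontier.append(a)
    return frontier

# Hypothetical entries: the cheap low-accuracy run and the expensive
# high-accuracy run both survive; the middle one is dominated.
runs = [
    {"agent": "A", "accuracy": 17.78, "cost": 25.26},
    {"agent": "B", "accuracy": 8.89, "cost": 0.76},
    {"agent": "C", "accuracy": 2.00, "cost": 5.00},
]
print([r["agent"] for r in pareto_frontier(runs)])  # → ['A', 'B']
```

Higher accuracy and lower cost are both desirable, which is why the frontier can contain several entries: each represents a different but non-dominated trade-off.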