TAU-bench Tool Calling
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 12
Pareto Optimal Runs: 2
Models Used
o4-mini High (April 2025)
o3 Medium (April 2025)
Claude-3.7 Sonnet High (February 2025)
Claude Opus 4.1 High (August 2025)
Claude Opus 4.1 (August 2025)
GPT-5 Medium (August 2025)
DeepSeek V3 (March 2025)
Claude-3.7 Sonnet (February 2025)
o4-mini Low (April 2025)
GPT-4.1 (April 2025)
DeepSeek R1 (January 2025)
Gemini 2.0 Flash High (February 2025)
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this agent, run with the listed model, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the current state-of-the-art cost efficiency for their accuracy level.
| Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
|---|---|---|---|---|
| Taubench Airline | o4-mini High (April 2025) | 56.00% | $11.36 | Yes |
| Taubench Airline | o3 Medium (April 2025) | 54.00% | $14.56 | No |
| Taubench Airline | Claude-3.7 Sonnet High (February 2025) | 52.00% | $31.94 | No |
| Taubench Airline | Claude Opus 4.1 High (August 2025) | 52.00% | $149.98 | No |
| Taubench Airline | Claude Opus 4.1 (August 2025) | 50.00% | $69.78 | No |
| Taubench Airline | GPT-5 Medium (August 2025) | 48.00% | $23.83 | No |
| Taubench Airline | DeepSeek V3 (March 2025) | 44.00% | $5.43 | No |
| Taubench Airline | Claude-3.7 Sonnet (February 2025) | 44.00% | $15.45 | No |
| Taubench Airline | o4-mini Low (April 2025) | 36.00% | $7.14 | No |
| Taubench Airline | GPT-4.1 (April 2025) | 36.00% | $8.18 | No |
| Taubench Airline | DeepSeek R1 (January 2025) | 36.00% | $13.30 | No |
| Taubench Airline | Gemini 2.0 Flash High (February 2025) | 28.00% | $0.31 | Yes |
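
To make the accuracy-versus-cost trade-off concrete, here is a minimal Python sketch of a Pareto check. It assumes the plain non-dominated definition: a run is on the frontier if no other run is at least as accurate and at least as cheap, with at least one of the two strictly better. The leaderboard's exact criterion is not stated on this page and may differ, so the sketch is not guaranteed to reproduce the "On the Pareto Frontier?" column for every row. The `Run` type and `pareto_frontier` function are hypothetical names chosen for illustration, and the input is a four-row subset of the table above.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One leaderboard row (hypothetical helper type for this sketch)."""
    model: str
    accuracy: float  # percent
    cost: float      # US dollars


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that no other run dominates.

    Assumed definition: `a` dominates `b` when `a` is at least as accurate
    and at least as cheap as `b`, and strictly better on one of the two.
    """
    def dominates(a: Run, b: Run) -> bool:
        return (
            a.accuracy >= b.accuracy
            and a.cost <= b.cost
            and (a.accuracy > b.accuracy or a.cost < b.cost)
        )

    # Keep each run only if nothing else in the list dominates it.
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


# A subset of rows copied from the table above.
runs = [
    Run("o4-mini High (April 2025)", 56.0, 11.36),
    Run("o3 Medium (April 2025)", 54.0, 14.56),
    Run("Claude-3.7 Sonnet High (February 2025)", 52.0, 31.94),
    Run("Gemini 2.0 Flash High (February 2025)", 28.0, 0.31),
]

frontier = pareto_frontier(runs)
for run in sorted(frontier, key=lambda r: r.cost):
    print(f"{run.model}: {run.accuracy:.2f}% at ${run.cost:.2f}")
```

On this subset, o3 Medium and Claude-3.7 Sonnet High are each dominated by o4-mini High, which is both more accurate and cheaper, so only o4-mini High and Gemini 2.0 Flash High remain.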