TAU-bench Few Shot
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 14
Pareto Runs: 3
Models Used
Claude Opus 4 High (May 2025)
Claude Opus 4.1 High (August 2025)
o4-mini High (April 2025)
Claude-3.7 Sonnet High (February 2025)
GPT-4.1 (April 2025)
Claude Opus 4 (May 2025)
Claude Opus 4.1 (August 2025)
GPT-5 Medium (August 2025)
o4-mini Low (April 2025)
o3 Medium (April 2025)
Gemini 2.0 Flash
DeepSeek R1
Claude-3.7 Sonnet (February 2025)
DeepSeek V3
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Agents on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.
Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
Taubench Airline | Claude Opus 4 High (May 2025) | 66.00% | $313.83 | Yes |
Taubench Airline | Claude Opus 4.1 High (August 2025) | 62.00% | $298.58 | No |
Taubench Airline | o4-mini High (April 2025) | 60.00% | $18.92 | Yes |
Taubench Airline | Claude-3.7 Sonnet High (February 2025) | 60.00% | $37.23 | No |
Taubench Airline | GPT-4.1 (April 2025) | 56.00% | $42.58 | No |
Taubench Airline | Claude Opus 4 (May 2025) | 56.00% | $363.30 | No |
Taubench Airline | Claude Opus 4.1 (August 2025) | 54.00% | $294.17 | No |
Taubench Airline | GPT-5 Medium (August 2025) | 52.00% | $35.49 | No |
Taubench Airline | o4-mini Low (April 2025) | 48.00% | $18.81 | No |
Taubench Airline | o3 Medium (April 2025) | 46.00% | $204.11 | No |
Taubench Airline | Gemini 2.0 Flash | 44.00% | $4.44 | Yes |
Taubench Airline | DeepSeek R1 | 36.00% | $31.75 | No |
Taubench Airline | Claude-3.7 Sonnet (February 2025) | 34.00% | $36.45 | No |
Taubench Airline | DeepSeek V3 | 34.00% | $30.60 | No |
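
The Yes/No flags above appear consistent with a frontier taken as the upper convex hull of the (cost, accuracy) points, not with plain non-dominance: o4-mini Low, for example, is cheaper than every more accurate run yet is marked No. This page does not state how the frontier is computed, so the hull definition is an assumption. Under that assumption, a minimal Python sketch that reproduces the three "Yes" rows could look like this (the names `RUNS`, `cross`, and `hull_frontier` are hypothetical, chosen for illustration):

```python
# Sketch only: assumes the frontier is the upper convex hull of
# (cost, accuracy) points. The page itself does not confirm this.

# (model, accuracy in %, cost in USD), copied from the table above.
RUNS = [
    ("Claude Opus 4 High (May 2025)", 66.0, 313.83),
    ("Claude Opus 4.1 High (August 2025)", 62.0, 298.58),
    ("o4-mini High (April 2025)", 60.0, 18.92),
    ("Claude-3.7 Sonnet High (February 2025)", 60.0, 37.23),
    ("GPT-4.1 (April 2025)", 56.0, 42.58),
    ("Claude Opus 4 (May 2025)", 56.0, 363.30),
    ("Claude Opus 4.1 (August 2025)", 54.0, 294.17),
    ("GPT-5 Medium (August 2025)", 52.0, 35.49),
    ("o4-mini Low (April 2025)", 48.0, 18.81),
    ("o3 Medium (April 2025)", 46.0, 204.11),
    ("Gemini 2.0 Flash", 44.0, 4.44),
    ("DeepSeek R1", 36.0, 31.75),
    ("Claude-3.7 Sonnet (February 2025)", 34.0, 36.45),
    ("DeepSeek V3", 34.0, 30.60),
]


def cross(o, a, b):
    """Cross product of vectors o->a and o->b, with x = cost and y = accuracy."""
    return (a[2] - o[2]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[2] - o[2])


def hull_frontier(runs):
    """Return the runs on the upper convex hull, ordered by increasing cost."""
    # Step 1: keep only non-dominated runs (each strictly more accurate
    # than every cheaper-or-equal-cost run seen so far).
    staircase = []
    best_accuracy = float("-inf")
    for run in sorted(runs, key=lambda r: (r[2], -r[1])):
        if run[1] > best_accuracy:
            staircase.append(run)
            best_accuracy = run[1]

    # Step 2: drop runs that lie on or below the line joining their neighbours.
    hull = []
    for run in staircase:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], run) >= 0:
            hull.pop()
        hull.append(run)
    return hull


if __name__ == "__main__":
    for model, accuracy, cost in hull_frontier(RUNS):
        print(f"{model}: {accuracy:.2f}% at ${cost:.2f}")
```

Stopping after step 1 would leave five runs, including o4-mini Low and Claude Opus 4.1 High. The hull step removes those two, which matches the three Pareto runs reported at the top of the page.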