TAU-bench Tool Calling

Agent performance overview across all HAL benchmarks

Benchmarks: 1
Models Used: 12
Pareto Optimal Runs: 2

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this agent's run achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the most cost-efficient results currently recorded at their accuracy level.

Benchmark         Model                                   Accuracy  Cost     On the Pareto Frontier?
Taubench Airline  o4-mini High (April 2025)               56.00%    $11.36   Yes
Taubench Airline  o3 Medium (April 2025)                  54.00%    $14.56   No
Taubench Airline  Claude-3.7 Sonnet High (February 2025)  52.00%    $31.94   No
Taubench Airline  Claude Opus 4.1 High (August 2025)      52.00%    $149.98  No
Taubench Airline  Claude Opus 4.1 (August 2025)           50.00%    $69.78   No
Taubench Airline  GPT-5 Medium (August 2025)              48.00%    $23.83   No
Taubench Airline  DeepSeek V3 (March 2025)                44.00%    $5.43    No
Taubench Airline  Claude-3.7 Sonnet (February 2025)       44.00%    $15.45   No
Taubench Airline  o4-mini Low (April 2025)                36.00%    $7.14    No
Taubench Airline  GPT-4.1 (April 2025)                    36.00%    $8.18    No
Taubench Airline  DeepSeek R1 (January 2025)              36.00%    $13.30   No
Taubench Airline  Gemini 2.0 Flash High (February 2025)   28.00%    $0.31    Yes
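
The accuracy/cost trade-off described above can be explored with a short script. The following is a minimal sketch using the textbook dominance rule (a run is dominated if another run is at least as accurate and at least as cheap, and strictly better on one of the two). The names `Run`, `dominates`, and `pareto_frontier` are hypothetical helpers introduced for illustration; the page does not state the exact rule the leaderboard uses, so this is not necessarily how the "On the Pareto Frontier?" column is computed.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent, as listed in the table
    cost: float      # US dollars, as listed in the table


# Rows copied from the Taubench Airline table above.
RUNS = [
    Run("o4-mini High (April 2025)", 56.00, 11.36),
    Run("o3 Medium (April 2025)", 54.00, 14.56),
    Run("Claude-3.7 Sonnet High (February 2025)", 52.00, 31.94),
    Run("Claude Opus 4.1 High (August 2025)", 52.00, 149.98),
    Run("Claude Opus 4.1 (August 2025)", 50.00, 69.78),
    Run("GPT-5 Medium (August 2025)", 48.00, 23.83),
    Run("DeepSeek V3 (March 2025)", 44.00, 5.43),
    Run("Claude-3.7 Sonnet (February 2025)", 44.00, 15.45),
    Run("o4-mini Low (April 2025)", 36.00, 7.14),
    Run("GPT-4.1 (April 2025)", 36.00, 8.18),
    Run("DeepSeek R1 (January 2025)", 36.00, 13.30),
    Run("Gemini 2.0 Flash High (February 2025)", 28.00, 0.31),
]


def dominates(a: Run, b: Run) -> bool:
    """Return True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep only the runs that no other run dominates (assumed textbook rule)."""
    return [r for r in runs if not any(dominates(other, r) for other in runs)]


frontier = pareto_frontier(RUNS)
for run in sorted(frontier, key=lambda r: r.cost):
    print(f"{run.model}: {run.accuracy:.2f}% at ${run.cost:.2f}")
```

Under this simple rule the script keeps o4-mini High and Gemini 2.0 Flash High, which matches the two runs marked "Yes", but it also keeps DeepSeek V3 (44.00%, $5.43), since no listed run is both cheaper and at least as accurate. The leaderboard's own criterion therefore appears to differ from plain dominance in some way this page does not specify.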