TAU-bench Few Shot

Agent performance overview across all HAL benchmarks

Benchmarks: 1
Models Used: 14
Pareto Runs: 3

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Agents on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.

Benchmark         Model                                   Accuracy  Cost     On the Pareto Frontier?
Taubench Airline  Claude Opus 4 High (May 2025)           66.00%    $313.83  Yes
Taubench Airline  Claude Opus 4.1 High (August 2025)      62.00%    $298.58  No
Taubench Airline  o4-mini High (April 2025)               60.00%    $18.92   Yes
Taubench Airline  Claude-3.7 Sonnet High (February 2025)  60.00%    $37.23   No
Taubench Airline  GPT-4.1 (April 2025)                    56.00%    $42.58   No
Taubench Airline  Claude Opus 4 (May 2025)                56.00%    $363.30  No
Taubench Airline  Claude Opus 4.1 (August 2025)           54.00%    $294.17  No
Taubench Airline  GPT-5 Medium (August 2025)              52.00%    $35.49   No
Taubench Airline  o4-mini Low (April 2025)                48.00%    $18.81   No
Taubench Airline  o3 Medium (April 2025)                  46.00%    $204.11  No
Taubench Airline  Gemini 2.0 Flash                        44.00%    $4.44    Yes
Taubench Airline  DeepSeek R1                             36.00%    $31.75   No
Taubench Airline  Claude-3.7 Sonnet (February 2025)       34.00%    $36.45   No
Taubench Airline  DeepSeek V3                             34.00%    $30.60   No
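
The Pareto flags in the table can be reproduced with a short Python sketch. The page does not state how the frontier is computed. The sketch assumes it is the upper convex hull of the (cost, accuracy) points, cut off at the most accurate run. That assumption is inferred only from the numbers above. The names `RUNS` and `hull_frontier` are hypothetical, and the rows are copied from the table.

```python
from typing import List, Tuple

Run = Tuple[str, float, float]  # (model, accuracy in %, cost in USD)

# Rows copied from the Taubench Airline table.
RUNS: List[Run] = [
    ("Claude Opus 4 High (May 2025)", 66.0, 313.83),
    ("Claude Opus 4.1 High (August 2025)", 62.0, 298.58),
    ("o4-mini High (April 2025)", 60.0, 18.92),
    ("Claude-3.7 Sonnet High (February 2025)", 60.0, 37.23),
    ("GPT-4.1 (April 2025)", 56.0, 42.58),
    ("Claude Opus 4 (May 2025)", 56.0, 363.30),
    ("Claude Opus 4.1 (August 2025)", 54.0, 294.17),
    ("GPT-5 Medium (August 2025)", 52.0, 35.49),
    ("o4-mini Low (April 2025)", 48.0, 18.81),
    ("o3 Medium (April 2025)", 46.0, 204.11),
    ("Gemini 2.0 Flash", 44.0, 4.44),
    ("DeepSeek R1", 36.0, 31.75),
    ("Claude-3.7 Sonnet (February 2025)", 34.0, 36.45),
    ("DeepSeek V3", 34.0, 30.60),
]


def _cross(o: Run, a: Run, b: Run) -> float:
    """Cross product of o->a and o->b with x = cost and y = accuracy."""
    return (a[2] - o[2]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[2] - o[2])


def hull_frontier(runs: List[Run]) -> List[Run]:
    """Upper convex hull over (cost, accuracy), truncated at peak accuracy.

    Assumption: this is only one plausible frontier definition, not a
    documented method. Ties for the lowest cost are not handled specially.
    """
    pts = sorted(runs, key=lambda r: (r[2], r[1]))  # cheapest first
    hull: List[Run] = []
    for p in pts:
        # Drop the previous point while it lies on or below the line
        # from its predecessor to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    if not hull:
        return []
    # Points past the most accurate run cost more for less accuracy.
    best = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: best + 1]


frontier = hull_frontier(RUNS)
for name, acc, cost in frontier:
    print(f"{name}: {acc:.2f}% at ${cost:.2f}")
```

On these rows the sketch keeps Gemini 2.0 Flash, o4-mini High, and Claude Opus 4 High, which are the three runs marked Yes. A plain dominance check (no other run is at least as accurate and at least as cheap) would also keep Claude Opus 4.1 High and o4-mini Low, so it does not match the table.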