TAU-bench Airline

TAU-bench is a benchmark for Tool-Agent-User Interaction in Real-World Domains. TAU-bench Airline evaluates AI agents on tasks in the airline domain, such as changing flights or finding new flights.

Paper: τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (Yao et al., 2024)

Tasks in Public Test Set: 50
Agents Evaluated: 7

TAU-bench Airline Leaderboard

Rank  Agent / Model                 Accuracy  Cost (USD)  Runs
1     o1-2024-12-17 med.            54.00%    $109.31     1
2     gpt-4.5-preview-2025-02-27    46.00%    $422.75     1
3     claude-3-7-sonnet-20250219    46.00%    $18.65      1
4     claude-3-5-sonnet-20241022    44.00%    $15.29      1
5     gpt-4o-2024-11-20             44.00%    $17.30      1
6     o3-mini-2025-01-31 med.       38.00%    $11.94      1
7     gpt-4o-mini-2024-07-18        20.00%    $2.71       1

Accuracy vs. Cost Frontier for TAU-bench Airline

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
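As a rough illustration of how such a frontier can be derived, the sketch below computes the set of non-dominated agents from the accuracy and cost figures in the leaderboard. The dominance rule used here (an agent is on the frontier if no cheaper-or-equal agent is at least as accurate) is an assumption; the plot's exact construction may differ.

```python
# Rows are (name, accuracy in percent, cost in USD), taken from the leaderboard.
LEADERBOARD = [
    ("o1-2024-12-17 med.", 54.0, 109.31),
    ("gpt-4.5-preview-2025-02-27", 46.0, 422.75),
    ("claude-3-7-sonnet-20250219", 46.0, 18.65),
    ("claude-3-5-sonnet-20241022", 44.0, 15.29),
    ("gpt-4o-2024-11-20", 44.0, 17.30),
    ("o3-mini-2025-01-31 med.", 38.0, 11.94),
    ("gpt-4o-mini-2024-07-18", 20.0, 2.71),
]


def pareto_frontier(rows):
    """Return the rows that are not dominated on (higher accuracy, lower cost).

    Assumed rule: walk the agents from cheapest to most expensive and keep
    an agent only if it is strictly more accurate than every cheaper one.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; break cost ties by preferring higher accuracy.
    for name, accuracy, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if accuracy > best_accuracy:
            frontier.append((name, accuracy, cost))
            best_accuracy = accuracy
    return frontier


if __name__ == "__main__":
    for name, accuracy, cost in pareto_frontier(LEADERBOARD):
        print(f"{name}: {accuracy:.2f}% at ${cost:.2f}")
```

Under this rule, gpt-4o-2024-11-20 and gpt-4.5-preview-2025-02-27 fall off the frontier because a cheaper agent matches or exceeds their accuracy.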

Heatmap for TAU-bench Airline

The heatmap visualizes success rates across tasks and agents. The color scale shows the fraction of runs in which a task was solved across reruns of the same agent. The "any agent" performance indicates the level of saturation of the benchmark and gives a sense of overall progress.
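The two quantities described here can be sketched as follows. The data layout, the function names, and the toy results are all hypothetical, and the sketch assumes that "any agent" counts a task as solved if at least one agent solved it in at least one run; the page does not spell out that definition.

```python
def success_rates(runs):
    """Per-task fraction of runs in which the task was solved.

    `runs` is a list of runs; each run is a list of booleans, one per task.
    """
    n_runs = len(runs)
    n_tasks = len(runs[0])
    return [sum(run[t] for run in runs) / n_runs for t in range(n_tasks)]


def any_agent_solved(results):
    """Per-task flag: True if any agent solved the task in any run.

    `results` maps an agent name to its list of runs (hypothetical layout).
    """
    rates = [success_rates(runs) for runs in results.values()]
    n_tasks = len(rates[0])
    return [any(agent_rates[t] > 0 for agent_rates in rates) for t in range(n_tasks)]


# Toy data, not real benchmark results: two agents, three tasks.
toy_results = {
    "agent_a": [[True, False, False], [True, True, False]],
    "agent_b": [[False, False, False]],
}

if __name__ == "__main__":
    solved = any_agent_solved(toy_results)
    print(success_rates(toy_results["agent_a"]))
    print(sum(solved) / len(solved))
```

A high "any agent" fraction would suggest that most tasks are solvable by some current agent, so the remaining headroom lies in reliability rather than in unsolved tasks.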

Token Pricing Configuration

Adjust token prices to see how they affect the total cost calculations in the leaderboard and plots.

Models with adjustable prices (USD per 1M tokens):

claude-3-5-sonnet-20241022
gpt-4o-2024-08-06
claude-3-7-sonnet-20250219
gpt-4.5-preview-2025-02-27
gpt-4o-2024-11-20
gpt-4o-mini-2024-07-18
o1-2024-12-17
o3-mini-2025-01-31
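A minimal sketch of how per-token prices might translate into a run's total cost is given below. It assumes that the two price fields per model are an input-token price and an output-token price, which the page does not state explicitly; the function name, token counts, and prices in the example are hypothetical.

```python
def run_cost(input_tokens, output_tokens, input_price, output_price):
    """Total cost in USD for one run.

    Prices are in USD per 1M tokens. Treating the two configurable prices
    as input and output prices is an assumption.
    """
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical token counts and prices, for illustration only.
    cost = run_cost(
        input_tokens=2_000_000,
        output_tokens=500_000,
        input_price=3.0,
        output_price=15.0,
    )
    print(f"${cost:.2f}")
```

Because cost scales linearly with each price, changing a model's prices shifts its position on the cost axis without affecting its accuracy.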

Additional Resources

Getting Started

Want to evaluate your agent on TAU-bench Airline? Follow our guide to get started:

View Documentation

Task Details

Browse the complete TAU-bench Airline tasks:

View Tasks