TAU-bench Retail

TAU-bench is a benchmark for Tool-Agent-User Interaction in Real-World Domains. TAU-bench Retail evaluates AI agents on tasks in the retail shopping domain, such as cancelling orders, changing addresses, and checking order status.

Paper: τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (Yao et al., 2024)

Tasks in Public Test Set: 115
Agents Evaluated: 7

TAU-bench Retail Leaderboard

Rank  Agent (Model)                Accuracy  Cost (USD)  Runs
1     claude-3-7-sonnet-20250219   72.17%    $45.89      1
2     o1-2024-12-17 med.           71.30%    $270.03     1
3     gpt-4.5-preview-2025-02-27   70.43%    $1135.22    1
4     claude-3-5-sonnet-20241022   68.70%    $41.32      1
5     gpt-4o-2024-11-20            62.61%    $42.95      1
6     o3-mini-2025-01-31 med.      51.30%    $28.66      1
7     gpt-4o-mini-2024-07-18       43.48%    $6.80       1

Accuracy vs. Cost Frontier for TAU-bench Retail

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
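As a rough illustration of how such a frontier can be derived, the sketch below takes the accuracy and cost figures from the leaderboard and keeps only the agents that no cheaper agent matches or beats on accuracy. It assumes a simple dominance-based definition of the Pareto frontier; the page's own plot may use a different construction (for example, a convex hull), and the function name pareto_frontier is hypothetical.

```python
# (agent, accuracy in %, cost in USD), copied from the leaderboard table.
AGENTS = [
    ("claude-3-7-sonnet-20250219", 72.17, 45.89),
    ("o1-2024-12-17 med.", 71.30, 270.03),
    ("gpt-4.5-preview-2025-02-27", 70.43, 1135.22),
    ("claude-3-5-sonnet-20241022", 68.70, 41.32),
    ("gpt-4o-2024-11-20", 62.61, 42.95),
    ("o3-mini-2025-01-31 med.", 51.30, 28.66),
    ("gpt-4o-mini-2024-07-18", 43.48, 6.80),
]


def pareto_frontier(agents):
    """Return the agents not dominated on (lower cost, higher accuracy).

    Assumption: an agent is on the frontier if every cheaper agent
    has strictly lower accuracy.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, try higher accuracy first.
    for name, accuracy, cost in sorted(agents, key=lambda a: (a[2], -a[1])):
        if accuracy > best_accuracy:
            frontier.append((name, accuracy, cost))
            best_accuracy = accuracy
    return frontier


for name, accuracy, cost in pareto_frontier(AGENTS):
    print(f"{name}: {accuracy:.2f}% at ${cost:.2f}")
```

Under this definition, gpt-4o-2024-11-20 falls off the frontier because claude-3-5-sonnet-20241022 is both cheaper and more accurate, and the two most expensive agents are dominated by claude-3-7-sonnet-20250219.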

Heatmap for TAU-bench Retail

The heatmap visualizes success rates across tasks and agents. The color scale shows the fraction of runs in which each task was solved across reruns of the same agent. The "any agent" performance indicates how saturated the benchmark is and gives a sense of overall progress.
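The quantities behind such a heatmap can be sketched as follows. The outcome data, the agent names, and the helper task_success_rates are all hypothetical, and the sketch assumes that "any agent" counts a task as solved when at least one agent solved it in at least one run; the page may define it differently.

```python
# Hypothetical outcomes: results[agent][run][task] is True when the task was solved.
results = {
    "agent_a": [[True, False, True, False], [True, True, True, False]],
    "agent_b": [[False, False, True, False]],
}


def task_success_rates(runs):
    """Fraction of runs in which each task was solved, for one agent."""
    n_tasks = len(runs[0])
    return [sum(run[t] for run in runs) / len(runs) for t in range(n_tasks)]


# One heatmap row per agent, one cell per task.
heatmap = {agent: task_success_rates(runs) for agent, runs in results.items()}

# Assumed "any agent" row: a task counts as solved if any agent ever solved it.
n_tasks = len(next(iter(heatmap.values())))
any_agent = [any(rates[t] > 0 for rates in heatmap.values()) for t in range(n_tasks)]

# Share of tasks solved by at least one agent, a rough saturation measure.
saturation = sum(any_agent) / n_tasks

print(heatmap)
print(any_agent, saturation)
```

A saturation value close to 1 would suggest that most tasks are already solvable by some agent, leaving little headroom in the benchmark.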

Token Pricing Configuration

Adjust token prices to see how they affect the total cost calculations in the leaderboard and plots.
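To make the effect of a price change concrete, here is a minimal sketch of a per-1M-token cost calculation. It assumes the two price fields for each model are an input price and an output price; the class TokenPrice, the function run_cost, and all numbers are hypothetical and are not taken from the leaderboard.

```python
from dataclasses import dataclass


@dataclass
class TokenPrice:
    """Prices in USD per 1M tokens (assumed split into input and output)."""
    input_per_million: float
    output_per_million: float


def run_cost(input_tokens, output_tokens, price):
    """Total cost in USD for the given token counts at the given prices."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


# Hypothetical prices and token counts, for illustration only.
price = TokenPrice(input_per_million=2.0, output_per_million=8.0)
cost = run_cost(3_000_000, 500_000, price)

# Halving both prices halves the total, since cost is linear in price.
cheaper = TokenPrice(input_per_million=1.0, output_per_million=4.0)
cheaper_cost = run_cost(3_000_000, 500_000, cheaper)

print(cost, cheaper_cost)
```

Because the cost is linear in each price, changing a model's prices rescales its position on the cost axis without affecting its accuracy.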

Each model below is active and has two configurable prices, in USD per 1M tokens:

claude-3-5-sonnet-20241022
gpt-4o-2024-08-06
claude-3-7-sonnet-20250219
gpt-4.5-preview-2025-02-27
gpt-4o-2024-11-20
gpt-4o-mini-2024-07-18
o1-2024-12-17
o3-mini-2025-01-31
