Benchmark: τ-bench (airline, clean)
τ-bench (airline, clean) evaluates tool-augmented conversational agents on realistic customer-service scenarios in an airline domain. It is a curated 26-task subset of the original 50-task benchmark that excludes tasks with identified grading or specification issues: incorrect answer keys (e.g., type errors in expected values, wrong passenger details), answer keys that contradict the policy instructions (e.g., cancelling flights that policy forbids cancelling, or issuing certificates without the required preconditions), ambiguous or underspecified task descriptions, and tasks referencing past dates that make the requested actions impossible. In each task, a simulated user brings a specific request, and the agent must converse with the user while calling backend API tools to resolve it. Tasks range in complexity from simple single-action lookups to multi-turn dialogues requiring policy adherence and disambiguation. All metrics are computed from scratch on this curated subset, giving a more reliable estimate of agent performance and reliability. This clean version is used in the main results and aggregate scores.
References:
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
- SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents
- GitHub Repository
Compare with original dataset (all 50 tasks) →
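For orientation, the sketch below shows one way a curated-subset accuracy like the "Acc" column could be computed from per-trial pass/fail records. It is a minimal illustration, not the official harness: the excluded task IDs and the result format are assumptions, not the actual curated list or the τ-bench data structures.

```python
# Minimal sketch (not the official harness): filter the original airline tasks
# down to a curated subset and score average pass rate over repeated trials.
# EXCLUDED_TASK_IDS and the per-trial record format are illustrative assumptions.
from collections import defaultdict

EXCLUDED_TASK_IDS = {3, 7, 11}  # placeholder IDs for tasks flagged with grading/spec issues

def curated_accuracy(trial_results):
    """trial_results: iterable of (task_id, success: bool) over one or more runs."""
    per_task = defaultdict(list)
    for task_id, success in trial_results:
        if task_id in EXCLUDED_TASK_IDS:
            continue  # drop tasks excluded from the clean subset
        per_task[task_id].append(success)
    # Average success rate per task, then average across the curated subset.
    task_rates = [sum(runs) / len(runs) for runs in per_task.values()]
    return sum(task_rates) / len(task_rates) if task_rates else 0.0

# Example: three curated tasks, two runs each.
results = [(1, True), (1, True), (2, True), (2, False), (4, False), (4, False)]
print(f"curated accuracy: {curated_accuracy(results):.1%}")  # -> 50.0%
```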
Reliability Trends
Agent Leaderboard
| # | Agent | Acc | Consistency: Agg | Consistency: Outc | Consistency: Traj-D | Consistency: Traj-S | Consistency: Res | Predictability: Agg | Predictability: Cal | Predictability: AUROC | Predictability: Brier | Robustness: Agg | Robustness: Fault | Robustness: Struct | Robustness: Prompt | Safety: Agg | Safety: Harm | Safety: Comp | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 83.1% | 0.82 | 0.77 | 0.88 | 0.79 | 0.85 | 0.87 | 0.93 | 0.68 | 0.87 | 0.94 | 0.97 | 0.93 | 0.93 | 0.99 | 0.50 | 0.98 | 0.88 |
| 2 | | 78.5% | 0.72 | 0.50 | 0.85 | 0.77 | 0.85 | 0.84 | 0.90 | 0.68 | 0.84 | 0.99 | 1.00 | 1.00 | 0.98 | 1.00 | 0.50 | 0.99 | 0.85 |
| 3 | | 80.8% | 0.76 | 0.65 | 0.85 | 0.76 | 0.82 | 0.81 | 0.82 | 0.52 | 0.81 | 0.98 | 1.00 | 1.00 | 0.95 | 0.98 | 0.25 | 0.97 | 0.85 |
| 4 | | 67.7% | 0.70 | 0.54 | 0.85 | 0.73 | 0.77 | 0.78 | 0.81 | 0.75 | 0.78 | 0.96 | 1.00 | 1.00 | 0.89 | 0.95 | 0.40 | 0.92 | 0.81 |
| 5 | | 72.3% | 0.73 | 0.58 | 0.87 | 0.75 | 0.80 | 0.75 | 0.77 | 0.45 | 0.75 | 0.93 | 0.95 | 1.00 | 0.85 | 0.93 | 0.50 | 0.86 | 0.80 |
| 6 | | 59.2% | 0.76 | 0.65 | 0.86 | 0.76 | 0.83 | 0.68 | 0.71 | 0.62 | 0.68 | 0.95 | 1.00 | 1.00 | 0.84 | 0.95 | 0.30 | 0.92 | 0.80 |
| 7 | | 73.8% | 0.68 | 0.46 | 0.84 | 0.73 | 0.81 | 0.77 | 0.77 | 0.70 | 0.77 | 0.93 | 0.98 | 0.83 | 0.97 | 0.93 | 0.40 | 0.88 | 0.79 |
| 8 | | 56.2% | 0.65 | 0.35 | 0.84 | 0.74 | 0.81 | 0.63 | 0.65 | 0.48 | 0.63 | 0.97 | 1.00 | 1.00 | 0.91 | 0.90 | 0.46 | 0.82 | 0.75 |
| 9 | | 65.4% | 0.62 | 0.31 | 0.85 | 0.72 | 0.77 | 0.61 | 0.62 | 0.53 | 0.61 | 0.95 | 1.00 | 1.00 | 0.86 | 0.93 | 0.37 | 0.88 | 0.73 |
| 10 | | 50.0% | 0.72 | 0.54 | 0.85 | 0.73 | 0.83 | 0.51 | 0.52 | 0.45 | 0.51 | 0.92 | 0.95 | 0.85 | 0.95 | 0.87 | 0.43 | 0.78 | 0.71 |
| 11 | | 44.6% | 0.65 | 0.35 | 0.89 | 0.73 | 0.80 | 0.48 | 0.46 | 0.56 | 0.48 | 0.98 | 0.98 | 1.00 | 0.98 | 0.87 | 0.41 | 0.78 | 0.70 |
| 12 | | 32.1% | 0.76 | 0.69 | 0.84 | 0.73 | 0.80 | 0.41 | 0.39 | 0.48 | 0.41 | 0.91 | 1.00 | 0.72 | 1.00 | 0.81 | 0.40 | 0.69 | 0.69 |
| 13 | | 42.3% | 0.66 | 0.42 | 0.82 | 0.70 | 0.80 | 0.53 | 0.53 | 0.42 | 0.53 | 0.85 | 0.81 | 1.00 | 0.74 | 0.81 | 0.42 | 0.67 | 0.68 |
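The Predictability columns report calibration-related measures (Cal, AUROC, Brier). Assuming these follow the standard definitions over per-task self-reported confidence versus actual success, they could be computed roughly as sketched below; the leaderboard's exact formulas and aggregation are not given here, and since the columns appear to be higher-is-better, a complement or rescaling of the raw Brier loss may be applied.

```python
# Hedged sketch of standard predictability metrics, assuming each task run yields
# a self-reported confidence in [0, 1] and a binary success label. The exact
# definitions and aggregation used by the leaderboard may differ.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def predictability_metrics(confidences, successes, n_bins=10):
    conf = np.asarray(confidences, dtype=float)
    succ = np.asarray(successes, dtype=int)

    auroc = roc_auc_score(succ, conf)     # how well confidences rank successes above failures
    brier = brier_score_loss(succ, conf)  # mean squared error of confidences (lower is better)

    # Expected calibration error (ECE) over equal-width confidence bins;
    # reporting 1 - ECE as a "Cal" score is an assumption, not the leaderboard's formula.
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - succ[mask].mean())
    return {"Cal": 1.0 - ece, "AUROC": auroc, "Brier": brier}

# Example with a handful of runs.
print(predictability_metrics([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))
```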