Benchmark: τ-bench (airline, original)
τ-bench (airline, original) is the full 50-task version of the τ-bench airline benchmark. It evaluates tool-augmented conversational agents on realistic customer-service scenarios in an airline domain. Each task presents a simulated user with a specific request, such as rebooking a flight, changing a seat, or processing a refund, and the agent must converse with the user while calling backend API tools (e.g., flight search, booking modification) to resolve it. Note that 24 of the 50 tasks have known issues in grading or task specification (e.g., incorrect answer keys, policy contradictions, ambiguous descriptions). For a cleaner evaluation, see τ-bench (airline), which excludes these problematic tasks.
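To make the interaction pattern concrete, here is a minimal sketch of the tool-agent-user loop in Python. Everything in it is a schematic stand-in, not the actual τ-bench code: `AirlineDB`, `change_seat`, `run_episode`, `toy_agent`, and `toy_user` are hypothetical names, and grading is reduced to an assertion on the final backend state (τ-bench grades tasks by comparing the final database state against the task's expected outcome).

```python
"""Minimal sketch of a tau-bench-style episode (illustrative, not the real API).

The agent alternates between replying to a simulated user and calling backend
airline tools; the final backend state is what gets graded.
"""
from dataclasses import dataclass, field

@dataclass
class AirlineDB:
    """Toy stand-in for the backend database that the tools mutate."""
    bookings: dict = field(default_factory=lambda: {"BK1": {"seat": "12A"}})

def change_seat(db: AirlineDB, booking_id: str, seat: str) -> str:
    db.bookings[booking_id]["seat"] = seat
    return f"Seat on {booking_id} changed to {seat}."

# Real tasks expose many more tools (flight search, booking modification, refunds).
TOOLS = {"change_seat": change_seat}

def run_episode(agent_step, user_reply, db: AirlineDB, max_turns: int = 30) -> AirlineDB:
    """Alternate user messages and agent actions until the user is satisfied."""
    message = user_reply(None)  # simulated user opens with the task request
    for _ in range(max_turns):
        action = agent_step(message)  # either {"tool": ..., "args": ...} or {"say": ...}
        if "tool" in action:
            message = TOOLS[action["tool"]](db, **action["args"])  # tool observation
        else:
            message = user_reply(action["say"])
            if message == "###STOP###":  # sentinel: the user considers the request resolved
                break
    return db  # grading compares this final state against the expected outcome

# Scripted toy run: the user wants booking BK1 moved to seat 14C.
def toy_user(agent_utterance):
    return "Please move booking BK1 to seat 14C." if agent_utterance is None else "###STOP###"

def toy_agent(message):
    if message.startswith("Please move"):
        return {"tool": "change_seat", "args": {"booking_id": "BK1", "seat": "14C"}}
    return {"say": "Done! Anything else I can help with?"}

final_db = run_episode(toy_agent, toy_user, AirlineDB())
assert final_db.bookings["BK1"]["seat"] == "14C"
```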
References
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
- SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents
- GitHub Repository
Compare with the curated subset (26 tasks remaining after the 24 with grading/specification issues are removed) →
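For anyone reproducing the curated split, the filtering step itself is trivial; the sketch below uses a placeholder `EXCLUDED_TASK_IDS` set, since the actual 24 problematic task IDs are documented with the curated subset, not invented here.

```python
# Hypothetical sketch of deriving the curated subset from the original 50 tasks.
# EXCLUDED_TASK_IDS is a placeholder: the real list of 24 problematic task IDs
# is documented alongside the curated benchmark.
EXCLUDED_TASK_IDS = {3, 7, 11}  # placeholder values; 24 IDs in the real list

def curated_subset(tasks: list[dict]) -> list[dict]:
    """Drop tasks with known grading or specification issues."""
    return [t for t in tasks if t["id"] not in EXCLUDED_TASK_IDS]

all_tasks = [{"id": i} for i in range(50)]  # the original 50 airline tasks
clean_tasks = curated_subset(all_tasks)     # 26 tasks remain with the real ID list
```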
[Chart: Reliability Trends]
Agent Leaderboard
| # | Agent | Acc | Cons. Agg | Cons. Outc | Cons. Traj-D | Cons. Traj-S | Cons. Res | Pred. Agg | Pred. Cal | Pred. AUROC | Pred. Brier | Rob. Agg | Rob. Fault | Rob. Struct | Rob. Prompt | Safety Agg | Safety Harm | Safety Comp | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 58.4% | 0.79 | 0.68 | 0.88 | 0.78 | 0.85 | 0.69 | 0.71 | 0.67 | 0.69 | 0.93 | 0.95 | 0.96 | 0.89 | 0.97 | 0.46 | 0.94 | 0.80 |
| 2 | | 54.4% | 0.73 | 0.54 | 0.86 | 0.77 | 0.85 | 0.66 | 0.68 | 0.61 | 0.66 | 0.98 | 0.97 | 1.00 | 0.97 | 0.97 | 0.50 | 0.94 | 0.79 |
| 3 | | 58.8% | 0.76 | 0.62 | 0.86 | 0.77 | 0.84 | 0.60 | 0.60 | 0.52 | 0.60 | 0.94 | 1.00 | 0.92 | 0.91 | 0.96 | 0.21 | 0.95 | 0.77 |
| 4 | | 49.6% | 0.69 | 0.48 | 0.86 | 0.75 | 0.80 | 0.64 | 0.74 | 0.50 | 0.64 | 0.94 | 0.95 | 1.00 | 0.86 | 0.89 | 0.40 | 0.81 | 0.76 |
| 5 | | 51.6% | 0.67 | 0.44 | 0.85 | 0.76 | 0.76 | 0.65 | 0.68 | 0.62 | 0.65 | 0.95 | 1.00 | 1.00 | 0.84 | 0.94 | 0.43 | 0.89 | 0.76 |
| 6 | | 42.0% | 0.72 | 0.52 | 0.86 | 0.76 | 0.82 | 0.55 | 0.55 | 0.56 | 0.55 | 0.97 | 1.00 | 1.00 | 0.92 | 0.93 | 0.36 | 0.89 | 0.75 |
| 7 | | 52.8% | 0.69 | 0.48 | 0.84 | 0.72 | 0.80 | 0.56 | 0.56 | 0.58 | 0.56 | 0.96 | 0.97 | 0.98 | 0.92 | 0.89 | 0.40 | 0.81 | 0.74 |
| 8 | | 43.6% | 0.67 | 0.40 | 0.82 | 0.72 | 0.82 | 0.54 | 0.54 | 0.53 | 0.54 | 0.96 | 1.00 | 1.00 | 0.89 | 0.88 | 0.45 | 0.79 | 0.72 |
| 9 | | 47.2% | 0.64 | 0.38 | 0.85 | 0.70 | 0.76 | 0.52 | 0.52 | 0.54 | 0.52 | 0.95 | 1.00 | 0.97 | 0.88 | 0.89 | 0.39 | 0.82 | 0.70 |
| 10 | | 35.6% | 0.72 | 0.52 | 0.85 | 0.73 | 0.84 | 0.38 | 0.38 | 0.47 | 0.38 | 0.98 | 0.98 | 0.96 | 1.00 | 0.85 | 0.44 | 0.72 | 0.69 |
| 11 | | 29.6% | 0.70 | 0.54 | 0.83 | 0.71 | 0.80 | 0.46 | 0.45 | 0.44 | 0.46 | 0.88 | 0.86 | 1.00 | 0.77 | 0.77 | 0.41 | 0.61 | 0.68 |
| 12 | | 32.0% | 0.68 | 0.44 | 0.87 | 0.73 | 0.81 | 0.38 | 0.36 | 0.61 | 0.38 | 0.94 | 0.98 | 1.00 | 0.85 | 0.82 | 0.37 | 0.72 | 0.67 |
| 13 | | 21.3% | 0.76 | 0.72 | 0.83 | 0.72 | 0.80 | 0.32 | 0.29 | 0.48 | 0.32 | 0.92 | 1.00 | 0.84 | 0.91 | 0.76 | 0.41 | 0.59 | 0.67 |
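The Predictability sub-columns (Cal, AUROC, Brier) measure how well an agent's self-reported confidence predicts whether a task actually passed. AUROC and the Brier score are the standard quantities for this; the sketch below computes both from toy per-task data with scikit-learn. Note that the leaderboard reports higher-is-better values throughout, so its Brier column is presumably a rescaled score (e.g., 1 - Brier); that mapping, like the exact Cal and Agg formulas, is not specified here and is only an assumption.

```python
# Sketch: standard predictability metrics over per-task confidence/outcome pairs.
# This illustrates the raw quantities only, not the leaderboard's exact pipeline.
from sklearn.metrics import brier_score_loss, roc_auc_score

confidences = [0.9, 0.8, 0.3, 0.7, 0.2]  # agent's self-reported P(task passed), toy values
outcomes    = [1,   1,   0,   1,   0]    # graded results: 1 = pass, 0 = fail

auroc = roc_auc_score(outcomes, confidences)     # 1.0 = confidence ranks passes perfectly
brier = brier_score_loss(outcomes, confidences)  # mean squared error of confidence; 0 is best
print(f"AUROC = {auroc:.2f}, Brier = {brier:.2f}, 1 - Brier = {1 - brier:.2f}")
```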