Robustness on τ-bench (airline, original)

How well does the agent maintain accuracy when its inputs are perturbed (injected faults, structural changes, or prompt rewording)?

[Charts: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
Rank  Agent              Accuracy  Reliability  Robustness (Agg)  Fault  Struct  Prompt
   1  Claude Sonnet 4.5     54.4%         0.79              0.98   0.97    1.00    0.97
   2  GPT-4 Turbo           35.6%         0.69              0.98   0.98    0.96    1.00
   3  GPT-5.2               42.0%         0.75              0.97   1.00    1.00    0.92
   4  Claude 3.7 Sonnet     43.6%         0.72              0.96   1.00    1.00    0.89
   5  Gemini 2.5 Pro        52.8%         0.74              0.96   0.97    0.98    0.92
   6  Gemini 2.5 Flash      47.2%         0.70              0.95   1.00    0.97    0.88
   7  GPT-5.2 (xhigh)       51.6%         0.76              0.95   1.00    1.00    0.84
   8  Gemini 2.0 Flash      32.0%         0.67              0.94   0.98    1.00    0.85
   9  Gemini 3.0 Pro        58.8%         0.77              0.94   1.00    0.92    0.91
  10  O1                    49.6%         0.76              0.94   0.95    1.00    0.86
  11  Claude Opus 4.5       58.4%         0.80              0.93   0.95    0.96    0.89
  12  GPT-4o Mini           21.3%         0.67              0.92   1.00    0.84    0.91
  13  Claude 3.5 Haiku      29.6%         0.68              0.88   0.86    1.00    0.77
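Across every row, the aggregate robustness column is consistent with the unweighted mean of the three perturbation sub-scores (Fault, Struct, Prompt), rounded to two decimals. A minimal sketch of that aggregation; the function name is illustrative, not part of τ-bench:

```python
# Aggregate robustness as the unweighted mean of the three perturbation
# sub-scores (fault injection, structural change, prompt rewording).
# Name and signature are illustrative, not taken from the benchmark code.

def robustness_agg(fault: float, struct: float, prompt: float) -> float:
    """Mean of the three sub-scores, rounded to two decimals."""
    return round((fault + struct + prompt) / 3, 2)

# Spot-check against two leaderboard rows:
print(robustness_agg(0.97, 1.00, 0.97))  # Claude Sonnet 4.5 -> 0.98
print(robustness_agg(0.86, 1.00, 0.77))  # Claude 3.5 Haiku  -> 0.88
```

Note that an unweighted mean treats all three perturbation types as equally important; whether the benchmark actually weights them this way is inferred from the table, not documented here.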