Robustness on τ-bench (airline, original)

How well does the agent maintain accuracy when its inputs are perturbed? Three perturbation types are scored: injected faults (Fault), structural changes (Struct), and prompt rewording (Prompt).
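For every row in the leaderboard below, the aggregate robustness score matches the unweighted mean of the three sub-metrics to two decimals. The page does not show how each sub-metric itself is computed; a plausible reading is the fraction of baseline accuracy retained under that perturbation type. The sketch below follows that reading; the accuracy values, the ratio formula, and the cap at 1.0 are illustrative assumptions, not the leaderboard's published method.

```python
from statistics import mean

# Illustrative numbers only: a real harness would measure these by re-running
# the τ-bench airline tasks under each perturbation type.
baseline_acc = 0.544          # accuracy on unperturbed tasks

perturbed_acc = {
    "fault": 0.528,   # accuracy with injected faults
    "struct": 0.544,  # accuracy with structural changes to inputs
    "prompt": 0.528,  # accuracy with reworded prompts
}

# Assumed sub-metric: fraction of baseline accuracy retained under each
# perturbation, capped at 1.0 to match the table's maximum.
robustness = {
    kind: min(1.0, acc / baseline_acc)
    for kind, acc in perturbed_acc.items()
}

# The "Robustness (Agg)" column is consistent with an unweighted mean
# of the three sub-metrics for every row in the table.
robustness_agg = mean(robustness.values())

print({k: round(v, 2) for k, v in robustness.items()})  # {'fault': 0.97, 'struct': 1.0, 'prompt': 0.97}
print(round(robustness_agg, 2))                         # 0.98
```

Under this reading, a sub-metric of 1.00 means the agent lost no accuracy under that perturbation type.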

[Charts: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
| # | Agent | Baseline Acc | Robustness (Agg) | Fault | Struct | Prompt | Overall |
|---|-------|--------------|------------------|-------|--------|--------|---------|
| 1 | Claude Sonnet 4.5 | 54.4% | 0.98 | 0.97 | 1.00 | 0.97 | 0.79 |
| 2 | GPT-4 Turbo | 35.6% | 0.98 | 0.98 | 0.96 | 1.00 | 0.69 |
| 3 | GPT-5.2 | 42.0% | 0.97 | 1.00 | 1.00 | 0.92 | 0.75 |
| 4 | Claude 3.7 Sonnet | 43.6% | 0.96 | 1.00 | 1.00 | 0.89 | 0.72 |
| 5 | Gemini 2.5 Pro | 52.8% | 0.96 | 0.97 | 0.98 | 0.92 | 0.74 |
| 6 | Gemini 2.5 Flash | 47.2% | 0.95 | 1.00 | 0.97 | 0.88 | 0.70 |
| 7 | GPT-5.2 (xhigh) | 51.6% | 0.95 | 1.00 | 1.00 | 0.84 | 0.76 |
| 8 | Gemini 2.0 Flash | 32.0% | 0.94 | 0.98 | 1.00 | 0.85 | 0.67 |
| 9 | Gemini 3.0 Pro | 58.8% | 0.94 | 1.00 | 0.92 | 0.91 | 0.77 |
| 10 | O1 | 49.6% | 0.94 | 0.95 | 1.00 | 0.86 | 0.76 |
| 11 | Claude Opus 4.5 | 58.4% | 0.93 | 0.95 | 0.96 | 0.89 | 0.80 |
| 12 | GPT-4o Mini | 21.3% | 0.92 | 1.00 | 0.84 | 0.91 | 0.67 |
| 13 | Claude 3.5 Haiku | 29.6% | 0.88 | 0.86 | 1.00 | 0.77 | 0.68 |