Robustness on τ-bench (airline, clean)

How well does the agent maintain accuracy when its inputs are perturbed (fault injections, structural changes, prompt rewording)?

[Charts: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
| Rank | Agent | Acc | Robustness (Agg) | Fault | Struct | Prompt | Overall |
|------|-------|-----|------------------|-------|--------|--------|---------|
| 1 | Claude Sonnet 4.5 | 78.5% | 0.99 | 1.00 | 1.00 | 0.98 | 0.85 |
| 2 | Gemini 2.0 Flash | 44.6% | 0.98 | 0.98 | 1.00 | 0.98 | 0.70 |
| 3 | Gemini 3.0 Pro | 80.8% | 0.98 | 1.00 | 1.00 | 0.95 | 0.85 |
| 4 | Claude 3.7 Sonnet | 56.2% | 0.97 | 1.00 | 1.00 | 0.91 | 0.75 |
| 5 | GPT-5.2 (xhigh) | 67.7% | 0.96 | 1.00 | 1.00 | 0.89 | 0.81 |
| 6 | Gemini 2.5 Flash | 65.4% | 0.95 | 1.00 | 1.00 | 0.86 | 0.73 |
| 7 | GPT-5.2 | 59.2% | 0.95 | 1.00 | 1.00 | 0.84 | 0.80 |
| 8 | Claude Opus 4.5 | 83.1% | 0.94 | 0.97 | 0.93 | 0.93 | 0.88 |
| 9 | O1 | 72.3% | 0.93 | 0.95 | 1.00 | 0.85 | 0.80 |
| 10 | Gemini 2.5 Pro | 73.8% | 0.93 | 0.98 | 0.83 | 0.97 | 0.79 |
| 11 | GPT-4 Turbo | 50.0% | 0.92 | 0.95 | 0.85 | 0.95 | 0.71 |
| 12 | GPT-4o Mini | 32.1% | 0.91 | 1.00 | 0.72 | 1.00 | 0.69 |
| 13 | Claude 3.5 Haiku | 42.3% | 0.85 | 0.81 | 1.00 | 0.74 | 0.68 |
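The aggregate column appears to be a simple mean of the three per-perturbation sub-scores (Fault, Struct, Prompt), rounded to two decimals; this formula is an inference from the table values, not a stated definition. A minimal sketch under that assumption:

```python
def robustness_agg(fault: float, struct: float, prompt: float) -> float:
    """Aggregate robustness as the mean of the three sub-scores.

    NOTE: this formula is inferred from the leaderboard values
    (it reproduces every row above), not an official definition.
    """
    return round((fault + struct + prompt) / 3, 2)

# Example rows from the table:
print(robustness_agg(1.00, 1.00, 0.98))  # Claude Sonnet 4.5 -> 0.99
print(robustness_agg(0.97, 0.93, 0.93))  # Claude Opus 4.5   -> 0.94
print(robustness_agg(0.81, 1.00, 0.74))  # Claude 3.5 Haiku  -> 0.85
```

Note that the aggregate is independent of baseline accuracy: a low-accuracy agent (e.g. GPT-4o Mini at 32.1%) can still score a high robustness aggregate if its accuracy degrades little under perturbation.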