Robustness on τ-bench (airline, clean)

How well does the agent maintain accuracy when its inputs are perturbed (fault injection, structural changes, prompt rewording)?
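One natural way to score this, sketched below as an assumption (the page's full metric definition is not reproduced here), is the fraction of baseline accuracy an agent retains under a given perturbation, capped at 1.0; the function name `retained_accuracy` and the zero-baseline convention are hypothetical choices for illustration:

```python
def retained_accuracy(baseline_acc: float, perturbed_acc: float) -> float:
    """Fraction of clean-run accuracy retained under a perturbation, capped at 1.0.

    Hypothetical sketch; the benchmark's actual definition may differ.
    """
    if baseline_acc == 0:
        # Degenerate case (nothing to lose): treat as fully robust by convention.
        return 1.0
    return min(perturbed_acc / baseline_acc, 1.0)

# Example: an agent at 80% clean accuracy that drops to 76% under fault
# injection retains 95% of its baseline performance.
print(round(retained_accuracy(0.80, 0.76), 2))
```

A ratio (rather than an absolute accuracy delta) keeps the score comparable across agents with very different baseline accuracies, which matters on a leaderboard where baselines range from roughly 32% to 83%.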

[Charts omitted: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
| Rank | Agent | Baseline Acc | Reliability | Robustness (Agg) | Fault | Struct | Prompt |
|------|-------|--------------|-------------|------------------|-------|--------|--------|
| 1 | Claude Sonnet 4.5 | 78.5% | 0.85 | 0.99 | 1.00 | 1.00 | 0.98 |
| 2 | Gemini 2.0 Flash | 44.6% | 0.70 | 0.98 | 0.98 | 1.00 | 0.98 |
| 3 | Gemini 3.0 Pro | 80.8% | 0.85 | 0.98 | 1.00 | 1.00 | 0.95 |
| 4 | Claude 3.7 Sonnet | 56.2% | 0.75 | 0.97 | 1.00 | 1.00 | 0.91 |
| 5 | GPT-5.2 (xhigh) | 67.7% | 0.81 | 0.96 | 1.00 | 1.00 | 0.89 |
| 6 | Gemini 2.5 Flash | 65.4% | 0.73 | 0.95 | 1.00 | 1.00 | 0.86 |
| 7 | GPT-5.2 | 59.2% | 0.80 | 0.95 | 1.00 | 1.00 | 0.84 |
| 8 | Claude Opus 4.5 | 83.1% | 0.88 | 0.94 | 0.97 | 0.93 | 0.93 |
| 9 | O1 | 72.3% | 0.80 | 0.93 | 0.95 | 1.00 | 0.85 |
| 10 | Gemini 2.5 Pro | 73.8% | 0.79 | 0.93 | 0.98 | 0.83 | 0.97 |
| 11 | GPT-4 Turbo | 50.0% | 0.71 | 0.92 | 0.95 | 0.85 | 0.95 |
| 12 | GPT-4o Mini | 32.1% | 0.69 | 0.91 | 1.00 | 0.72 | 1.00 |
| 13 | Claude 3.5 Haiku | 42.3% | 0.68 | 0.85 | 0.81 | 1.00 | 0.74 |
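The aggregate robustness column is consistent with a simple mean of the three perturbation sub-scores: for every row above, averaging Fault, Struct, and Prompt and rounding to two decimals reproduces the Agg value (e.g. Claude Opus 4.5: mean(0.97, 0.93, 0.93) ≈ 0.94). A minimal sketch, assuming that unweighted-mean aggregation:

```python
def robustness_agg(fault: float, struct: float, prompt: float) -> float:
    """Aggregate robustness as the unweighted mean of the three sub-scores.

    Assumption inferred from the leaderboard rows; the benchmark may weight
    perturbation types differently.
    """
    return (fault + struct + prompt) / 3

# Spot-check against two rows of the table:
print(round(robustness_agg(0.97, 0.93, 0.93), 2))  # Claude Opus 4.5 -> 0.94
print(round(robustness_agg(0.81, 1.00, 0.74), 2))  # Claude 3.5 Haiku -> 0.85
```

Note that the aggregate alone can hide which perturbation type hurts: GPT-4o Mini and GPT-4 Turbo have nearly identical Agg scores (0.91 vs 0.92), but the former fails almost entirely on structural changes (0.72) while the latter degrades more evenly.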