Robustness on τ-bench (airline, clean)

How well does the agent maintain accuracy when inputs are perturbed (faults, structural changes, prompt rewording)?
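One way to make "maintaining accuracy under perturbation" concrete is as the fraction of baseline accuracy that survives the perturbation. The sketch below is an assumption, not the benchmark's published definition: the function name `robustness_ratio`, the cap at 1.0, and the handling of a zero baseline are all hypothetical choices made for illustration.

```python
def robustness_ratio(baseline_acc: float, perturbed_acc: float) -> float:
    """Hypothetical robustness score: share of baseline accuracy retained.

    Both inputs are accuracies in [0, 1]. The result is capped at 1.0 so an
    agent that happens to do better under perturbation is not rewarded.
    This is an illustrative assumption, not the leaderboard's formula.
    """
    for name, value in (("baseline_acc", baseline_acc), ("perturbed_acc", perturbed_acc)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if baseline_acc == 0.0:
        # Assumed convention: with nothing to lose, the agent is fully robust.
        return 1.0
    return min(1.0, perturbed_acc / baseline_acc)


# Made-up numbers, not taken from the leaderboard:
# an agent dropping from 80% to 72% accuracy keeps about nine tenths of it.
print(round(robustness_ratio(0.80, 0.72), 2))
```

Under this reading, a weak agent can still score near 1.0: the score measures how little accuracy is lost, not how high the accuracy is.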

[Charts: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
#   Agent              Acc     Reliability   Robustness (Agg)   Fault   Struct   Prompt
1   Claude 3 Haiku     20.5%   0.70          0.98               1.00    0.94     1.00
2   O1                 66.2%   0.81          0.98               1.00    1.00     0.93
3   Gemini 3.5 Flash   80.8%   0.86          0.97               1.00    1.00     0.92
4   GPT-5.5            79.5%   0.89          0.97               1.00    0.92     1.00
5   Claude Opus 4.5    80.8%   0.88          0.97               1.00    0.98     0.92
6   Claude Sonnet 4    78.2%   0.86          0.96               0.99    0.98     0.92
7   GPT-5.2 (medium)   67.9%   0.82          0.95               1.00    0.85     1.00
8   Gemini 3.1 Pro     82.1%   0.86          0.94               0.98    1.00     0.84
9   Gemini 2.5 Pro     71.8%   0.81          0.93               1.00    1.00     0.79
10  Claude 3.5 Haiku   29.5%   0.71          0.93               1.00    0.78     1.00
11  GPT-4o Mini        29.5%   0.67          0.93               1.00    0.91     0.87
12  GPT-5.2            60.3%   0.81          0.92               1.00    0.89     0.87
13  Claude Opus 4.7    84.6%   0.89          0.91               0.91    1.00     0.83
14  Gemini 2.5 Flash   59.0%   0.72          0.82               0.78    0.85     0.83
15  GPT-4 Turbo        57.7%   0.72          0.81               0.83    0.73     0.87
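
The page does not say how the aggregate robustness score (Agg) is derived from the Fault, Struct, and Prompt sub-metrics. The sketch below tests one hypothesis, that Agg is their unweighted mean, against the values copied from the leaderboard above. The helper name `max_agg_gap` is hypothetical, and agreement within rounding is evidence for the hypothesis, not confirmation of the site's formula.

```python
from statistics import mean

# (agent, agg, fault, struct, prompt), copied from the leaderboard rows.
ROWS = [
    ("Claude 3 Haiku",   0.98, 1.00, 0.94, 1.00),
    ("O1",               0.98, 1.00, 1.00, 0.93),
    ("Gemini 3.5 Flash", 0.97, 1.00, 1.00, 0.92),
    ("GPT-5.5",          0.97, 1.00, 0.92, 1.00),
    ("Claude Opus 4.5",  0.97, 1.00, 0.98, 0.92),
    ("Claude Sonnet 4",  0.96, 0.99, 0.98, 0.92),
    ("GPT-5.2 (medium)", 0.95, 1.00, 0.85, 1.00),
    ("Gemini 3.1 Pro",   0.94, 0.98, 1.00, 0.84),
    ("Gemini 2.5 Pro",   0.93, 1.00, 1.00, 0.79),
    ("Claude 3.5 Haiku", 0.93, 1.00, 0.78, 1.00),
    ("GPT-4o Mini",      0.93, 1.00, 0.91, 0.87),
    ("GPT-5.2",          0.92, 1.00, 0.89, 0.87),
    ("Claude Opus 4.7",  0.91, 0.91, 1.00, 0.83),
    ("Gemini 2.5 Flash", 0.82, 0.78, 0.85, 0.83),
    ("GPT-4 Turbo",      0.81, 0.83, 0.73, 0.87),
]


def max_agg_gap(rows):
    """Largest |mean(fault, struct, prompt) - agg| over all rows."""
    return max(abs(mean((fault, struct, prompt)) - agg)
               for _, agg, fault, struct, prompt in rows)


# A gap below 0.005 means every reported Agg matches the mean of its
# three sub-metrics after rounding to two decimals.
print(max_agg_gap(ROWS) < 0.005)
```

If the hypothesis holds, the three perturbation types are weighted equally, so a single weak sub-metric (for example Struct 0.73 for GPT-4 Turbo) lowers the aggregate by one third of its shortfall.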