Robustness on GAIA

How well does the agent maintain accuracy when inputs are perturbed (faults, structural changes, prompt rewording)?
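One common way to turn this question into a number is an accuracy-retention ratio: perturbed accuracy divided by baseline accuracy, capped at 1. The page does not state its formula here, so the sketch below is an assumption, and `retention_ratio` is a hypothetical helper name rather than anything from the benchmark.

```python
def retention_ratio(baseline_acc: float, perturbed_acc: float) -> float:
    """Fraction of baseline accuracy retained under perturbation, capped at 1.0.

    Assumption: this ratio form is illustrative only; the leaderboard's
    actual definition is not given in the text.
    """
    for value in (baseline_acc, perturbed_acc):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must lie in [0, 1]")
    if baseline_acc == 0.0:
        # With a zero baseline there is no accuracy to lose.
        return 1.0
    return min(1.0, perturbed_acc / baseline_acc)
```

Under this reading, a score of 1.00 means the perturbation cost the agent no accuracy, and a score of 0.60 means it kept 60% of its baseline accuracy.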

[Charts: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
#   Agent              Acc    Reliability  Robustness Agg  Fault  Struct  Prompt
1   Gemini 2.5 Flash   37.8%  0.76         0.98            1.00   1.00    0.93
2   Gemini 2.5 Pro     50.1%  0.78         0.97            1.00   1.00    0.90
3   Claude Opus 4.5    71.5%  0.82         0.96            1.00   1.00    0.89
4   Claude Sonnet 4.5  74.7%  0.80         0.95            0.99   0.93    0.94
5   GPT-5.2 (medium)   42.6%  0.74         0.95            0.97   1.00    0.88
6   Claude 3.7 Sonnet  62.4%  0.77         0.92            0.93   0.95    0.87
7   GPT-5.2            29.9%  0.74         0.88            1.00   0.94    0.70
8   GPT-4o Mini        22.0%  0.73         0.87            0.81   0.97    0.84
9   GPT-4 Turbo        20.0%  0.76         0.81            0.87   0.76    0.82
10  Claude 3.5 Haiku   28.1%  0.74         0.79            0.96   0.78    0.63
11  Gemini 2.0 Flash   27.9%  0.70         0.76            0.88   0.67    0.73
12  O1                 34.7%  0.72         0.72            0.77   0.78    0.60
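
The Robustness Agg column looks consistent with an unweighted mean of the Fault, Struct, and Prompt scores. The page does not state this, so treat it as an observation rather than the benchmark's definition. The sketch below checks it against the table values; `ROWS`, `mean_of_submetrics`, and `max_deviation` are names introduced here for illustration.

```python
# Rows copied from the leaderboard: (agent, robustness_agg, fault, struct, prompt).
ROWS = [
    ("Gemini 2.5 Flash", 0.98, 1.00, 1.00, 0.93),
    ("Gemini 2.5 Pro", 0.97, 1.00, 1.00, 0.90),
    ("Claude Opus 4.5", 0.96, 1.00, 1.00, 0.89),
    ("Claude Sonnet 4.5", 0.95, 0.99, 0.93, 0.94),
    ("GPT-5.2 (medium)", 0.95, 0.97, 1.00, 0.88),
    ("Claude 3.7 Sonnet", 0.92, 0.93, 0.95, 0.87),
    ("GPT-5.2", 0.88, 1.00, 0.94, 0.70),
    ("GPT-4o Mini", 0.87, 0.81, 0.97, 0.84),
    ("GPT-4 Turbo", 0.81, 0.87, 0.76, 0.82),
    ("Claude 3.5 Haiku", 0.79, 0.96, 0.78, 0.63),
    ("Gemini 2.0 Flash", 0.76, 0.88, 0.67, 0.73),
    ("O1", 0.72, 0.77, 0.78, 0.60),
]


def mean_of_submetrics(fault: float, struct: float, prompt: float) -> float:
    """Unweighted mean of the three sub-metrics (assumed aggregation rule)."""
    return (fault + struct + prompt) / 3


def max_deviation(rows) -> float:
    """Largest absolute gap between the listed aggregate and the computed mean."""
    return max(
        abs(agg - mean_of_submetrics(fault, struct, prompt))
        for _, agg, fault, struct, prompt in rows
    )


if __name__ == "__main__":
    for agent, agg, fault, struct, prompt in ROWS:
        mean = mean_of_submetrics(fault, struct, prompt)
        print(f"{agent:<18} listed={agg:.2f} mean={mean:.3f}")
```

Every row agrees with the computed mean to within 0.01, which is what two-decimal rounding of the displayed values would allow. This does not rule out a different aggregation rule applied to unrounded scores.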