Robustness on GAIA

How well does the agent maintain accuracy when inputs are perturbed (faults, structural changes, prompt rewording)?
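As a rough illustration of what a score of this kind could look like, the sketch below computes the fraction of baseline accuracy that survives perturbation. This is an assumption made for illustration, not the leaderboard's documented formula: the function name `robustness_ratio`, the cap at 1.0, and the zero-baseline convention are all hypothetical.

```python
def robustness_ratio(baseline_acc: float, perturbed_acc: float) -> float:
    """Hypothetical robustness score: share of baseline accuracy retained.

    Assumption (not from the leaderboard): score = perturbed / baseline,
    capped at 1.0 so that an agent which happens to do better under
    perturbation is not scored above "fully robust".
    """
    for value in (baseline_acc, perturbed_acc):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must be fractions in [0, 1]")
    if baseline_acc == 0.0:
        # Convention chosen for this sketch: with no baseline accuracy
        # there is nothing to lose, so report 1.0.
        return 1.0
    return min(perturbed_acc / baseline_acc, 1.0)


if __name__ == "__main__":
    # Made-up numbers, purely to show the call shape.
    print(robustness_ratio(baseline_acc=0.50, perturbed_acc=0.45))
```

Under this reading, a score near 1.00 means perturbation barely changed accuracy, while a lower score means a larger share of the baseline accuracy was lost.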

[Charts: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
| # | Agent | Acc | Robustness Agg | Fault | Struct | Prompt | Overall |
|---|-------|-----|----------------|-------|--------|--------|---------|
| 1 | Gemini 2.5 Flash | 37.8% | 0.98 | 1.00 | 1.00 | 0.93 | 0.76 |
| 2 | Gemini 2.5 Pro | 50.1% | 0.97 | 1.00 | 1.00 | 0.90 | 0.78 |
| 3 | Claude Opus 4.5 | 71.5% | 0.96 | 1.00 | 1.00 | 0.89 | 0.82 |
| 4 | Claude Sonnet 4.5 | 74.7% | 0.95 | 0.99 | 0.93 | 0.94 | 0.80 |
| 5 | GPT-5.2 (medium) | 42.6% | 0.95 | 0.97 | 1.00 | 0.88 | 0.74 |
| 6 | Claude 3.7 Sonnet | 62.4% | 0.92 | 0.93 | 0.95 | 0.87 | 0.77 |
| 7 | GPT-5.2 | 29.9% | 0.88 | 1.00 | 0.94 | 0.70 | 0.74 |
| 8 | GPT-4o Mini | 22.0% | 0.87 | 0.81 | 0.97 | 0.84 | 0.73 |
| 9 | GPT-4 Turbo | 20.0% | 0.81 | 0.87 | 0.76 | 0.82 | 0.76 |
| 10 | Claude 3.5 Haiku | 28.1% | 0.79 | 0.96 | 0.78 | 0.63 | 0.74 |
| 11 | Gemini 2.0 Flash | 27.9% | 0.76 | 0.88 | 0.67 | 0.73 | 0.70 |
| 12 | O1 | 34.7% | 0.72 | 0.77 | 0.78 | 0.60 | 0.72 |
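
One quick sanity check on the table: in every row, the Robustness Agg value sits within 0.01 of the plain mean of the Fault, Struct, and Prompt columns. That is only an observation from the rounded figures shown here, not a statement of how the leaderboard actually computes the aggregate. The sketch below re-derives it; `ROWS` is transcribed from the table, and `mean_of_submetrics` and `max_gap` are hypothetical helper names.

```python
# Rows transcribed from the leaderboard: (agent, agg, fault, struct, prompt).
ROWS = [
    ("Gemini 2.5 Flash", 0.98, 1.00, 1.00, 0.93),
    ("Gemini 2.5 Pro", 0.97, 1.00, 1.00, 0.90),
    ("Claude Opus 4.5", 0.96, 1.00, 1.00, 0.89),
    ("Claude Sonnet 4.5", 0.95, 0.99, 0.93, 0.94),
    ("GPT-5.2 (medium)", 0.95, 0.97, 1.00, 0.88),
    ("Claude 3.7 Sonnet", 0.92, 0.93, 0.95, 0.87),
    ("GPT-5.2", 0.88, 1.00, 0.94, 0.70),
    ("GPT-4o Mini", 0.87, 0.81, 0.97, 0.84),
    ("GPT-4 Turbo", 0.81, 0.87, 0.76, 0.82),
    ("Claude 3.5 Haiku", 0.79, 0.96, 0.78, 0.63),
    ("Gemini 2.0 Flash", 0.76, 0.88, 0.67, 0.73),
    ("O1", 0.72, 0.77, 0.78, 0.60),
]


def mean_of_submetrics(fault: float, struct: float, prompt: float) -> float:
    """Unweighted mean of the three sub-metric scores (an assumed formula)."""
    return (fault + struct + prompt) / 3


def max_gap(rows) -> float:
    """Largest absolute difference between the listed aggregate and the mean."""
    return max(
        abs(agg - mean_of_submetrics(fault, struct, prompt))
        for _, agg, fault, struct, prompt in rows
    )


if __name__ == "__main__":
    for agent, agg, fault, struct, prompt in ROWS:
        mean = mean_of_submetrics(fault, struct, prompt)
        print(f"{agent:<20} listed={agg:.2f} mean={mean:.3f}")
    print(f"largest gap: {max_gap(ROWS):.4f}")
```

Gaps this small are what rounding each column to two decimals would produce, so the table is consistent with an unweighted mean but does not prove it.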