Robustness on GAIA

How well does the agent maintain accuracy when inputs are perturbed (faults, structural changes, prompt rewording)?

[Charts: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
#   Agent              Acc     Reliability  Robustness (Agg)  Fault  Struct  Prompt
1   O1                 34.4%   0.79         1.00              1.00   1.00    1.00
2   Claude Opus 4.7    73.3%   0.84         0.99              1.00   1.00    0.96
3   GPT-4o Mini        26.3%   0.76         0.97              1.00   1.00    0.92
4   Gemini 3.5 Flash   79.2%   0.84         0.96              1.00   1.00    0.88
5   GPT-5.5            62.8%   0.79         0.95              1.00   0.96    0.91
6   Claude Sonnet 4    54.7%   0.82         0.94              1.00   1.00    0.83
7   Claude 3.5 Haiku   25.7%   0.82         0.94              1.00   1.00    0.82
8   Gemini 3.1 Pro     76.2%   0.82         0.94              0.99   0.96    0.86
9   GPT-5.2 (medium)   31.8%   0.72         0.91              1.00   1.00    0.74
10  Gemini 2.5 Pro     52.5%   0.78         0.91              0.95   0.97    0.82
11  Claude Opus 4.5    68.5%   0.85         0.91              0.99   0.98    0.77
12  Gemini 2.5 Flash   46.7%   0.74         0.87              0.97   0.96    0.69
13  GPT-4 Turbo        30.8%   0.76         0.87              1.00   0.91    0.71
14  Claude 3 Haiku     12.9%   0.69         0.86              1.00   0.94    0.66
15  GPT-5.2            33.2%   0.72         0.78              0.62   1.00    0.73
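
The page does not say how the aggregate robustness score is derived. From the numbers alone, Agg looks close to the unweighted mean of the Fault, Struct, and Prompt scores, up to rounding. The Python sketch below checks that assumption against the table and reports each agent's weakest sub-metric. The helper names (`deviation`, `weakest`) and the mean-of-three formula are illustrative assumptions, not the leaderboard's documented method.

```python
from statistics import mean

# (agent, agg, fault, struct, prompt), copied from the leaderboard above.
ROWS = [
    ("O1",               1.00, 1.00, 1.00, 1.00),
    ("Claude Opus 4.7",  0.99, 1.00, 1.00, 0.96),
    ("GPT-4o Mini",      0.97, 1.00, 1.00, 0.92),
    ("Gemini 3.5 Flash", 0.96, 1.00, 1.00, 0.88),
    ("GPT-5.5",          0.95, 1.00, 0.96, 0.91),
    ("Claude Sonnet 4",  0.94, 1.00, 1.00, 0.83),
    ("Claude 3.5 Haiku", 0.94, 1.00, 1.00, 0.82),
    ("Gemini 3.1 Pro",   0.94, 0.99, 0.96, 0.86),
    ("GPT-5.2 (medium)", 0.91, 1.00, 1.00, 0.74),
    ("Gemini 2.5 Pro",   0.91, 0.95, 0.97, 0.82),
    ("Claude Opus 4.5",  0.91, 0.99, 0.98, 0.77),
    ("Gemini 2.5 Flash", 0.87, 0.97, 0.96, 0.69),
    ("GPT-4 Turbo",      0.87, 1.00, 0.91, 0.71),
    ("Claude 3 Haiku",   0.86, 1.00, 0.94, 0.66),
    ("GPT-5.2",          0.78, 0.62, 1.00, 0.73),
]


def deviation(row):
    """Absolute gap between the published Agg and the plain mean of the sub-scores."""
    _, agg, fault, struct, prompt = row
    return abs(agg - mean([fault, struct, prompt]))


def weakest(row):
    """Name of the lowest sub-score for one agent (ties go to the first listed)."""
    _, _, fault, struct, prompt = row
    scores = {"Fault": fault, "Struct": struct, "Prompt": prompt}
    return min(scores, key=scores.get)


# Assumption under test: Agg is the unweighted mean, rounded to two decimals.
max_gap = max(deviation(r) for r in ROWS)
print(f"largest gap between Agg and mean of sub-scores: {max_gap:.4f}")

for row in ROWS:
    print(f"{row[0]:<18} weakest sub-metric: {weakest(row)}")
```

If the assumption holds, every gap stays below 0.01, which is what two-decimal rounding of the displayed values would allow. The weakest-sub-metric listing also makes the table's main pattern easy to see: most agents lose the most under prompt rewording, while the last-ranked GPT-5.2 entry is pulled down by its Fault score.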