Robustness on GAIA
How well does the agent maintain accuracy when inputs are perturbed (faults, structural changes, prompt rewording)?
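The scoring formula is not spelled out on this page. As a rough illustration only, one plausible way to quantify "maintaining accuracy under perturbation" is the fraction of baseline accuracy that survives the perturbation, capped at 1.0. The function name `robustness_ratio`, the cap, and the zero-baseline convention are all assumptions made for this sketch, not the benchmark's definition.

```python
def robustness_ratio(baseline_acc: float, perturbed_acc: float) -> float:
    """Hypothetical score: share of baseline accuracy retained, capped at 1.0.

    Both arguments are accuracies expressed as fractions in [0, 1].
    """
    for value in (baseline_acc, perturbed_acc):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must lie in [0, 1]")
    if baseline_acc == 0.0:
        # Assumption: an agent with nothing to lose counts as fully robust.
        return 1.0
    # Assumption: improving under perturbation is not rewarded beyond 1.0.
    return min(perturbed_acc / baseline_acc, 1.0)
```

Under this sketch, an agent that drops from 50% to 40% accuracy under rewording would score about 0.8, while one whose accuracy is unchanged would score 1.0.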
[Chart: Sub-metric Comparison]
[Chart: Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
| # | Agent | Accuracy | Reliability | Robustness (aggregate) | Fault | Structural | Prompt |
|---|---|---|---|---|---|---|---|
| 1 | | 34.4% | 0.79 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2 | | 73.3% | 0.84 | 0.99 | 1.00 | 1.00 | 0.96 |
| 3 | | 26.3% | 0.76 | 0.97 | 1.00 | 1.00 | 0.92 |
| 4 | | 79.2% | 0.84 | 0.96 | 1.00 | 1.00 | 0.88 |
| 5 | | 62.8% | 0.79 | 0.95 | 1.00 | 0.96 | 0.91 |
| 6 | | 54.7% | 0.82 | 0.94 | 1.00 | 1.00 | 0.83 |
| 7 | | 25.7% | 0.82 | 0.94 | 1.00 | 1.00 | 0.82 |
| 8 | | 76.2% | 0.82 | 0.94 | 0.99 | 0.96 | 0.86 |
| 9 | | 31.8% | 0.72 | 0.91 | 1.00 | 1.00 | 0.74 |
| 10 | | 52.5% | 0.78 | 0.91 | 0.95 | 0.97 | 0.82 |
| 11 | | 68.5% | 0.85 | 0.91 | 0.99 | 0.98 | 0.77 |
| 12 | | 46.7% | 0.74 | 0.87 | 0.97 | 0.96 | 0.69 |
| 13 | | 30.8% | 0.76 | 0.87 | 1.00 | 0.91 | 0.71 |
| 14 | | 12.9% | 0.69 | 0.86 | 1.00 | 0.94 | 0.66 |
| 15 | | 33.2% | 0.72 | 0.78 | 0.62 | 1.00 | 0.73 |
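
The page does not say how the aggregate robustness score is derived from the fault, structural, and prompt sub-metrics. The sketch below tests one hypothesis against the published rows: that the aggregate is the unweighted mean of the three. The values are copied from the leaderboard in rank order; the names `ROWS`, `mean_of_submetrics`, and `max_deviation` are illustrative, and the mean itself is an assumption rather than a documented formula.

```python
# (aggregate, fault, structural, prompt) for ranks 1-15, copied from the table.
ROWS = [
    (1.00, 1.00, 1.00, 1.00),
    (0.99, 1.00, 1.00, 0.96),
    (0.97, 1.00, 1.00, 0.92),
    (0.96, 1.00, 1.00, 0.88),
    (0.95, 1.00, 0.96, 0.91),
    (0.94, 1.00, 1.00, 0.83),
    (0.94, 1.00, 1.00, 0.82),
    (0.94, 0.99, 0.96, 0.86),
    (0.91, 1.00, 1.00, 0.74),
    (0.91, 0.95, 0.97, 0.82),
    (0.91, 0.99, 0.98, 0.77),
    (0.87, 0.97, 0.96, 0.69),
    (0.87, 1.00, 0.91, 0.71),
    (0.86, 1.00, 0.94, 0.66),
    (0.78, 0.62, 1.00, 0.73),
]


def mean_of_submetrics(fault: float, structural: float, prompt: float) -> float:
    """Assumed aggregate: unweighted mean of the three sub-metrics."""
    return (fault + structural + prompt) / 3


def max_deviation(rows) -> float:
    """Largest gap between a published aggregate and the assumed mean."""
    return max(abs(agg - mean_of_submetrics(f, s, p)) for agg, f, s, p in rows)


if __name__ == "__main__":
    # Every row agrees with the hypothesis to within 0.01, which is the
    # slack expected when both inputs and output are rounded to two decimals.
    print(f"max deviation: {max_deviation(ROWS):.4f}")
```

Agreement within rounding is consistent with a simple mean but does not confirm it; a weighted scheme with near-equal weights would fit these rows about as well.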