Consistency on GAIA
How reproducible are the agent's answers and trajectories across repeated runs on the same task?
[Figure: Sub-metric Comparison]
[Figure: Resource Consistency Breakdown]
Per-Agent Task Outcome Consistency
Each cell represents a task. Color shows outcome consistency across runs.
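
As an illustration of what a per-task cell value could look like, here is a minimal sketch that scores outcome consistency as the fraction of run pairs that agree. This is an assumption made for illustration: the page does not state the benchmark's actual formula, and the function name, task IDs, and run outcomes below are all hypothetical.

```python
from itertools import combinations


def outcome_consistency(outcomes):
    """Fraction of run pairs with the same outcome (1.0 = fully consistent).

    Hypothetical scoring rule; not taken from the benchmark's definition.
    """
    pairs = list(combinations(outcomes, 2))
    if not pairs:
        # A single run cannot disagree with itself.
        return 1.0
    agreeing = sum(a == b for a, b in pairs)
    return agreeing / len(pairs)


# Made-up pass/fail outcomes for five repeated runs per task.
runs_by_task = {
    "task-a": [True, True, True, True, True],
    "task-b": [True, False, True, False, True],
    "task-c": [False, False, False, False, False],
}

# One value per task, i.e. one heatmap cell per task.
cells = {task: outcome_consistency(runs) for task, runs in runs_by_task.items()}

for task, score in cells.items():
    print(f"{task}: {score:.2f}")
```

Under this rule a task that always fails is as consistent as one that always passes, which matches the idea that consistency is measured separately from accuracy.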
Agent Leaderboard — Consistency
| # | Agent | Acc | Consistency Agg | Outc | Traj-D | Traj-S | Res | Overall |
|---|---|---|---|---|---|---|---|---|
| 1 | | 34.7% | 0.72 | 0.70 | 0.80 | 0.72 | 0.69 | 0.72 |
| 2 | | 20.0% | 0.70 | 0.72 | 0.78 | 0.65 | 0.68 | 0.76 |
| 3 | | 28.1% | 0.70 | 0.64 | 0.83 | 0.73 | 0.68 | 0.74 |
| 4 | | 71.5% | 0.67 | 0.70 | 0.72 | 0.54 | 0.68 | 0.82 |
| 5 | | 22.0% | 0.63 | 0.64 | 0.66 | 0.54 | 0.66 | 0.73 |
| 6 | | 74.7% | 0.63 | 0.64 | 0.69 | 0.49 | 0.66 | 0.80 |
| 7 | | 50.1% | 0.62 | 0.60 | 0.74 | 0.57 | 0.62 | 0.78 |
| 8 | | 62.4% | 0.62 | 0.64 | 0.71 | 0.54 | 0.60 | 0.77 |
| 9 | | 29.9% | 0.62 | 0.58 | 0.76 | 0.56 | 0.61 | 0.74 |
| 10 | | 27.9% | 0.61 | 0.60 | 0.76 | 0.60 | 0.54 | 0.70 |
| 11 | | 37.8% | 0.58 | 0.52 | 0.69 | 0.57 | 0.60 | 0.76 |
| 12 | | 42.6% | 0.58 | 0.55 | 0.70 | 0.50 | 0.59 | 0.74 |
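
The page does not say how the Consistency Agg column is derived. Reading the numeric cells of each row, in order, as Acc, Consistency Agg, Outc, Traj-D, Traj-S, Res, and Overall, the aggregate appears consistent with an unweighted mean of three components: Outc, the average of Traj-D and Traj-S, and Res. The sketch below checks that hypothesis against the twelve rows. The formula is an inference from the published numbers, not a documented definition, and the function and variable names are illustrative.

```python
# (rank, consistency_agg, outc, traj_d, traj_s, res), copied from the leaderboard.
rows = [
    (1, 0.72, 0.70, 0.80, 0.72, 0.69),
    (2, 0.70, 0.72, 0.78, 0.65, 0.68),
    (3, 0.70, 0.64, 0.83, 0.73, 0.68),
    (4, 0.67, 0.70, 0.72, 0.54, 0.68),
    (5, 0.63, 0.64, 0.66, 0.54, 0.66),
    (6, 0.63, 0.64, 0.69, 0.49, 0.66),
    (7, 0.62, 0.60, 0.74, 0.57, 0.62),
    (8, 0.62, 0.64, 0.71, 0.54, 0.60),
    (9, 0.62, 0.58, 0.76, 0.56, 0.61),
    (10, 0.61, 0.60, 0.76, 0.60, 0.54),
    (11, 0.58, 0.52, 0.69, 0.57, 0.60),
    (12, 0.58, 0.55, 0.70, 0.50, 0.59),
]


def aggregate(outc, traj_d, traj_s, res):
    """Hypothesized aggregate: mean of outcome, averaged trajectory, resource."""
    traj = (traj_d + traj_s) / 2
    return (outc + traj + res) / 3


# Every input is rounded to two decimals, so small gaps are expected.
gaps = {rank: abs(aggregate(outc, traj_d, traj_s, res) - agg)
        for rank, agg, outc, traj_d, traj_s, res in rows}
max_gap = max(gaps.values())

print(f"largest gap between hypothesis and reported aggregate: {max_gap:.4f}")
```

Every row lands within rounding distance of the reported aggregate. A plain four-way mean of Outc, Traj-D, Traj-S, and Res fits less well: for row 3 it gives 0.72 against a reported 0.70.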