Consistency on GAIA

How reproducible are the agent's answers and trajectories across repeated runs on the same task?

Figures: Sub-metric Comparison; Resource Consistency Breakdown; Per-Agent Task Outcome Consistency.

Per-Agent Task Outcome Consistency: each cell represents a task, and color shows outcome consistency across runs.
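The page does not state how per-task outcome consistency is computed, so the following is only a minimal sketch under one assumed definition: the fraction of run pairs on a task that agree on success or failure. The function name `outcome_consistency`, the task IDs, and the run data are all hypothetical, and the leaderboard's own score may be defined differently.

```python
from itertools import combinations


def outcome_consistency(outcomes):
    """Assumed definition: share of run pairs with the same outcome on one task."""
    pairs = list(combinations(outcomes, 2))
    if not pairs:
        # With fewer than two runs there is nothing to disagree with.
        return 1.0
    agreeing = sum(a == b for a, b in pairs)
    return agreeing / len(pairs)


# Hypothetical success/failure outcomes for repeated runs of three tasks.
runs_by_task = {
    "task-a": [True, True, True, True],
    "task-b": [True, False, True, False],
    "task-c": [False, False, False, True],
}

# One value per task, i.e. one value per cell in a per-task grid.
per_task = {task: outcome_consistency(o) for task, o in runs_by_task.items()}

# An unweighted mean over tasks is one possible agent-level summary.
agent_score = sum(per_task.values()) / len(per_task)
```

Under this assumed definition, a task that always succeeds and a task that always fails both score 1.0, so the measure is independent of accuracy. That fits the table's pattern of low-accuracy agents ranking high on consistency.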

Agent Leaderboard — Consistency
| # | Agent | Acc | Consistency Agg | Outc | Traj-D | Traj-S | Res | Overall |
|---|-------|-----|-----------------|------|--------|--------|-----|---------|
| 1 | O1 | 34.7% | 0.72 | 0.70 | 0.80 | 0.72 | 0.69 | 0.72 |
| 2 | GPT-4 Turbo | 20.0% | 0.70 | 0.72 | 0.78 | 0.65 | 0.68 | 0.76 |
| 3 | Claude 3.5 Haiku | 28.1% | 0.70 | 0.64 | 0.83 | 0.73 | 0.68 | 0.74 |
| 4 | Claude Opus 4.5 | 71.5% | 0.67 | 0.70 | 0.72 | 0.54 | 0.68 | 0.82 |
| 5 | GPT-4o Mini | 22.0% | 0.63 | 0.64 | 0.66 | 0.54 | 0.66 | 0.73 |
| 6 | Claude Sonnet 4.5 | 74.7% | 0.63 | 0.64 | 0.69 | 0.49 | 0.66 | 0.80 |
| 7 | Gemini 2.5 Pro | 50.1% | 0.62 | 0.60 | 0.74 | 0.57 | 0.62 | 0.78 |
| 8 | Claude 3.7 Sonnet | 62.4% | 0.62 | 0.64 | 0.71 | 0.54 | 0.60 | 0.77 |
| 9 | GPT-5.2 | 29.9% | 0.62 | 0.58 | 0.76 | 0.56 | 0.61 | 0.74 |
| 10 | Gemini 2.0 Flash | 27.9% | 0.61 | 0.60 | 0.76 | 0.60 | 0.54 | 0.70 |
| 11 | Gemini 2.5 Flash | 37.8% | 0.58 | 0.52 | 0.69 | 0.57 | 0.60 | 0.76 |
| 12 | GPT-5.2 (medium) | 42.6% | 0.58 | 0.55 | 0.70 | 0.50 | 0.59 | 0.74 |