Consistency on GAIA

How reproducible are the agent's answers and trajectories across repeated runs on the same task?

[Figures: Sub-metric Comparison; Resource Consistency Breakdown; Per-Agent Task Outcome Consistency]

Per-Agent Task Outcome Consistency: each cell represents a task, and its color shows outcome consistency across runs.

Agent Leaderboard — Consistency
#   Agent              Acc    Reliability  Consistency Agg  Outc  Traj-D  Traj-S  Res
1   O1                 34.7%  0.72         0.72             0.70  0.80    0.72    0.69
2   GPT-4 Turbo        20.0%  0.76         0.70             0.72  0.78    0.65    0.68
3   Claude 3.5 Haiku   28.1%  0.74         0.70             0.64  0.83    0.73    0.68
4   Claude Opus 4.5    71.5%  0.82         0.67             0.70  0.72    0.54    0.68
5   GPT-4o Mini        22.0%  0.73         0.63             0.64  0.66    0.54    0.66
6   Claude Sonnet 4.5  74.7%  0.80         0.63             0.64  0.69    0.49    0.66
7   Gemini 2.5 Pro     50.1%  0.78         0.62             0.60  0.74    0.57    0.62
8   Claude 3.7 Sonnet  62.4%  0.77         0.62             0.64  0.71    0.54    0.60
9   GPT-5.2            29.9%  0.74         0.62             0.58  0.76    0.56    0.61
10  Gemini 2.0 Flash   27.9%  0.70         0.61             0.60  0.76    0.60    0.54
11  Gemini 2.5 Flash   37.8%  0.76         0.58             0.52  0.69    0.57    0.60
12  GPT-5.2 (medium)   42.6%  0.74         0.58             0.55  0.70    0.50    0.59
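
The page does not state how the Consistency Agg column is derived from the four sub-metrics. As a rough sanity check, the sketch below tests one candidate formula against the published rows: average Traj-D and Traj-S first, then average that with Outc and Res. This formula is an assumption inferred from the numbers, not the benchmark's documented definition, and the function names are hypothetical. Under it, all twelve rows agree to within 0.01, which is what rounding to two decimals would allow. A flat four-way mean does not fit as well (Claude 3.5 Haiku is off by about 0.02).

```python
# Rows copied from the leaderboard: (agent, agg, outc, traj_d, traj_s, res).
ROWS = [
    ("O1",                0.72, 0.70, 0.80, 0.72, 0.69),
    ("GPT-4 Turbo",       0.70, 0.72, 0.78, 0.65, 0.68),
    ("Claude 3.5 Haiku",  0.70, 0.64, 0.83, 0.73, 0.68),
    ("Claude Opus 4.5",   0.67, 0.70, 0.72, 0.54, 0.68),
    ("GPT-4o Mini",       0.63, 0.64, 0.66, 0.54, 0.66),
    ("Claude Sonnet 4.5", 0.63, 0.64, 0.69, 0.49, 0.66),
    ("Gemini 2.5 Pro",    0.62, 0.60, 0.74, 0.57, 0.62),
    ("Claude 3.7 Sonnet", 0.62, 0.64, 0.71, 0.54, 0.60),
    ("GPT-5.2",           0.62, 0.58, 0.76, 0.56, 0.61),
    ("Gemini 2.0 Flash",  0.61, 0.60, 0.76, 0.60, 0.54),
    ("Gemini 2.5 Flash",  0.58, 0.52, 0.69, 0.57, 0.60),
    ("GPT-5.2 (medium)",  0.58, 0.55, 0.70, 0.50, 0.59),
]


def nested_mean(outc, traj_d, traj_s, res):
    """Assumed formula: pool the two trajectory scores, then average three parts."""
    return (outc + (traj_d + traj_s) / 2 + res) / 3


def flat_mean(outc, traj_d, traj_s, res):
    """Alternative for comparison: plain mean of all four sub-metrics."""
    return (outc + traj_d + traj_s + res) / 4


def max_deviation(formula):
    """Largest absolute gap between a formula's output and the published Agg."""
    return max(abs(formula(*row[2:]) - row[1]) for row in ROWS)


if __name__ == "__main__":
    print("nested mean, max deviation:", round(max_deviation(nested_mean), 4))
    print("flat mean,   max deviation:", round(max_deviation(flat_mean), 4))
```

Because the inputs are themselves rounded to two decimals, a fit within 0.01 is consistent with the assumed formula but does not prove it. Other weightings could match equally well.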