Consistency on GAIA

How reproducible are the agent's answers and trajectories across repeated runs on the same task?

Panels: Sub-metric Comparison; Resource Consistency Breakdown; Per-Agent Task Outcome Consistency

Per-Agent Task Outcome Consistency: each cell represents a task, and its color shows outcome consistency across runs.
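The page does not say how the per-task outcome consistency behind each cell is computed. As a minimal sketch, assuming it can be approximated by pairwise agreement between runs, the Python below scores one task by the fraction of run pairs that ended with the same outcome. The function name `outcome_consistency`, the task IDs, and the run data are all hypothetical illustrations, not values from the benchmark.

```python
from itertools import combinations


def outcome_consistency(outcomes):
    """Fraction of run pairs that share the same outcome (1.0 = every run agrees).

    Assumed stand-in for the page's metric; the real definition may differ.
    """
    pairs = list(combinations(outcomes, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure consistency")
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical pass/fail outcomes for four repeated runs of three tasks.
runs_by_task = {
    "task-a": [True, True, True, True],      # always solved
    "task-b": [True, False, True, False],    # solved half the time
    "task-c": [False, False, False, False],  # never solved
}

per_task = {task: outcome_consistency(runs) for task, runs in runs_by_task.items()}
for task, score in per_task.items():
    print(f"{task}: {score:.2f}")
```

Under this assumed definition a task that fails on every run scores as high as one that succeeds on every run, which is one way a low-accuracy agent can still show high outcome consistency.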

Agent Leaderboard — Consistency
#   Agent             Acc    Reliability  Consistency (Agg)  Outc  Traj-D  Traj-S  Res
1   Claude Opus 4.5   68.5%  0.85         0.81               0.84  0.89    0.75    0.77
2   Claude Opus 4.7   73.3%  0.84         0.81               0.84  0.87    0.74    0.77
3   Claude 3.5 Haiku  25.7%  0.82         0.80               0.82  0.89    0.76    0.75
4   Claude Sonnet 4   54.7%  0.82         0.78               0.76  0.87    0.75    0.76
5   GPT-4 Turbo       30.8%  0.76         0.76               0.73  0.90    0.79    0.71
6   Gemini 3.1 Pro    76.2%  0.82         0.76               0.83  0.81    0.61    0.73
7   Gemini 3.5 Flash  79.2%  0.84         0.75               0.84  0.82    0.61    0.69
8   GPT-4o Mini       26.3%  0.76         0.73               0.75  0.87    0.72    0.65
9   Gemini 2.5 Pro    52.5%  0.78         0.72               0.73  0.87    0.71    0.65
10  Claude 3 Haiku    12.9%  0.69         0.71               0.84  0.78    0.55    0.61
11  Gemini 2.5 Flash  46.7%  0.74         0.70               0.73  0.82    0.64    0.64
12  O1                34.4%  0.79         0.68               0.58  0.85    0.78    0.65
13  GPT-5.2 (medium)  31.8%  0.72         0.62               0.65  0.74    0.56    0.56
14  GPT-5.5           62.8%  0.79         0.61               0.60  0.71    0.53    0.61
15  GPT-5.2           33.2%  0.72         0.59               0.58  0.77    0.54    0.54
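
The page does not state how the aggregate consistency column is derived from the four sub-metrics. The rounded values in the leaderboard are numerically consistent with one candidate rule: average Traj-D and Traj-S into a single trajectory score, then take the unweighted mean of Outc, that trajectory score, and Res. This is an inference from the published two-decimal numbers, not a definition given on this page, and the name `candidate_aggregate` is hypothetical. A short Python check against the leaderboard rows:

```python
# (agent, agg, outc, traj_d, traj_s, res), copied from the leaderboard rows.
ROWS = [
    ("Claude Opus 4.5",  0.81, 0.84, 0.89, 0.75, 0.77),
    ("Claude Opus 4.7",  0.81, 0.84, 0.87, 0.74, 0.77),
    ("Claude 3.5 Haiku", 0.80, 0.82, 0.89, 0.76, 0.75),
    ("Claude Sonnet 4",  0.78, 0.76, 0.87, 0.75, 0.76),
    ("GPT-4 Turbo",      0.76, 0.73, 0.90, 0.79, 0.71),
    ("Gemini 3.1 Pro",   0.76, 0.83, 0.81, 0.61, 0.73),
    ("Gemini 3.5 Flash", 0.75, 0.84, 0.82, 0.61, 0.69),
    ("GPT-4o Mini",      0.73, 0.75, 0.87, 0.72, 0.65),
    ("Gemini 2.5 Pro",   0.72, 0.73, 0.87, 0.71, 0.65),
    ("Claude 3 Haiku",   0.71, 0.84, 0.78, 0.55, 0.61),
    ("Gemini 2.5 Flash", 0.70, 0.73, 0.82, 0.64, 0.64),
    ("O1",               0.68, 0.58, 0.85, 0.78, 0.65),
    ("GPT-5.2 (medium)", 0.62, 0.65, 0.74, 0.56, 0.56),
    ("GPT-5.5",          0.61, 0.60, 0.71, 0.53, 0.61),
    ("GPT-5.2",          0.59, 0.58, 0.77, 0.54, 0.54),
]


def candidate_aggregate(outc, traj_d, traj_s, res):
    """Assumed rule: mean of outcome, averaged trajectory, and resource scores."""
    traj = (traj_d + traj_s) / 2
    return (outc + traj + res) / 3


# Gap between the candidate rule and the published aggregate, per agent.
gaps = {
    agent: abs(candidate_aggregate(outc, d, s, res) - agg)
    for agent, agg, outc, d, s, res in ROWS
}
max_gap = max(gaps.values())
print(f"largest gap across {len(ROWS)} agents: {max_gap:.4f}")
```

Every row lands within 0.006 of the published aggregate, which is about what two-decimal rounding of the inputs and output allows. A plain mean of the four sub-metrics fits worse: for O1 it gives about 0.72 against a published 0.68.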