Predictability on GAIA

How well does the agent's expressed confidence predict whether it will answer correctly?
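
The leaderboard's AUROC and Brier sub-metrics are both standard scores computed over per-episode (confidence, correct) pairs. Below is a minimal sketch of the two, assuming each episode yields a self-reported confidence in [0, 1] and a binary correctness label; the sample data and the scikit-learn calls are illustrative, not the benchmark harness. Note that the raw Brier score is lower-is-better, so the Brier column below presumably reports a rescaled variant.

```python
# Minimal sketch: scoring predictability from (confidence, correct) pairs.
# The sample data is hypothetical; a real run would use per-episode records.
from sklearn.metrics import brier_score_loss, roc_auc_score

confidences = [0.95, 0.80, 0.60, 0.30, 0.90, 0.20]  # agent's stated confidence per episode
correct     = [1,    1,    0,    0,    1,    0]      # 1 = answered correctly

# AUROC: does confidence rank correct answers above incorrect ones?
auroc = roc_auc_score(correct, confidences)

# Brier: mean squared error between confidence and outcome (lower is better).
brier = brier_score_loss(correct, confidences)

print(f"AUROC: {auroc:.3f}  Brier: {brier:.3f}")
```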

[Charts: Sub-metric Comparison, plus per-agent Calibration Curves and Accuracy-Coverage Curves for Claude 3.5 Haiku, Claude 3.7 Sonnet, Claude Opus 4.5, Claude Sonnet 4.5, GPT-5.2, GPT-5.2 (medium), GPT-4 Turbo, GPT-4o mini, Gemini 2.0 Flash, Gemini 2.5 Flash, Gemini 2.5 Pro, and o1.]
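
Both chart types can be computed from the same per-episode records. A hedged sketch follows; the ten-bin reliability binning and the helper names are assumptions, not the site's exact implementation.

```python
# Sketch of the two plotted curves; bin count and function names are assumptions.
import numpy as np

def calibration_curve(confidences, correct, n_bins=10):
    """Mean confidence vs. empirical accuracy per bin (a reliability diagram)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    points = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            points.append((confidences[mask].mean(), correct[mask].mean()))
    return points  # a perfectly calibrated agent lies on the diagonal y = x

def accuracy_coverage_curve(confidences, correct):
    """Accuracy over the top-k most-confident answers, at every coverage k/n."""
    order = np.argsort(confidences)[::-1]  # most confident first
    hits = np.asarray(correct, dtype=float)[order]
    k = np.arange(1, len(hits) + 1)
    return list(zip(k / len(hits), np.cumsum(hits) / k))
```

If the agent's confidence is informative, the accuracy-coverage curve stays high at low coverage and decays toward overall accuracy as coverage approaches 1.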
Agent Leaderboard — Predictability
| # | Agent | Accuracy | Predictability (agg.) | Cal | AUROC | Brier | Overall |
|---|-------|----------|-----------------------|-----|-------|-------|---------|
| 1 | Claude Opus 4.5 | 71.5% | 0.82 | 0.92 | 0.72 | 0.82 | 0.82 |
| 2 | Claude Sonnet 4.5 | 74.7% | 0.81 | 0.91 | 0.66 | 0.81 | 0.80 |
| 3 | Claude 3.7 Sonnet | 62.4% | 0.77 | 0.87 | 0.67 | 0.77 | 0.77 |
| 4 | GPT-4 Turbo | 20.0% | 0.75 | 0.69 | 0.84 | 0.75 | 0.76 |
| 5 | Gemini 2.5 Pro | 50.1% | 0.74 | 0.77 | 0.75 | 0.74 | 0.78 |
| 6 | o1 | 34.7% | 0.74 | 0.70 | 0.82 | 0.74 | 0.72 |
| 7 | GPT-5.2 | 29.9% | 0.72 | 0.74 | 0.73 | 0.72 | 0.74 |
| 8 | Gemini 2.5 Flash | 37.8% | 0.72 | 0.72 | 0.83 | 0.72 | 0.76 |
| 9 | Gemini 2.0 Flash | 27.9% | 0.72 | 0.66 | 0.83 | 0.72 | 0.70 |
| 10 | Claude 3.5 Haiku | 28.1% | 0.72 | 0.70 | 0.72 | 0.72 | 0.74 |
| 11 | GPT-5.2 (medium) | 42.6% | 0.70 | 0.74 | 0.65 | 0.70 | 0.74 |
| 12 | GPT-4o mini | 22.0% | 0.69 | 0.60 | 0.79 | 0.69 | 0.73 |
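
The Cal column is a calibration summary. One common choice is expected calibration error (ECE), sketched below with equal-width bins; whether this leaderboard uses ECE, and how it is rescaled so that higher values are better, is an assumption here rather than a documented fact.

```python
# Hedged sketch of expected calibration error (ECE); equal-width bins are an
# assumption, and the leaderboard's exact "Cal" definition may differ.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weight-averaged gap between mean confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece  # 0.0 = perfectly calibrated; e.g. 1 - ECE makes higher better
```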