Predictability on GAIA

How well does the agent's expressed confidence predict whether it will answer correctly?

[Figure: Sub-metric Comparison]

[Figure: Calibration Curves, shown for Claude 3 Haiku, Claude 3.5 Haiku, Claude 4 Sonnet, Claude 4.5 Opus, Claude 4.7 Opus, GPT 5.2, GPT 5.2 (medium), GPT 5.5, GPT-4 Turbo, GPT-4o mini, Gemini 2.5 Flash, Gemini 2.5 Pro, Gemini 3.1 Pro, Gemini 3.5 Flash, o1]

[Figure: Accuracy-Coverage Curves, shown for the same 15 agents]
Agent Leaderboard — Predictability
| # | Agent | Accuracy | Reliability | Predictability (Agg) | Calibration | AUROC | Brier |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 68.5% | 0.85 | 0.84 | 0.97 | 0.80 | 0.84 |
| 2 | GPT-5.5 | 62.8% | 0.79 | 0.80 | 0.88 | 0.76 | 0.80 |
| 3 | Gemini 3.5 Flash | 79.2% | 0.84 | 0.80 | 0.79 | 0.57 | 0.80 |
| 4 | Gemini 3.1 Pro | 76.2% | 0.82 | 0.78 | 0.79 | 0.72 | 0.78 |
| 5 | GPT-5.2 | 33.2% | 0.72 | 0.78 | 0.81 | 0.80 | 0.78 |
| 6 | Claude Sonnet 4 | 54.7% | 0.82 | 0.73 | 0.77 | 0.71 | 0.73 |
| 7 | Claude Opus 4.7 | 73.3% | 0.84 | 0.73 | 0.76 | 0.63 | 0.73 |
| 8 | Claude 3.5 Haiku | 25.7% | 0.82 | 0.72 | 0.67 | 0.76 | 0.72 |
| 9 | Gemini 2.5 Pro | 52.5% | 0.78 | 0.71 | 0.71 | 0.75 | 0.71 |
| 10 | O1 | 34.4% | 0.79 | 0.68 | 0.66 | 0.76 | 0.68 |
| 11 | Gemini 2.5 Flash | 46.7% | 0.74 | 0.64 | 0.63 | 0.77 | 0.64 |
| 12 | GPT-4 Turbo | 30.8% | 0.76 | 0.64 | 0.60 | 0.75 | 0.64 |
| 13 | GPT-5.2 (medium) | 31.8% | 0.72 | 0.61 | 0.61 | 0.61 | 0.61 |
| 14 | GPT-4o Mini | 26.3% | 0.76 | 0.58 | 0.52 | 0.72 | 0.58 |
| 15 | Claude 3 Haiku | 12.9% | 0.69 | 0.49 | 0.38 | 0.73 | 0.49 |
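
The page does not state how the calibration, AUROC, and Brier columns are computed, and the scores above all read as higher-is-better, so they are presumably rescaled versions of the raw quantities. As a rough illustration only, the sketch below computes the standard textbook forms (expected calibration error, AUROC, and Brier score) from per-task confidences and 0/1 correctness labels. The function names, the 10-bin setting, and the sample data are assumptions made for this example, not the leaderboard's actual pipeline or results.

```python
from typing import List, Sequence, Tuple


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome (lower is better)."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Size-weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 falls in the last bin
        bins[idx].append((c, y))
    total = len(conf)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / total * abs(accuracy - mean_conf)
    return ece


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a random correct answer got higher confidence than a random
    incorrect one, with ties counted as half."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: six tasks with the agent's stated confidence and whether it was right.
confidences = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
outcomes = [1, 1, 0, 1, 0, 0]

print("Brier:", round(brier_score(confidences, outcomes), 3))
print("ECE:  ", round(expected_calibration_error(confidences, outcomes), 3))
print("AUROC:", round(auroc(confidences, outcomes), 3))
```

Calibration and discrimination are separate properties: an agent can rank its correct answers above its incorrect ones (high AUROC) while still being systematically over- or under-confident (high calibration error), which is one plausible reason the columns above do not move together.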