Predictability on GAIA
How well does the agent's expressed confidence predict whether it will answer correctly?
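As an illustration of the question above, the sketch below computes two standard confidence-quality measures, the Brier score and AUROC, from per-question confidences and correctness labels. It uses the textbook definitions only; this page does not state how its own sub-metrics are computed or scaled, so the leaderboard columns may be normalized differently. The function names and the sample data are hypothetical.

```python
from typing import Sequence


def brier_score(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome.

    Textbook form: 0.0 is perfect, lower is better.
    """
    return sum((c - float(y)) ** 2 for c, y in zip(conf, correct)) / len(conf)


def auroc(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Chance that a random correct answer got higher confidence than a
    random incorrect one. Ties count as half a win."""
    pos = [c for c, y in zip(conf, correct) if y]
    neg = [c for c, y in zip(conf, correct) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: four questions with stated confidence and outcome.
conf = [0.9, 0.8, 0.6, 0.3]
correct = [True, False, True, False]

print(round(brier_score(conf, correct), 3))  # 0.225
print(auroc(conf, correct))                  # 0.75
```

The pairwise AUROC loop is quadratic in the number of questions, which is fine for a benchmark-sized run; a rank-based formula would be the usual choice for larger sets.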
Sub-metric Comparison
Calibration Curves
[Figure: one calibration curve per agent. Panels: Claude 3 Haiku, Claude 3.5 Haiku, Claude 4 Sonnet, Claude 4.5 Opus, Claude 4.7 Opus, GPT 5.2, GPT 5.2 (medium), GPT 5.5, GPT-4 Turbo, GPT-4o mini, Gemini 2.5 Flash, Gemini 2.5 Pro, Gemini 3.1 Pro, Gemini 3.5 Flash, o1]
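A calibration curve of this kind is usually built by grouping answers into confidence bins and comparing each bin's mean confidence with its observed accuracy. The sketch below shows one common way to do that, along with the expected calibration error (ECE) that summarizes the gap. The equal-width binning, the bin count, and all names are assumptions for illustration, not this page's documented method.

```python
from typing import List, Sequence, Tuple


def calibration_curve(
    conf: Sequence[float], correct: Sequence[bool], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, accuracy, count) for each non-empty
    equal-width bin. Confidences are assumed to lie in [0, 1]."""
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, float(y)))
    curve = []
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(y for _, y in items) / len(items)
            curve.append((mean_conf, accuracy, len(items)))
    return curve


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[bool], n_bins: int = 5
) -> float:
    """Count-weighted average of |mean confidence - accuracy| over bins."""
    total = len(conf)
    return sum(
        n / total * abs(mean_conf - accuracy)
        for mean_conf, accuracy, n in calibration_curve(conf, correct, n_bins)
    )


# Hypothetical data: five questions.
conf = [0.95, 0.9, 0.55, 0.5, 0.1]
correct = [True, True, True, False, False]

for mean_conf, accuracy, n in calibration_curve(conf, correct):
    print(f"conf={mean_conf:.3f} acc={accuracy:.2f} n={n}")
print(round(expected_calibration_error(conf, correct), 3))  # 0.06
```

A perfectly calibrated agent would have every point on the diagonal, where mean confidence equals accuracy, and an ECE of 0.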
Accuracy-Coverage Curves
[Figure: one accuracy-coverage curve per agent. Panels: Claude 3 Haiku, Claude 3.5 Haiku, Claude 4 Sonnet, Claude 4.5 Opus, Claude 4.7 Opus, GPT 5.2, GPT 5.2 (medium), GPT 5.5, GPT-4 Turbo, GPT-4o mini, Gemini 2.5 Flash, Gemini 2.5 Pro, Gemini 3.1 Pro, Gemini 3.5 Flash, o1]
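An accuracy-coverage curve is typically read as follows: rank the answers from most to least confident, keep only the top fraction (the coverage), and report accuracy on that subset. The sketch below is one minimal way to produce the points, assuming per-question confidences and correctness labels are available. The function name and sample data are hypothetical, and the page's own plotting procedure is not documented here.

```python
from typing import List, Sequence, Tuple


def accuracy_coverage_curve(
    conf: Sequence[float], correct: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (coverage, accuracy) after keeping the k most confident
    answers, for k = 1 .. number of questions."""
    ranked = sorted(zip(conf, correct), key=lambda pair: pair[0], reverse=True)
    points = []
    hits = 0
    for k, (_, y) in enumerate(ranked, start=1):
        hits += int(y)  # running count of correct answers in the top k
        points.append((k / len(ranked), hits / k))
    return points


# Hypothetical data: four questions.
conf = [0.9, 0.8, 0.6, 0.3]
correct = [True, False, True, False]

for coverage, accuracy in accuracy_coverage_curve(conf, correct):
    print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

If confidence is informative, accuracy should be highest at low coverage and fall toward the agent's overall accuracy as coverage approaches 1.0.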
Agent Leaderboard — Predictability
| # | Agent | Accuracy | Reliability | Predictability (agg.) | Calibration | AUROC | Brier |
|---|---|---|---|---|---|---|---|
| 1 | | 68.5% | 0.85 | 0.84 | 0.97 | 0.80 | 0.84 |
| 2 | | 62.8% | 0.79 | 0.80 | 0.88 | 0.76 | 0.80 |
| 3 | | 79.2% | 0.84 | 0.80 | 0.79 | 0.57 | 0.80 |
| 4 | | 76.2% | 0.82 | 0.78 | 0.79 | 0.72 | 0.78 |
| 5 | | 33.2% | 0.72 | 0.78 | 0.81 | 0.80 | 0.78 |
| 6 | | 54.7% | 0.82 | 0.73 | 0.77 | 0.71 | 0.73 |
| 7 | | 73.3% | 0.84 | 0.73 | 0.76 | 0.63 | 0.73 |
| 8 | | 25.7% | 0.82 | 0.72 | 0.67 | 0.76 | 0.72 |
| 9 | | 52.5% | 0.78 | 0.71 | 0.71 | 0.75 | 0.71 |
| 10 | | 34.4% | 0.79 | 0.68 | 0.66 | 0.76 | 0.68 |
| 11 | | 46.7% | 0.74 | 0.64 | 0.63 | 0.77 | 0.64 |
| 12 | | 30.8% | 0.76 | 0.64 | 0.60 | 0.75 | 0.64 |
| 13 | | 31.8% | 0.72 | 0.61 | 0.61 | 0.61 | 0.61 |
| 14 | | 26.3% | 0.76 | 0.58 | 0.52 | 0.72 | 0.58 |
| 15 | | 12.9% | 0.69 | 0.49 | 0.38 | 0.73 | 0.49 |