Predictability on τ-bench (airline, clean)

How well does the agent's expressed confidence predict whether it will answer correctly?
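Concretely, the agent reports a confidence for each task alongside its answer, and predictability scores how well those confidences track actual correctness. The sketch below is an illustrative assumption of how such a score could be computed from per-task (confidence, correct) pairs via the Brier score and AUROC; the function name and the higher-is-better rescaling are made up for illustration, not the benchmark's official implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def predictability_sketch(confidences, correct):
    """Score how well stated confidences track correctness (illustrative only).

    confidences: the agent's stated confidence per task, in [0, 1].
    correct:     1 if the task was solved, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Brier score: mean squared gap between confidence and outcome (lower is better).
    brier = float(np.mean((confidences - correct) ** 2))

    # AUROC: how well confidence ranks solved tasks above failed ones.
    auroc = float(roc_auc_score(correct, confidences))

    # Rescale Brier so that higher is better (an assumed convention, not the official one).
    return {"brier": brier, "brier_higher_is_better": 1.0 - brier, "auroc": auroc}
```

Run over a per-task log of stated confidences and pass/fail outcomes, this yields the kind of sub-metric values reported in the leaderboard below.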

Sub-metric Comparison
Calibration Curves
[Figure: calibration curves, one per evaluated agent (13 models, listed in the leaderboard below)]
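A calibration curve plots, for each band of stated confidence, the fraction of tasks the agent actually solved; a well-calibrated agent tracks the diagonal. A minimal sketch follows, assuming ten equal-width confidence bins (a common choice, not necessarily the one used here):

```python
import numpy as np

def calibration_curve_sketch(confidences, correct, n_bins=10):
    """One (mean confidence, empirical accuracy) point per confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each task to a bin; clip so a confidence of exactly 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)

    points = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            points.append((confidences[mask].mean(),  # x: mean stated confidence in the bin
                           correct[mask].mean()))     # y: fraction actually solved
    return points  # plot these against the y = x diagonal
```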
Accuracy-Coverage Curves
[Figure: accuracy-coverage curves, one per evaluated agent (13 models, listed in the leaderboard below)]
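An accuracy-coverage curve shows the accuracy an agent would reach if it answered only on the fraction of tasks where its stated confidence is highest and abstained on the rest; a reliable confidence signal keeps accuracy high as coverage shrinks. A minimal sketch, assuming answers are kept in order of decreasing confidence:

```python
import numpy as np

def accuracy_coverage_sketch(confidences, correct):
    """Accuracy as a function of coverage when low-confidence answers are dropped first."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    order = np.argsort(-confidences)        # most confident tasks first
    kept = correct[order]
    n = len(kept)

    answered = np.arange(1, n + 1)
    coverage = answered / n                 # fraction of tasks the agent answers
    accuracy = np.cumsum(kept) / answered   # accuracy on the answered subset
    return coverage, accuracy
```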
Agent Leaderboard — Predictability
| Rank | Agent | Accuracy | Predictability (agg.) | Calibration | AUROC | Brier | Overall |
|------|-------|----------|------------------------|-------------|-------|-------|---------|
| 1 | Claude Opus 4.5 | 83.1% | 0.87 | 0.93 | 0.68 | 0.87 | 0.88 |
| 2 | Claude Sonnet 4.5 | 78.5% | 0.84 | 0.90 | 0.68 | 0.84 | 0.85 |
| 3 | Gemini 3.0 Pro | 80.8% | 0.81 | 0.82 | 0.52 | 0.81 | 0.85 |
| 4 | GPT-5.2 (xhigh) | 67.7% | 0.78 | 0.81 | 0.75 | 0.78 | 0.81 |
| 5 | Gemini 2.5 Pro | 73.8% | 0.77 | 0.77 | 0.70 | 0.77 | 0.79 |
| 6 | o1 | 72.3% | 0.75 | 0.77 | 0.45 | 0.75 | 0.80 |
| 7 | GPT-5.2 | 59.2% | 0.68 | 0.71 | 0.62 | 0.68 | 0.80 |
| 8 | Claude 3.7 Sonnet | 56.2% | 0.63 | 0.65 | 0.48 | 0.63 | 0.75 |
| 9 | Gemini 2.5 Flash | 65.4% | 0.61 | 0.62 | 0.53 | 0.61 | 0.73 |
| 10 | Claude 3.5 Haiku | 42.3% | 0.53 | 0.53 | 0.42 | 0.53 | 0.68 |
| 11 | GPT-4 Turbo | 50.0% | 0.51 | 0.52 | 0.45 | 0.51 | 0.71 |
| 12 | Gemini 2.0 Flash | 44.6% | 0.48 | 0.46 | 0.56 | 0.48 | 0.70 |
| 13 | GPT-4o mini | 32.1% | 0.41 | 0.39 | 0.48 | 0.41 | 0.69 |