Predictability on τ-bench (airline, original)

How well does the agent's expressed confidence predict whether it will answer correctly?
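
The predictability views and scores below are functions of per-episode (expressed confidence, task success) pairs. The following is a minimal sketch of the two threshold-free sub-metrics reported in the leaderboard, Brier score and AUROC, assuming each τ-bench episode yields such a pair; the function names and toy data are illustrative, not the benchmark harness's actual API.

```python
# Sketch of the Brier and AUROC sub-metrics, assuming each episode yields a pair
# (expressed confidence in [0, 1], task success in {0, 1}).
# Names and data are illustrative, not the benchmark harness's API.
import numpy as np

def brier_score(confidence: np.ndarray, correct: np.ndarray) -> float:
    """Mean squared error between stated confidence and the 0/1 outcome (lower is better)."""
    return float(np.mean((confidence - correct) ** 2))

def auroc(confidence: np.ndarray, correct: np.ndarray) -> float:
    """Probability that a successful episode gets higher confidence than a failed one
    (ties count as 0.5); 0.5 is chance-level ranking, 1.0 is perfect ranking."""
    pos = confidence[correct == 1]
    neg = confidence[correct == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)

# Toy example: six episodes.
conf = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
ok = np.array([1, 1, 0, 1, 0, 0])
print(brier_score(conf, ok), auroc(conf, ok))  # ~0.15, ~0.89
```

Note that the raw Brier score is a loss (lower is better); the sub-metric columns in the leaderboard below appear to be reported on a common higher-is-better scale.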

Sub-metric Comparison

Calibration Curves
[Per-agent calibration curves for the 13 models listed in the leaderboard below.]
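
A calibration curve like the ones above can be built from the same per-episode data by binning episodes on expressed confidence and comparing mean confidence with empirical success rate in each bin. The bin count below is an illustrative choice, not necessarily the benchmark's setting, and the leaderboard's Cal sub-metric is presumably a summary of how far these points sit from the diagonal.

```python
# Sketch of a reliability (calibration) curve: bin episodes by expressed confidence
# and compare mean confidence with empirical success rate per bin. A perfectly
# calibrated agent lies on the diagonal. Bin count is an illustrative choice.
import numpy as np

def calibration_curve(confidence, correct, n_bins: int = 10):
    """Return (mean confidence, empirical accuracy, episode count) per non-empty bin."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run 0..n_bins-1 and confidence == 1.0 lands in the top bin.
    bin_ids = np.digitize(confidence, edges[1:-1])
    points = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            points.append((confidence[mask].mean(), correct[mask].mean(), int(mask.sum())))
    return points

# Toy example reusing the six episodes from above.
conf = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
ok = np.array([1, 1, 0, 1, 0, 0])
print(calibration_curve(conf, ok, n_bins=4))
```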
Accuracy-Coverage Curves
[Per-agent accuracy-coverage curves for the same 13 models.]
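
The accuracy-coverage curves correspond to selective prediction: keep only the episodes the agent is most confident about and track accuracy as coverage grows. A sketch under the same assumed per-episode data:

```python
# Sketch of an accuracy-coverage curve: rank episodes by expressed confidence,
# then for the top-k most confident episodes report accuracy vs. coverage (k / N).
import numpy as np

def accuracy_coverage_curve(confidence, correct):
    """Return (coverage, accuracy) arrays for k = 1..N most-confident episodes."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-confidence)          # most confident first
    hits = correct[order]
    k = np.arange(1, len(hits) + 1)
    return k / len(hits), np.cumsum(hits) / k

# A well-ranked agent keeps accuracy high at low coverage and degrades smoothly
# toward its full-coverage accuracy (the "Acc" column in the leaderboard below).
conf = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
ok = np.array([1, 1, 0, 1, 0, 0])
cov, acc = accuracy_coverage_curve(conf, ok)
print(list(zip(cov.round(2), acc.round(2))))
```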
Agent Leaderboard — Predictability
 #  Agent              Acc    Predictability              Overall
                               Agg    Cal    AUROC  Brier
 1  Claude Opus 4.5    58.4%  0.69   0.71   0.67   0.69   0.80
 2  Claude Sonnet 4.5  54.4%  0.66   0.68   0.61   0.66   0.79
 3  GPT-5.2 (xhigh)    51.6%  0.65   0.68   0.62   0.65   0.76
 4  o1                 49.6%  0.64   0.74   0.50   0.64   0.76
 5  Gemini 3.0 Pro     58.8%  0.60   0.60   0.52   0.60   0.77
 6  Gemini 2.5 Pro     52.8%  0.56   0.56   0.58   0.56   0.74
 7  GPT-5.2            42.0%  0.55   0.55   0.56   0.55   0.75
 8  Claude 3.7 Sonnet  43.6%  0.54   0.54   0.53   0.54   0.72
 9  Gemini 2.5 Flash   47.2%  0.52   0.52   0.54   0.52   0.70
10  Claude 3.5 Haiku   29.6%  0.46   0.45   0.44   0.46   0.68
11  GPT-4 Turbo        35.6%  0.38   0.38   0.47   0.38   0.69
12  Gemini 2.0 Flash   32.0%  0.38   0.36   0.61   0.38   0.67
13  GPT-4o mini        21.3%  0.32   0.29   0.48   0.32   0.67