Predictability on τ-bench (airline, original)

How well does the agent's expressed confidence predict whether it will answer correctly?
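
To make the question concrete: given one (confidence, correct) pair per question, the AUROC and Brier columns in the leaderboard below can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual scoring code; the benchmark's exact definitions may differ.

```python
# Sketch: scoring how well stated confidence predicts correctness,
# from per-question (confidence, correct) pairs. Illustrative only.

def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    n = len(confidences)
    return sum((c - int(ok)) ** 2 for c, ok in zip(confidences, correct)) / n

def auroc(confidences, correct):
    """Probability a correct answer outranks an incorrect one (ties count 0.5)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

confs = [0.9, 0.8, 0.6, 0.4, 0.3]
right = [True, True, False, True, False]
print(round(brier_score(confs, right), 3))  # 0.172 (lower is better)
print(round(auroc(confs, right), 3))        # 0.833 (0.5 is chance)
```

Note that a raw Brier score is lower-is-better, while the leaderboard's Brier column appears to track the aggregate score, so it is presumably rescaled there.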

Sub-metric Comparison
Calibration Curves
[Per-model calibration plots: Claude 3.5 Haiku, Claude 3.7 Sonnet, Claude 4.5 Opus, Claude 4.5 Sonnet, GPT 5.2, GPT 5.2 (xhigh), GPT-4 Turbo, GPT-4o mini, Gemini 2.0 Flash, Gemini 2.5 Flash, Gemini 2.5 Pro, Gemini 3.0 Pro, o1 (interactive charts omitted)]
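
A calibration curve like the ones above is built by binning predictions by stated confidence and comparing each bin's mean confidence with its empirical accuracy. A minimal sketch, where the bin count and edge handling are arbitrary choices rather than the benchmark's:

```python
# Sketch: build calibration-curve points from (confidence, correct) pairs
# by equal-width confidence bins. Illustrative only.

def calibration_curve(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        i = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[i].append((c, ok))
    curve = []
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            curve.append((mean_conf, accuracy))
    return curve  # perfectly calibrated points lie on the diagonal y = x

points = calibration_curve([0.95, 0.9, 0.55, 0.5, 0.1],
                           [True, True, True, False, False])
# three occupied bins: low confidence/low accuracy up to high/high
```
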
Accuracy-Coverage Curves
[Per-model accuracy-coverage plots for the same thirteen models (interactive charts omitted)]
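
An accuracy-coverage curve asks: if the agent only attempts the questions it is most confident about, how accurate is it at each coverage level? A rough sketch of the construction, which may differ in detail from the benchmark's:

```python
# Sketch: accuracy at each coverage level when questions are attempted
# in descending order of stated confidence. Illustrative only.

def accuracy_coverage(confidences, correct):
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve, hits = [], 0
    for k, i in enumerate(order, start=1):
        hits += correct[i]
        curve.append((k / len(order), hits / k))  # (coverage, accuracy)
    return curve

# For a predictable agent the curve falls smoothly from high accuracy at
# low coverage toward its overall accuracy at coverage 1.0.
curve = accuracy_coverage([0.9, 0.2, 0.7, 0.4], [True, False, True, True])
```
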
Agent Leaderboard — Predictability
 #  Agent              Acc     Reliability   Predictability
                                             Agg    Cal    AUROC  Brier
 1  Claude Opus 4.5    58.4%   0.80          0.69   0.71   0.67   0.69
 2  Claude Sonnet 4.5  54.4%   0.79          0.66   0.68   0.61   0.66
 3  GPT-5.2 (xhigh)    51.6%   0.76          0.65   0.68   0.62   0.65
 4  o1                 49.6%   0.76          0.64   0.74   0.50   0.64
 5  Gemini 3.0 Pro     58.8%   0.77          0.60   0.60   0.52   0.60
 6  Gemini 2.5 Pro     52.8%   0.74          0.56   0.56   0.58   0.56
 7  GPT-5.2            42.0%   0.75          0.55   0.55   0.56   0.55
 8  Claude 3.7 Sonnet  43.6%   0.72          0.54   0.54   0.53   0.54
 9  Gemini 2.5 Flash   47.2%   0.70          0.52   0.52   0.54   0.52
10  Claude 3.5 Haiku   29.6%   0.68          0.46   0.45   0.44   0.46
11  GPT-4 Turbo        35.6%   0.69          0.38   0.38   0.47   0.38
12  Gemini 2.0 Flash   32.0%   0.67          0.38   0.36   0.61   0.38
13  GPT-4o mini        21.3%   0.67          0.32   0.29   0.48   0.32