Predictability on τ-bench (airline, clean)

How well does the agent's expressed confidence predict whether it will answer correctly?

[Interactive figures: Sub-metric Comparison, Calibration Curves, and Accuracy-Coverage Curves — one curve per agent, from Claude 3.5 Haiku through o1]
Agent Leaderboard — Predictability
| # | Agent | Acc | Reliability | Predictability (Agg) | Cal | AUROC | Brier |
|---|-------|-----|-------------|----------------------|-----|-------|-------|
| 1 | Claude Opus 4.5 | 83.1% | 0.88 | 0.87 | 0.93 | 0.68 | 0.87 |
| 2 | Claude Sonnet 4.5 | 78.5% | 0.85 | 0.84 | 0.90 | 0.68 | 0.84 |
| 3 | Gemini 3.0 Pro | 80.8% | 0.85 | 0.81 | 0.82 | 0.52 | 0.81 |
| 4 | GPT-5.2 (xhigh) | 67.7% | 0.81 | 0.78 | 0.81 | 0.75 | 0.78 |
| 5 | Gemini 2.5 Pro | 73.8% | 0.79 | 0.77 | 0.77 | 0.70 | 0.77 |
| 6 | o1 | 72.3% | 0.80 | 0.75 | 0.77 | 0.45 | 0.75 |
| 7 | GPT-5.2 | 59.2% | 0.80 | 0.68 | 0.71 | 0.62 | 0.68 |
| 8 | Claude 3.7 Sonnet | 56.2% | 0.75 | 0.63 | 0.65 | 0.48 | 0.63 |
| 9 | Gemini 2.5 Flash | 65.4% | 0.73 | 0.61 | 0.62 | 0.53 | 0.61 |
| 10 | Claude 3.5 Haiku | 42.3% | 0.68 | 0.53 | 0.53 | 0.42 | 0.53 |
| 11 | GPT-4 Turbo | 50.0% | 0.71 | 0.51 | 0.52 | 0.45 | 0.51 |
| 12 | Gemini 2.0 Flash | 44.6% | 0.70 | 0.48 | 0.46 | 0.56 | 0.48 |
| 13 | GPT-4o mini | 32.1% | 0.69 | 0.41 | 0.39 | 0.48 | 0.41 |
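The Cal, AUROC, and Brier columns above are standard measures of how well per-task confidence tracks correctness. A minimal sketch of the textbook definitions follows; it is not the benchmark's official scoring code, and the bin count in the calibration-error helper is an assumption.

```python
# Sketch, not the leaderboard's scoring code: given one (confidence, correct)
# pair per task, compute the three predictability sub-metrics.

def brier_score(confidences, correct):
    """Mean squared error between confidence and the 0/1 outcome (lower is better)."""
    return sum((c - o) ** 2 for c, o in zip(confidences, correct)) / len(correct)

def auroc(confidences, correct):
    """Probability that a correct answer receives higher confidence than an
    incorrect one (ties count half), straight from the rank definition."""
    pos = [c for c, o in zip(confidences, correct) if o == 1]
    neg = [c for c, o in zip(confidences, correct) if o == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted gap between mean confidence and accuracy per equal-width bin.
    n_bins=10 is an assumed choice, not taken from the benchmark."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into the top bin
        bins[idx].append((c, o))
    n = len(correct)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(o for _, o in b) / len(b)
            ece += (len(b) / n) * abs(avg_conf - acc)
    return ece
```

Note that a perfectly calibrated but uninformative agent (always 50% confident, right half the time) scores well on calibration error yet only 0.5 on AUROC, which is why the leaderboard aggregates several sub-metrics rather than relying on one.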