Predictability on τ-bench (airline, clean)
How well does the agent's expressed confidence predict whether it will answer correctly?
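Predictability is scored from per-task (confidence, correctness) pairs using the three sub-metrics shown in the leaderboard below: calibration, AUROC, and Brier. A minimal sketch of each, assuming confidences in [0, 1] and binary correctness labels; the function names and the 10-bin choice are illustrative, not τ-bench's exact code:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Gap between stated confidence and observed accuracy, averaged over bins."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so conf == 1.0 is counted.
        mask = (conf >= lo) & ((conf < hi) if hi < 1.0 else (conf <= hi))
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece  # 0 = perfectly calibrated

def auroc(conf, correct):
    """Probability a random correct answer gets higher confidence than a wrong one."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, bool)
    pos, neg = conf[correct], conf[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    # Mann-Whitney U statistic, counting ties as half wins.
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def brier(conf, correct):
    """Mean squared error of confidence as a probability forecast (lower is better)."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    return float(np.mean((conf - correct) ** 2))
```

Note that raw calibration error and Brier score are lower-is-better; the Cal and Brier columns in the leaderboard below appear to be reported on a higher-is-better scale, though the exact transform is not stated on this page.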
Sub-metric Comparison
Calibration Curves
(Interactive chart: per-agent reliability diagrams for Claude 3.5 Haiku, Claude 3.7 Sonnet, Claude 4.5 Opus, Claude 4.5 Sonnet, GPT 5.2, GPT 5.2 (xhigh), GPT-4 Turbo, GPT-4o mini, Gemini 2.0 Flash, Gemini 2.5 Flash, Gemini 2.5 Pro, Gemini 3.0 Pro, and o1.)
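The reliability curves above can be reproduced from the same (confidence, correctness) pairs. A sketch assuming equal-width confidence bins; the binning strategy is an assumption, as the page does not specify it:

```python
import numpy as np

def calibration_curve(conf, correct, n_bins=10):
    """Return (mean confidence, observed accuracy, count) per bin for plotting."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    # Map each confidence to a bin index; conf == 1.0 falls into the top bin.
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    xs, ys, ns = [], [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            xs.append(conf[mask].mean())
            ys.append(correct[mask].mean())
            ns.append(int(mask.sum()))
    return xs, ys, ns
```

A perfectly calibrated agent lies on the y = x diagonal; points above it indicate underconfidence, points below it overconfidence.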
Accuracy-Coverage Curves
(Interactive chart: selective-prediction accuracy vs. answer coverage for the same thirteen agents.)
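The accuracy-coverage curves answer a related question: if the agent answered only its most confident tasks, how accurate would it be at each coverage level? A sketch assuming tasks are ranked by stated confidence; illustrative, not the benchmark's exact implementation:

```python
import numpy as np

def accuracy_coverage_curve(conf, correct):
    """At each coverage level, accuracy over the most-confident fraction of tasks."""
    order = np.argsort(-np.asarray(conf, float))  # most confident first
    correct = np.asarray(correct, float)[order]
    n = len(correct)
    coverage = np.arange(1, n + 1) / n
    cum_acc = np.cumsum(correct) / np.arange(1, n + 1)
    return coverage, cum_acc
```

An agent whose confidence carries real signal keeps accuracy high as coverage grows; a flat curve means its confidence is uninformative about correctness.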
Agent Leaderboard — Predictability
| # | Agent | Acc | Predictability Agg | Cal | AUROC | Brier | Overall |
|---|---|---|---|---|---|---|---|
| 1 | | 83.1% | 0.87 | 0.93 | 0.68 | 0.87 | 0.88 |
| 2 | | 78.5% | 0.84 | 0.90 | 0.68 | 0.84 | 0.85 |
| 3 | | 80.8% | 0.81 | 0.82 | 0.52 | 0.81 | 0.85 |
| 4 | | 67.7% | 0.78 | 0.81 | 0.75 | 0.78 | 0.81 |
| 5 | | 73.8% | 0.77 | 0.77 | 0.70 | 0.77 | 0.79 |
| 6 | | 72.3% | 0.75 | 0.77 | 0.45 | 0.75 | 0.80 |
| 7 | | 59.2% | 0.68 | 0.71 | 0.62 | 0.68 | 0.80 |
| 8 | | 56.2% | 0.63 | 0.65 | 0.48 | 0.63 | 0.75 |
| 9 | | 65.4% | 0.61 | 0.62 | 0.53 | 0.61 | 0.73 |
| 10 | | 42.3% | 0.53 | 0.53 | 0.42 | 0.53 | 0.68 |
| 11 | | 50.0% | 0.51 | 0.52 | 0.45 | 0.51 | 0.71 |
| 12 | | 44.6% | 0.48 | 0.46 | 0.56 | 0.48 | 0.70 |
| 13 | | 32.1% | 0.41 | 0.39 | 0.48 | 0.41 | 0.69 |