Consistency on τ-bench (airline, clean)

How reproducible are the agent's answers and trajectories across repeated runs on the same task?
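To make the idea concrete, one natural outcome-level definition (an illustrative assumption here, not necessarily the benchmark's exact formula) is the mean pairwise agreement of a task's pass/fail outcome across repeated runs:

```python
from itertools import combinations

def outcome_consistency(outcomes: list[bool]) -> float:
    """Fraction of run pairs that agree on the task's pass/fail outcome.

    `outcomes` holds one boolean per repeated run of the same task.
    Returns 1.0 when every run agrees, lower when runs disagree.
    (Illustrative definition; the benchmark's exact formula may differ.)
    """
    pairs = list(combinations(range(len(outcomes)), 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent
    agree = sum(outcomes[i] == outcomes[j] for i, j in pairs)
    return agree / len(pairs)

# 4 runs alternating pass/fail: only 2 of 6 pairs agree
print(outcome_consistency([True, False, True, False]))  # ≈ 0.33
```

A per-agent score would then average this quantity over all tasks; the trajectory and resource sub-metrics apply the same "agreement across runs" idea to action sequences and tool/resource usage rather than final outcomes.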

[Chart: Sub-metric Comparison]
[Chart: Resource Consistency Breakdown]
[Heatmap: Per-Agent Task Outcome Consistency — each cell is a task; color indicates how consistent that task's outcome is across repeated runs.]

Agent Leaderboard — Consistency
Acc = task accuracy. Consistency sub-metrics: Agg (aggregate), Outc (outcome), Traj-D and Traj-S (trajectory, two variants), Res (resource); Overall is the combined score.

 #  Agent               Acc    Agg   Outc  Traj-D  Traj-S  Res   Overall
 1  Claude Opus 4.5     83.1%  0.82  0.77  0.88    0.79    0.85  0.88
 2  GPT-5.2             59.2%  0.76  0.65  0.86    0.76    0.83  0.80
 3  Gemini 3.0 Pro      80.8%  0.76  0.65  0.85    0.76    0.82  0.85
 4  GPT-4o Mini         32.1%  0.76  0.69  0.84    0.73    0.80  0.69
 5  O1                  72.3%  0.73  0.58  0.87    0.75    0.80  0.80
 6  Claude Sonnet 4.5   78.5%  0.72  0.50  0.85    0.77    0.85  0.85
 7  GPT-4 Turbo         50.0%  0.72  0.54  0.85    0.73    0.83  0.71
 8  GPT-5.2 (xhigh)     67.7%  0.70  0.54  0.85    0.73    0.77  0.81
 9  Gemini 2.5 Pro      73.8%  0.68  0.46  0.84    0.73    0.81  0.79
10  Claude 3.5 Haiku    42.3%  0.66  0.42  0.82    0.70    0.80  0.68
11  Gemini 2.0 Flash    44.6%  0.65  0.35  0.89    0.73    0.80  0.70
12  Claude 3.7 Sonnet   56.2%  0.65  0.35  0.84    0.74    0.81  0.75
13  Gemini 2.5 Flash    65.4%  0.62  0.31  0.85    0.72    0.77  0.73